POPULARITY
We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. We think gradient routing has the potential to train safer AI systems, for example, by making them more transparent, or by enabling the removal or monitoring of sensitive capabilities.In this post, we: Show how to implement gradient routing.Briefly state the main results from our paper, on... Controlling the latent space learned by an MNIST autoencoder so that different subspaces specialize to different digits;Localizing computation in language models: (a) inducing axis-aligned features and (b) demonstrating that information can be localized then removed by ablation, even when data is imperfectly labeled; andScaling oversight to efficiently train a reinforcement learning policy even with [...] ---Outline:(01:48) Gradient routing(03:02) MNIST latent space splitting(04:31) Localizing capabilities in language models(04:36) Steering scalar(05:46) Robust unlearning(09:06) Unlearning virology(10:38) Scalable oversight via localization(15:28) Key takeaways(15:32) Absorption(17:04) Localization avoids Goodharting(18:02) Key limitations(19:47) Alignment implications(19:51) Robust removal of harmful capabilities(20:19) Scalable oversight(21:36) Specialized AI(22:52) ConclusionThe original text contained 1 footnote which was omitted from this narration. --- First published: December 6th, 2024 Source: https://www.lesswrong.com/posts/nLRKKCTtwQgvozLTN/gradient-routing-masking-gradients-to-localize-computation --- Narrated by TYPE III AUDIO. ---Images from the article:
It's return guest season here at Latent Space! We last talked to Kanjun in October and Jonathan in May (and December post Databricks acquisition): Imbue and Databricks are back for a rare treat: a double-header interview talking about DBRX from Databricks and Imbue 70B, a new internal LLM that “outperforms GPT-4o” zero-shot on a range of reasoning and coding-related benchmarks and datasets, while using 7x less data than Llama 3 70B.While Imbue, being an agents company rather than a model provider, are not releasing their models today, they are releasing almost everything else: * Cleaned-up and extended versions of 11 of the most popular NLP reasoning benchmarks* An entirely new code-focused reasoning benchmark* A fine-tuned 70B model, built with Meta Llama 3, to identify ambiguity* A new dataset of 450,000 human judgments about ambiguity* Infrastructure scripts for bringing a cluster from bare metal to robust, high performance training* Our cost-aware hyperparameter optimizer, CARBS, which automatically and systematically fine-tunes all hyperparameters to derive optimum performance for models of any sizeAs well as EXTREMELY detailed posts on the infrastructure needs, hyperparameter search, and clean versions of the sorry state of industry standard benchmarks. This means for the FIRST TIME (perhaps since Meta's OPT-175B in 2022?) you have this level of educational detail into the hardware and ML nitty gritty of training extremely large LLMs, and if you are in fact training LLMs of this scale you now have evals, optimizers, scripts, and human data/benchmarks you can use to move the industry forward together with Imbue.We are busy running the sold-out AI Engineer World's Fair today, and so are unable to do our usual quality writeup, however, please enjoy our show notes and the excellent conversation! Thanks also to Kanjun, Ashley, Tom and the rest of team Imbue for setting up this interview behind the scenes.Video podTimestamps* [00:00:00] Introduction and catch up with guests* [00:01:55] Databricks' text to image model release* [00:03:46] Details about the DBRX model* [00:05:26] Imbue's infrastructure, evaluation, and hyperparameter optimizer releases* [00:09:18] Challenges of training foundation models and getting infrastructure to work* [00:12:03] Details of Imbue's cluster setup* [00:18:53] Process of bringing machines online and common failures* [00:22:52] Health checks and monitoring for the cluster* [00:25:06] Typical timelines and team composition for setting up a cluster* [00:27:24] Monitoring GPU utilization and performance* [00:29:39] Open source tools and libraries used* [00:32:33] Reproducibility and portability of cluster setup* [00:35:57] Infrastructure changes needed for different model architectures* [00:40:49] Imbue's focus on text-only models for coding and reasoning* [00:42:26] CARBS hyperparameter tuner and cost-aware optimization* [00:51:01] Emergence and CARBS* [00:53:18] Evaluation datasets and reproducing them with high quality* [00:58:40] Challenges of evaluating on more realistic tasks* [01:06:01] Abstract reasoning benchmarks like ARC* [01:10:13] Long context evaluation and needle-in-a-haystack tasks* [01:13:50] Function calling and tool use evaluation* [01:19:19] Imbue's future plans for coding and reasoning applications* [01:20:14] Databricks' future plans for useful applications and upcoming blog postsTranscriptSWYX [00:00:00]: Welcome to the Latent Space Podcast, another super special edition. Today, we have sort of like a two-header. John Frankel from Mosaic Databricks, or Databricks Mosaic, and Josh Albrecht from MBU. Welcome.JOSH [00:00:12]: Hey, glad to be here.SWYX [00:00:14]: Thank you for having us. Hey, so both of you are kind of past guests. Jonathan, you were actually one of the most popular episodes from last year talking about MPT7B. Remember the days when we trained large models and there was 7B?JONATHAN [00:00:30]: Yeah, back when reproducing LLAMA1-7B was considered a huge accomplishment for the field. Those are the good old days. I miss that.SWYX [00:00:38]: As the things have accelerated a lot. Actually, let's do a quick catch up and Josh, you can chime on in as well. So Databricks got acquired. I talked to you at New York.JONATHAN [00:00:45]: Mosaic got acquired, although sometimes it feels like Mosaic acquired Databricks because, you know, we're having a lot of fun being here. But, you know, yeah.SWYX [00:00:52]: Yeah. I mean, you are chief scientist now of Databricks.JONATHAN [00:00:55]: Chief AI scientist. Careful with the title. As much as I would love to understand how Spark works, I'm going to have to defer that to much smarter people than me.SWYX [00:01:03]: Got it. And I don't know about like what you would highlight so far as a post-acquisition, but the most recent news is that you guys released DBRX. Is that the thing that most people should be aware of?JONATHAN [00:01:13]: Actually, that's no longer the most recent news. Honestly, the most recent news, we announced this, but it was at our Data and AI Summit last week. So it was announced among like 100,000 other things, is that we finally released our text to image model, which has been a year in the making through a collaboration directly with Shutterstock. There was a lot of work put into finding a dataset that we were comfortable with working on and trying to build a model that honestly, I felt like I could trust and that others might be able to trust to put out in the world. So that model was released last week. It's unfortunately just available via API due to the fact that the data is quite sensitive and quite valuable. It's Shutterstock's entire business in a lot of ways, but I'm still really excited that there's now a model that is trained on a dataset where the provenance of every single image is known, and it's a damn good model. So I'm really proud of the team on that.SWYX [00:01:55]: Yeah, amazing. Josh, do you have any thoughts on image model questions?JOSH [00:01:59]: That is not my area of expertise, but I was excited to see the release of it last week as well, and very happy that you guys did a nice job on the data side of everything there. So that was cool to see.SWYX [00:02:09]: I think what's unusual is like, I think Shutterstock's doing multiple deals in multiple labs. So what is the Shutterstock model? Like, I guess, is this the house model for Shutterstock? Is this Databricks' version of the Shutterstock model? Like, what is this?JONATHAN [00:02:22]: The way that I would think about it is that Shutterstock is doing an amazing business in AI across the board. Their dataset is kind of widely known to be the best stock photos dataset in the world, the most comprehensive, the biggest. When you think about like, what dataset am I going to train a multimodal model on? You call Shutterstock. And I, at least I've heard in the news, like OpenAI, Google, Meta, Apple have all called Shutterstock and made those deals. So a lot of models have had Shutterstock data incorporated into them. But this is the only model I know of so far where it was, you know, exclusively and specifically trained just on the vanilla Shutterstock data. There was nothing else mixed in. We didn't go and scrape the web and find other data or combined datasets or anything like that. And so this is, in some sense, the house blend. But the other piece is that it's just a dataset where the provenance of every image is known in public. Where did the data come from? It is the Shutterstock collection. That's it. You know, nothing less, nothing more. And certainly being at Databricks, if I've learned one thing, I've learned about enterprise customers and what they want out of AI. And one of the things they ask for most is just, what can you tell me about the data the model was trained on? And here, especially for text to image models, where images are just tricky subject matter, there's been a lot of kind of legal conversation about images, especially. It's nice to just have something where I can point to it and say, you know, if you want to know where the images came from, these are what they are and this is how they got there.SWYX [00:03:36]: I will talk a little bit about Databricks because it's relevant to the rest of today's episode. So Databricks, sorry, I keep misspeaking. It's DBRX.JONATHAN [00:03:46]: DBRX, actually, there's been a pronunciation update. It is now D-B-Rex. So we have decided to add a dinosaur mascot because what model doesn't like a mascot? So literally, I wish I could pull it up. There is a little plush dinosaur that we had made. It's like the world's cutest dinosaur, but it is the official mascot of D-B-Rex. And there's a little dinosaur logo that, you know, you'll probably see around a little bit more because DBRX is a mouthful, but D-B-Rex, like, you know, it's just kind of...SWYX [00:04:13]: Rolls off the tongue. I love mascots. Like every company should have a mascot. And I think Hugging Face got it right. You need an emoji mascot because that's the minimal viable image.JONATHAN [00:04:21]: I probably shouldn't talk at all about, you know, Velociraptor, but, you know, that's a, maybe that's something we can talk about later in the summer. I'll just leave it at that.SWYX [00:04:28]: Okay. That's a hint to names. I feel like your names leak a lot of alpha. So just to quickly cover the headline details, DBRX, as Make Sure Experts model, that's fairly big, 132 billion total parameters, so 36 billion active on any input, pre-trained on 12 trillion tokens of text and code, and did really well on evals to the point where you had to dye your hair blue. That's my high level conclusion.JONATHAN [00:04:53]: Never make a bet with your team two weeks out from model launch, even when, you know, human eval is looking quite bad. Because if you set some bar, even if it's arbitrary and you think there's no way in hell they're going to hit it, apparently money doesn't motivate people anymore. Humiliating their boss motivates people. So Josh, you should really take a hint from this. You know, you cannot pay someone enough money to make up for you dyeing your hair blue.JOSH [00:05:15]: I'll keep that in mind for our next model.SWYX [00:05:17]: It works. So speaking of Imbue's next model, perhaps Josh, you want to actually just say hi to the general sort of latent space audience and talk about what we're releasing today. Yeah.JOSH [00:05:26]: I'm Josh, CTO of Imbue, and we're not releasing the model. We're not releasing the weights, but we are releasing a bunch of different things that should make it easier for other people to make their own models. So I think right now, training foundation models from scratch is like a very difficult, time-consuming, expensive, kind of risky endeavor, especially for smaller companies. And the things that we're releasing hopefully make that at least a little bit easier. So the things that we're releasing fall into kind of three different buckets. One is infrastructure and scripts for dealing with the kind of hardware and hardware failures and understanding how well is the actually lowest level of thing actually working so that you can actually do your training at all and at a reasonable speed without having to constantly restart, etc. So infrastructure and training scripts. A second set of things is around the evaluation. So after you've trained it, like how well is this actually working and how do you know how well it's working? We're releasing a whole bunch of different data there, a new benchmark about code, reasoning, understanding, as well as our own private versions of 11 different open source benchmarks. So things like pool queue or ANLI, where we've gone through and kind of cleaned up the data as much as possible by looking at all the ones that models get wrong or that are flagged for ambiguity and also our own kind of private reproductions of those where we've done like a kind of clean room black box, like, okay, this is what the data set is supposed to be. Here are some examples. Let's make our own version of this to make sure that there is no data contamination, etc. To make sure that we're actually, you know, not testing on train. And then I think a final thing that we're releasing there is around 450,000 human judgments about ambiguity and question quality, which we used in the process of cleaning these evaluations and we also hope will be helpful for other people training kind of similar models. And then the third thing is CARBS, our hyperparameter, our cost-aware hyperparameter optimizer, which was especially helpful for being able to experiment at much smaller scales and then scale those experiments up to the much larger scale kind of on the first try without having to retry it. You don't want to be training, you know, 10, 20 different 70B models. You really want to get these larger modelsSWYX [00:07:30]: right on the first try.JOSH [00:07:30]: And so the ability to kind of tune things very precisely and learn scaling laws, not just for, you know, the like data and flops, but also for learning rate and all the other hyperparameters and see like how should you scale these things up was extremely valuable to us as we were training the larger models. Yeah, that's a lot of stuff.SWYX [00:07:49]: Yeah, exactly. So there's a bunch of stuffJOSH [00:07:50]: we'll have to go through all of it.JONATHAN [00:07:52]: Yeah, I just want to throw in how excited I am about this. This is the stuff that nobody ever talks about. That is the difference between success and failure in this stuff. Like, can you get your cluster to run? Can you get software on your cluster? Can you figure out what broke? Because fault tolerance is still not really built into any of the fundamental primitives of training models. And so if something breaks, you have to go figure out what broke, your job stops, you have to restart your job. It is a nightmare just to get to the point where anything can train on the cluster. A basic MPI hello world that has the GPUs talk to each other is hard enough, let alone actually training a model, let alone getting good performance out of the GPUs, let alone actually getting a model that converges to anything interesting. There's so many levels of things you have to accomplish. This is the kind of stuff that matters. I think to a point that Josh made earlier, before we got on here, there are plenty of weights out there. Nobody's released this.JOSH [00:08:46]: Yeah, that was part of the motivation actually is that there are lots of other things that are complimentary, but I have not seen nearly as much discussion about some of these other things that we think are pretty important. I mean, in some sense,SWYX [00:08:56]: I'm very excited to have Jonathan on because this is a little bit, you're a bread and butter with Mosaic. And I think you've released some part with Composer. And I think it's just really interesting to see like a different take, basically a full stack take that's kind of open source today.JONATHAN [00:09:18]: Yeah, it's really kind of, it's been an ordeal to figure this out. And every time something changes, whether it's a new GPU or even a new driver update, you get new creative errors and new things go wrong. And, you know, we've dealt with the weirdest things from, you know, our InfiniBand cables getting stolen from the data center twice, like in boxes before they arrived at the data center. Like, you know, Porch Pirate basically had stolen our InfiniBand cables back when those were hard to come by. To like, you know, weird recalls of switches to like the strangest stuff has happened. I have my favorite GPU failures I've seen, like ones where the GPU doesn't fail, it has a correctable memory issue and the memory correction causes the GPU to become a straggler and hold up the whole job. Like weird stuff happens and figuring out how to not just identify all of that, but then eventually productize it, is in some sense, the entire story of Mosaic and now Databricks in terms of our ML offering. Really, the thing we offer is we have gone through this suffering and figured out how to even productize that. It has been a pain in the butt.SWYX [00:10:20]: Yeah, it's a lot of work.JOSH [00:10:20]: I think my favorite failure was GPU is just giving wrong math. Like if they give errors, great, because you can see the errors, but if they just give you the wrong math back, not so fun.SWYX [00:10:30]: When did they give you wrong math?JOSH [00:10:32]: Like literally you could just, you know, add two things. For example, the numbers come back. They're not the numbers that they're supposed to be.JONATHAN [00:10:40]: I think it's important to say at this stage, just because like it, I think it goes without saying for Josh and I, but it's worth saying here, this isn't to say that like anything is wrong with us. It's not like NVIDIA did a bad job or, you know, Mellanox did a bad job or the like the server builder, the data center operator, the cloud provider, like the million other parties that are involved in building this. We are running these insane chips that are huge and complicated and built on tiny transistors at insane frequencies with insane heat in data centers that for the most part, were not built remotely for this kind of power or heat and have been retrofitted for this. Like failures happen on a good day with normal CPUs. And this is not a good day and not a normal CPU for the most part. It's fun to joke about all the weird things we see. This is not to say anybody's done anything wrong. This is just kind of part and parcel of working on a massive cluster running at multiple megawatts of power at a time.SWYX [00:11:32]: It's crazy. Yeah.JONATHAN [00:11:33]: So optical cables, like all sorts, like everything.SWYX [00:11:37]: I'll take the opportunity to start going to the sort of infra piece. There's just like a description of the infra just to give people a sense of what we talk about when we talk about massive clusters. So I'm just going to read off the blog post here. This post is about one cluster that has 4,092 H100 GPUs spread across 511 computers. They use unified fabric manager nodes, which manage the infinite band network. And you talk a little bit about your networking. Is there anything unusual about this setup that you'll call out to people?JOSH [00:12:03]: Yeah, actually this particular cluster is a little bit non-standard. The normal, like vanilla setup for these large clusters as vanilla as it can be is what's normally like a 127 node cluster. So closer to like 1024 GPUs instead of 4,000. Here we have a larger cluster. As you start to get into the larger clusters, the networking becomes a little bit more custom. It's a little bit more, it's a little bit trickier. It's a little bit more difficult to get these things to all be able to talk to each other at the same speed. And so this has, in this particular case, this is a three tier network architecture instead of two tiers, kind of the normal one. So most of the clusters are a little bit smaller. As you get to even larger scales, then this becomes even much more complicated,SWYX [00:12:43]: much more expensive.JOSH [00:12:43]: So we chose this particular scale, kind of knowing our own workloads and kind of what we wanted to do. This was kind of the right size for us. But yeah, I think it's not exactly vanilla already. It's already getting into kind of the custom territory.SWYX [00:12:54]: So my understanding is that there, and is there any part of this that comes with the Voltage Park deal that you guys had? Is that part of the hardware that you got from the deal with them?JOSH [00:13:04]: Yeah, so we worked really closely with Voltage Park to set up all their clusters and infrastructure and everything and kind of decide even like what to order, how should the networking work? Like we were very involved in kind of the construction and bring up of this. And that's what this post is about, is about that process of like bringing up all these, there's like different clusters in different places of different scales. So in this particular post, we're talking about this one 4096 GPU, but there are other clusters that they have as well. And we were very closely involved with figuring out the exact architecture and kind of the trade-offs that go along with picking, you know, those exact components. You really don't want to like place the wrong order because it takes months to get it and it's very expensive. So yeah, we were happy to help out with that.JONATHAN [00:13:43]: And then your bit of good cables get stolen.SWYX [00:13:44]: Yeah, yeah, exactly.JOSH [00:13:47]: We wanted to make sure that we ended up with compute that would work for us and that would also work for their other customers. And so we kind of helped design something so that we would get exactly what we were looking for. We knew that these kinds of details would be super important and that getting down to the level of the hardware and like having these good scripts and everything was going to be a core part of like actually getting this to work. I'm very glad that we did that. I don't think that most companies kind of take that full stack approach, but for us, it certainly paid off.SWYX [00:14:12]: Yeah, it's basically sort of built to spec. It's interesting that relationship because you usually, for the rest of us who don't operate at your scale, we take whatever we can get from cloud providers, but you are basically co-designing from the single machine up. And you described that a little bit. Do you want to take us through the process that you described here?JOSH [00:14:27]: Yeah, so for the actual, like the blog post and kind of bringing these machines online.SWYX [00:14:32]: Yeah.JOSH [00:14:32]: So yeah, I think the process, as we have it broken down in the blog post, there's kind of a few different layers. First is like getting the individual machines to work at all and then getting the machines to actually be able to talk to each other. So getting the InfiniBand networking to work and then getting to a point where, you know, not just the machines are working and they can talk to each other, but everything is actually working correctly. There's a big gap between like it's working at all to it's working perfectly correctly. And then after you have all this stuff working perfectly correctly, nice and healthy, then now you get into kind of the software data, like training issues. And then after that, you're still not done. Like now, even once you're training at full speed, things are going to fail over time. Things are going to change. There's going to be new, you know, firmware updates. Like how do you kind of deal with this change and flux over time without going crazySWYX [00:15:16]: and pulling your hair out,JOSH [00:15:16]: trying to like reproduce things or understand why there were regressions. And so there's a lot of work to kind of automate the infrastructure tooling as well. And kind of the first step, like bringing these things online in the first place, you know, you have hundreds of machines at this point. So you don't necessarily want to be like walking around with like a CD-ROM or a USB drive, like plugging it in with your keyboard, like hitting next, next, next on the OS install. That's not how this works. You do that for one machine. And then you use, we use this thing called Metal as a Service to bring up all the other machines. So it's a kind of server that can kind of install the operating system on these other machines. So most like when you're talking about these machines, like each machine is, you know, on the order of hundreds of thousands of dollars. So they usually come with a kind of out-of-band management interface as well. So they don't, they have their InfiniBand networking. They have their normal 100 gigabit per second Ethernet networking. These are like dual, redundant, et cetera. And then you also have this extra out-of-band management network. So you can log in and you can see like the boot screen or you can see the blue screen of death. You can like get in there and actually see what was wrong, which is pretty fun. And it makes it like possible to automate a lot of this work. So the beginning of that, and the blog post goes into much more detail about like exactly how we set these up and kind of the other errors that we ran into. When you're bringing these online, you'll definitely have failures. Even if they all worked in the factory, they get shipped, some parts come loose, something fails, something goes wrong. So when you're bringing them online, there'll be some that don't quite work for all sorts of reasons. As you start to be working with machines at this scale, like if something happens one in a thousand times, you're like pretty likely to see it. And so you can get pretty rare, weird things, especially since we had fairly early builds and fairly early versions of this hardware. Like these are some of the like first machines that were ever produced, some of the first GPUs. So you've got some extra special things there. We definitely worked with Dell, for example, on making fixes in the firmware level to be like, okay, like this thing is wrong. Like we need to update this at the firmware to like actually fix this particular thing. So we worked pretty closely with Dell and Nvidia. Yeah, that's what I'm saying. Like this stuff gets complicated. And the thing is like, you know, taking a step back, the whole reason we're doing this, right, is that we knew that this was going to be complicated. There would be these kinds of failures. And if we're just using, you know, AWS or some other cloud provider, these errors are still gonna be there and you're gonna have no way to know and no way to debug this and no way to diagnose what's going wrong. And so we would much rather be able to like call up Dell and say, hey, this isn't working. And they're like, yep, okay, cool. Let's debug it together. Oh, I see. Yeah, cool. We'll ship a firmware update and actually fix this for you. That was a much better experience than like, great, just magically fails. I guess we restart and hope that that machine goes away. Like that's not a very good place to be. So yeah, that's kind of the first place is getting to a place where like GPU training is working on your single node machines. You can observe stuff. We have tons of tooling around like, you know, Prometheus and all sorts of other tools for understanding what's going on in these machines because you don't want to be like logging into each one and looking at the temperature or something you really need to have tooling to collect all these metrics, et cetera. Unfortunately, all of the scripts that we have for this are like for this entire cluster and for all this infrastructure are a little bit like special purpose for our particular thing. So it's not that every script that we have, it's not that you can just like take this and plug this in. Even if we did open source all the tooling that we have, you'd still have to do like a lot of work to open source it. What we are releasing is as many of the things that we can that are going to be useful for other people. You're still going to have to have some way of kind of managing these things, making your own like logging aggregators, et cetera, et cetera. So that's kind of bringing them up to the like, you know, the single nodes that are working. From there, it goes into, I'm happy to keep going if you want. Well, I just want to leave the opportunity for JohnSWYX [00:18:53]: to comment if there's anything that's different from how he runs things.JONATHAN [00:18:57]: Oh, I mean, all I'll say is I'll endorse this and say this s**t is hard. Like this is really, really hard. And, you know, I have a special props to, you know, the folks in Vue because they were building this from the ground up. You know, at Databricks and at Mosaic, we typically work with cloud providers because some of this stuff is just, there's too much to handle. It's complicated. There's a lot to deal with. And this doesn't even get into things like physical security, you know, securing power if you're the data center operator. Like this gets infinitely complicated and you have to abstract somewhere. Like, you know, and then you get to the folks who are literally building their own custom chips and like, good God.SWYX [00:19:36]: Like, oh my God, that's, you know,JONATHAN [00:19:38]: if you're one of those folks, you're having, you know, pour one out for the infra people at some of the AI chip startups who are having a really, really interesting time right now. But this stuff is really hard. And I don't think we talk about it much because there's so many other things that are hard. But the other hard things, I think everybody's becoming pretty familiar with at this point. This is something that I don't think there's ever really been a comprehensive discussion of, at least not that I've seen.SWYX [00:20:00]: Yeah, so my impression is that you guys, Mosaic, have your own software for sort of spinning up and down machines, just like Imbue had to build. But Imbue probably, it sounds like Imbue, you guys went fuller stack. I don't know how to describe it. Like Mosaic is not working with Dell on like their firmware.JONATHAN [00:20:21]: No, no, we're typically working with like, you know, pick your cloud provider on their Dell firmware or what have you. Like, it's kind of, I think one of the things, I don't know, Josh, you can correct me on this. It's kind of impossible if you're doing training to not go all the way through the entire stack, regardless of what happens. Like somehow I'm still chatting with cloud providers about power contracts, even though the whole point of dealing with the cloud provider is not to have to think about power contracts. Somehow I'm still asking them about which InfiniBand provider they used this time to see if this is part of the bad batch of cables I encountered on that cloud provider or what have you. Or like, we're still talking about a firmware update from pick your provider. You can't not do this. It's convenient that they have data center staff who are worrying about what to send back to which provider when, and they have people who can go and wait for the InfiniBand cables so they don't get stolen outside. But, you know, it's kind of, it's impossible not to really go full stack if you're thinking about the infrastructure at all. I don't know, Josh, correct me. No, I think that's right.JOSH [00:21:17]: That's what we expected from the beginning as well, is that we would inevitably have to get into the details here. And I'm glad that we kind of just planned for it. I think it made it a lot easier from our perspective to have direct control over this. Instead of having to go to the cloud provider that goes to the data center, that goes to the supplier, we could just go direct to NVIDIA or DellSWYX [00:21:37]: or the data center,JOSH [00:21:37]: whoever was responsible and be like, hey, this thing needs to change. And they're like, oh, okay. Yeah, that is our responsibility. Great, we can fix that. So it was just a lot easier for us to fix these bugs than if we had to go through an extra layer of email.SWYX [00:21:48]: Something we discussed in the pre-show was that you had a rule of thumb for your cluster of reliability. You say here in the post, by and large, you expect around 3% of your machines to break every week. So you're basically going to turn through all your machines in a year.JOSH [00:22:04]: As it says in the post. So that would be true if it was a uniform failure like that. But as it says in the post, it's usually these kind of problematic nodes. And to be clear, that is the number that we've heard from other people is like they're having about 3%. I don't think we're experiencing failure rates that are that high. I think ours is actually quite a bit lower than that, probably because we've taken the time to like dig into a large, maybe larger number than we should have of these failures and get to the root cause of it and be like, oh, okay, like that's exactly what's going wrong.SWYX [00:22:33]: How do we fix this?JOSH [00:22:33]: How do we prevent this from happening? How do we make automated checks for this so that if it does happen, it just goes back to whoever owns that particular part of the process and they can fix it immediately.SWYX [00:22:43]: And that's part of what you're also open sourcing, which is the health checks, right? You got the NIC health checks, GPU health check, this space health check, Docker D message. I don't know what that is.JOSH [00:22:52]: That one is just a lot of stuff.SWYX [00:22:54]: Yeah.JOSH [00:22:55]: That one is one where we realized that actually like when these machines boot, sometimes they wouldn't actually boot cleanly all the way. Or when they rebooted, they had problems that they didn't have when they were working before, which was kind of frustrating. Like usually if you restart your computer,SWYX [00:23:08]: it gets better.JOSH [00:23:08]: Here you restart. It did not get better.SWYX [00:23:10]: It got worse.JOSH [00:23:10]: That was very frustrating. So this health check looks at every particular line we've ever seen from the boot, like in D message, like every single log line that your computer emitsSWYX [00:23:21]: and says like,JOSH [00:23:21]: have we ever seen this before?SWYX [00:23:23]: Is this expected?JOSH [00:23:23]: Is this in the right order? Or is there something out of place? If there's anything out of place, let me say, okay, great. Like now it goes into this, like longer, more triage list of like, all right, great. Like, is this acceptable?SWYX [00:23:33]: Should we flag this?JOSH [00:23:33]: Like, should someone take a look at this? So we're looking down at a very, very granular detail level, what's happening on these computers to make sure that nothing is out of place. And that's critical because without that, if you're running your training, as Jonathan said, and this thing is slow, like what are you supposed to do? Right?SWYX [00:23:49]: Like you really,JOSH [00:23:49]: you really want to be very certain that like all 4,000 of these GPUs are working like they're supposed to.SWYX [00:23:54]: We know that.JOSH [00:23:54]: And so if it's slow, it's because like we messed up the config or something else and not because of this earlier thing that's like really hard to detect in software later.JONATHAN [00:24:01]: Yeah. I think the, I'm just curious to ask,SWYX [00:24:03]: like, you know,JONATHAN [00:24:03]: suppose you were to set up another, let's say another H100 cluster and it were at a different data center. And instead of the vendor being Dell, it was super micro or what have you. How much of this would be repeatable? And how much of this would you have to redo? I, you know, I genuinely don't know.SWYX [00:24:18]: A decent amount.JOSH [00:24:19]: I think it would go a lot faster the second time. I think there's lots of learnings that we had. And also the blog post,SWYX [00:24:24]: you know, yes,JOSH [00:24:24]: we are releasing the health checks, releasing some scripts, but a lot of the valuable stuff is also in the blog post itself, in the details and kind of the, you know, the learnings that we've had and the sort of errors that we run into. We tried to as much as possible surface those to other peopleSWYX [00:24:36]: could learn from thoseJOSH [00:24:36]: and avoid the same mistakes or failures as well. But I think it would go a lot faster.SWYX [00:24:41]: Although, yes,JOSH [00:24:41]: there would certainly be some things that'd be a little bit different. I mean, there'd probably be different CPUsSWYX [00:24:46]: or whatever,JOSH [00:24:46]: but I think a lot of that stuff is less,SWYX [00:24:49]: it's less,JOSH [00:24:49]: that's the like, that's less variable. I think most of it would apply the second time around. Although I'm sure next timeSWYX [00:24:56]: we're building one,JOSH [00:24:56]: it'll probably be, you know, at a scale that's 10x as big with a different chip or something like this.SWYX [00:25:00]: And then who knows?JOSH [00:25:01]: Yeah, with Kinect X8,JONATHAN [00:25:02]: that will have its own fun behavior and all that good stuff. Yeah.SWYX [00:25:06]: Perhaps there's something that people don't discuss about, and you don't even talk about this in the blog, but I always wonder is what is the timeline that's like kind of reasonable for this amount of work, at least the initial stages? And also what does the team composition look like for setting up a cluster, right? Like what are the mix of skills that you typically would require to get all this going?JOSH [00:25:27]: I'm, I can't really speak to typical. One thing I am very proud of is how much we accomplished with such a ridiculously small team. Like our infrastructure team is like, you know, fluctuates from week to week, depending on like how many things are on fire and how much we need to build. But it's like between like three and six people, like it's small. It's not like some huge team of like tons and tons of engineers. But those people are very, very good at what they do. And so that has allowed us to get a lot of mileage out of out of these things. I think it's not that we're building everything, right? It's not that three to six people build this whole thing. I definitely want to like, you know, say thanks very much to Dell and H5 and NVIDIA and the other people that have done a lot of the work, like to bring up this cluster, you know, with 4000 GPUs and three tier networking, networking architecture, you have 12,000 cables. So that's 24,000 things that need to be plugged in. Like that's just a lot of stuff to plug in, right? And you don't want to mess it up. Like each one needs to be done correctly. Like it's a little bit loose. Like it doesn't really work.SWYX [00:26:23]: If you break it,JOSH [00:26:23]: you need to replace it. Like there's a lot of workSWYX [00:26:26]: that goes into this.JOSH [00:26:27]: Yeah.SWYX [00:26:28]: And then, you know,JOSH [00:26:28]: that's just like that's it. That's if you were to do everything right the first time.SWYX [00:26:32]: And if you didn'tJOSH [00:26:32]: have to fix anything. But inevitably, you know, you will have to replace something, which means like taking all the wires out, pulling the thing out, taking all the GPUs out, going and fixing some cable, putting it all back correctly, putting it back in, doing this every time. So there were a lot of people at Dell, NVIDIA and at H5 that all helped a ton with this stuff. I don't know the exact size of the Dell team. It also fluctuated over time.SWYX [00:26:55]: Yeah, excellent. And then, you know, you so you have all the hardware set up and now you're firing it up for a single node. There's a long description that you guys have about just like monitoring the MFU, right? And what each situation might look might be indicative of. One of the most interesting things to me that I saw from here is like, you know, if training immediately starts off at 60 to 80% MFU, something's wrong.SWYX [00:27:24]: But like, you know, like what what are like, you know, some anecdotes or, you know, notable scenarios here that you might you might call out as maybe counterintuitive or super interesting.JOSH [00:27:36]: There's just so many of them. I mean, one of them, which I think is probably pretty common, like common knowledge by this point. But like we did have a sort of likeSWYX [00:27:46]: which one was this exactly?JOSH [00:27:47]: I think for the MFU, like gradually getting worse over time. I think that one, when we saw that the first time we were like, what the heck is going on? Like, why does it get just like a little bit worse? This is so strange. Like, what is it getting lazy or tired or something? Like, is it heat? Like what's going on? And in this particular case, it was memory fragmentation. Because you have hundreds of machines, they're doing garbage collection slightly different times. And then they get slightly further apart and slightly more and more jittered until eventually they're all happening kind of at random times. And just like really messing up each one of your steps. So you just turn off garbage collection and call it a day, basically,SWYX [00:28:20]: to be honest.JOSH [00:28:20]: There's other things you can do if you want to be a little bit more sophisticated about it. But you can also just manuallyJONATHAN [00:28:25]: have it all garbage collect on some interval. Like that's what we've done. We just have a garbage collection callback that just runs. But I've seen the exact same thing.JOSH [00:28:33]: Yeah, yeah, exactly. So I thought that one was kind of funny. And we did trace that one down and look and we did find the actual call. Like, again, this goes to like having good tools. So we had really good tools where we could look at a bunch of like actual traces in C and be like, OK, cool. This is the thing that's taking a lot of time. Or like, you know, this is the thing that doesn't quite line up here. Like, oh, I guess it's garbage collection. OK, cool.SWYX [00:28:52]: Interesting.JOSH [00:28:52]: Yeah, let's just try taking it off.SWYX [00:28:54]: OK, great.JOSH [00:28:54]: That's what it was. Now we can fix it. So for each of them, like basically bugs are not hard if you have good tools. But if you don't have good tools, bugs can be very, very hard. So similarly for like heat, another thing that we saw was like, oh, you know, the CPU is getting throttled. OK, well, it's easy to see if you're monitoring the CPU throttling or monitoring the heat. If you're not monitoring that, it's really hard to know why it's just suddenly one of them is going slower. I noticed also in the pieceSWYX [00:29:17]: that you mentioned FSDP with 0.3. Actually, we met, I went to iClear and Guanhua from the DSP team was there presenting 0++. I was wondering if you want to make any call outs to, you know, particular open source or open library or open whatever implementation teams that were super helpful in your process. I think we ended up actuallyJOSH [00:29:39]: pulling from a whole bunch of different ones to pull things in into our own particular pipeline. So we use things from NVIDIA's, you know, Megatron stuff. We use stuff from probably DeepSpeed. I think we pulled in a bunch of different pieces from a bunch of different places. So it was really nice to see all these working open source like examples. I think I really appreciate all the effort that has gone into actually tuning these things because you can tune them, but it's a lot of work to like tune this stuff and do all this stuff from scratch. It's really nice to have like a working example. I think those are probably the two biggest ones, DeepSpeed and Megatron alone, but there are probably other ones as well.SWYX [00:30:13]: Is there a particular thing in the ecosystem where you would call out as like, you know, there should be something here that is open source, but like it's not really, it's like everyone kind of builds it on their own. I want to say something with the file system because everyone talks about the file system eventually.JOSH [00:30:28]: The file system actually was,SWYX [00:30:30]: I mean, we did somethingJOSH [00:30:31]: kind of dumb there. Like we have our own sort of local mirror so that we can, you know, like a crappy version of S3SWYX [00:30:38]: that's local,JOSH [00:30:38]: but it's just a pretty simple script, right?SWYX [00:30:41]: Like I think we run likeJOSH [00:30:41]: a little web server that just like serves files and then, you know, it can upload themSWYX [00:30:45]: and download them.JOSH [00:30:45]: Okay, great. And part of the reason we did that is that our internet connectionSWYX [00:30:50]: in the beginningJOSH [00:30:50]: was not the like full speedSWYX [00:30:52]: one that we wouldJOSH [00:30:52]: eventually have. And so we are a little bit more kind of bottlenecked in terms of internet bandwidth. And so we had this. I think we looked at a bunch of services out there like Minio and some other ones, but a lot of these like come with a lot of extra overhead and maintenance. And since we already have so much infrastructureSWYX [00:31:09]: to deal with,JOSH [00:31:09]: we kind of didn't want to, you know, bring in a whole other like cloud provider, virtualize something, something.SWYX [00:31:14]: We just wanted something simple.JOSH [00:31:14]: So we went with that, which has been quite helpful. Like our toolsSWYX [00:31:19]: are usually quite simple.JOSH [00:31:19]: It's like Bash and Python and SSH and Docker. Like we'd like to keep things simple so that's easier to debug, like less layers of infrastructure, less layers of abstraction, make it a lot easier to work with. Like we don't use Kubernetes,SWYX [00:31:30]: for example,JOSH [00:31:30]: and we just directly launch these things. And it's just been much easier to debug this way. One tool actually that does come into mind that I will call out is Kraken from Uber. That was great. We love that tool. We were a little bit skeptical. What is it?SWYX [00:31:44]: I'm sorry. Yeah.JOSH [00:31:45]: So Kraken is this, yeah, it's a distributed like Docker registry, basically, that uses BitTorrent to like transfer things between the machines in a sort of nice optimal way. Like in the very beginning, the naive way is like you have this one Docker registry, which was outside of the cluster. So every time we change an image, you know, there's many gigabytes that each of the 500 machines needs to download.SWYX [00:32:07]: So that just takesJOSH [00:32:07]: a really long time. So what this thing does is like just one of them downloads it and then like they all sort of broadcast all the pieces to each other. And it was just like a really nice, fast way of getting these images down. And it was very robust.SWYX [00:32:19]: Like there's a lotJOSH [00:32:19]: going on under the hood, but I think it's a pretty cool tool that we haven't really had any bugs with it at all. Amazing.SWYX [00:32:26]: Yeah. I mean, that's all my questions, I guess, for the info piece. I don't know if, John, you had something that you were sort of burning to ask or.JONATHAN [00:32:33]: No, all I can say is just sameSWYX [00:32:36]: in a lot of places, like, you know, and they're done thatJONATHAN [00:32:38]: seeing this plus one. I think the one big difference, you know, perhaps in philosophies is we've tried to basically standardize on as much commodity stuff as possible, just because, you know, I think the reason I asked about trying to do thisSWYX [00:32:50]: on multiple differentJONATHAN [00:32:50]: pieces of infrastructure is like, I think we're running on like six or seven different clouds right now. And everybody has done something slightly different. And my gosh, the little differences add up as you know, you've seen. And so, you know,SWYX [00:33:04]: our philosophy has been like, whatever the hellJONATHAN [00:33:05]: we can standardize, please let's standardize it. Like vanilla off the shelf FSDB.SWYX [00:33:10]: And like, you know,JONATHAN [00:33:10]: we wrote our own data loader, but we've tried to make that as much of a standard as we can across our infrastructure and in Databricks, because things just start getting really complicatedSWYX [00:33:18]: or like we useJONATHAN [00:33:18]: Kubernetes extensively because it at least gives us a uniform set of APIs. Like that's our hardware abstraction layer to a certain extent for everything else. So it's just, you know, a difference in philosophy there. But otherwise, like, yeah, this stuff is really, really hard. And I feel like we take for granted how much of this, you know, is done for us when you go and you just query chat GPT, for example. Like, oh my God, everything going on underneath that, you know, it's kind of a miracle that the machines boot up, let alone that you can like query a giant language model that's probably doing inference across multiple machines and was trained across thousands of machines. Like, you know, minor miracle.SWYX [00:33:54]: Yeah, it is an awesome amount of power that we invoke with a single API call that we take for granted these days. It's absurd. Yeah, I mean, like Kubernetes, like that point about Kubernetes, I will say as a former AWS employee, like it seems like it would be ideal for imbue to at some point make it more abstracted or agnostic because you're going to want to, you know, replicate your setup. We do have our ownJOSH [00:34:19]: sort of replacement. It's just a much simpler version of Kubernetes. Kubernetes is really designed for running services, not for running experiments. Like that's not its like main architecture. And so for us, like we have everything that's like, cool, you're going to run an experiment. So you want it to run to completion, right?SWYX [00:34:34]: OK, great.JOSH [00:34:34]: Like the primitives are sort of built around a slightly different style. And that makes it a lot easier, like just a lot simpler to fit that the nature of like these machines are going to disappear. They will need to be rebooted for infrastructure upgrades. They will like something will happen to the GPUs. Failure is like baked into this as like a core part of our infrastructure. So it's not that we don't have an abstraction. It's that it's a sort of simpler, more tailored abstraction for the particular work that we're doing.JONATHAN [00:34:58]: Yeah, I think it all depends on what your goals are. And like, I think the challenge in a lot of the deep learning stuff right now is that people are trying to like, people often build things that are more complicated than necessary to get the job done. And the complication is the enemy of everything. You know, don't use a fancier parallelism strategy than you have to. Don't use a fancier set of libraries than you have to.SWYX [00:35:18]: Don't do anythingJONATHAN [00:35:18]: that you don't have to do because it's hard enough as it is. Like, don't overcomplicateSWYX [00:35:23]: your own life.JONATHAN [00:35:23]: Don't try to bring in more tools or more fancy architecture tweaks if you absolutely don't have to.SWYX [00:35:29]: Like getting to the minimumJONATHAN [00:35:30]: necessary to get the job done. And it's really tempting to want to try to use everything. So like, I totally understand that one.SWYX [00:35:37]: I think the last piece I'll maybe call out is that I'm just going to weave this in just because I see the opportunity to do it. Are there any infrastructure shifts that need to be, that need to rise because of changing architecture? So I think, for example,SWYX [00:35:57]: you're announcing a dense model, a 70B dense model, whereas John just worked on DBRX and the image-to-text model, which presumably has different bottlenecks.JONATHAN [00:36:10]: That's correct for us. You know, we train both dense and mixture of expert models. The one we happened to, you know, kind of get permission to open source was a mixture of expert model. And those models are very demanding when it comes to network bandwidth, at least if you're training them in kind of FSTP 03 style, where there's just a lot of parameters getting shuffled back and forth. And your ratio of kind of compute to amount of data that you have to shuffle back and forth becomes a lot worse because you're now, you know, you're only using a fraction of the parameters for every token instead of all the parameters. And so we had to really push the envelope on getting all the stuff to the right places on time. And so actually the networking part of DBRX was the single hardest thing, I think, of the entire process. Just get MOE training, working at scale across a big cluster. We still managed to, I think, do it all with commodity parts, which was very exciting. You know, we were using FSTP and we eventually used HSTP so that we could have HSTP as a version of FSTP where you have multiple smaller replicas and you're doing data parallel within those replicas. And that helped a lot with network latency issues that we were running into just because we were transmitting so much data, you know, for every single part of the process. I think it actually, like, it was instructive for how Google designs their hardware and software together personally. Their training, as far as I understand, using kind of a 03 style of training and have been for a while. They also train mixture of expert models. TPUs have a very different network bandwidth to compute ratio. They have a lot more bandwidth just objectively. And TPUs per chip tend to be a little bit less compute intensive and have a little bit less memory. You know, it's just a different design choice. So the ratio of flops to bandwidth is very different. And that means that it's much easier for Google to be able to pull offSWYX [00:37:54]: some of this stuff.JONATHAN [00:37:54]: They also have interesting, you know, Torus style network architecture or Torus style, like, literal network architectureSWYX [00:38:00]: is not like the model,JONATHAN [00:38:00]: but the network.SWYX [00:38:02]: Is this the sort of block attention? I forgot what you call it. So this is just more or the,JONATHAN [00:38:07]: yeah, this is more, not the ring attention, but these are the ring all reduces. Like you have three different dimensions of rings because they kind of put you in these three dimensional Toruses from what I understand. And so like, you know, Google's infrastructure in some sense is kind of, I wouldn't say built for this, but maybe the way that Google trains models is built for a slightly different bit of infrastructure they have. And it's kind of neat to think about that. You know, as one thing that I think NVIDIA announced for, you know, for, for both the GH200 and the GB200 is this hybrid networking where you'll have blocks of NVLink network chips. I think for the GB200, I think it's like groups of 72 GPUs will all have NVLink to each other. So higher bandwidth, then you'll have normal networking of some kind, InfiniBand or Rocky or what have you between these blocks. And that's kind of a, you know, it's a change due to the fact that, you know, it's hard to build really high bandwidth networks over very large groups, but it is now a blocked networking. And you have to think about how you architect your model and your parallelism differently. You also have to think about fault tolerance differently because it now matters where you lose a GPU, whereas it didn't before. So, you know, it's, it's, it's just all really interesting and really fun speaking personally, but it's going to mean new nightmares when we all move to that generation and have to think about, you know, new versions of these problems.JOSH [00:39:20]: As you go up to larger scales, it gets quite different. Like right now, you know, if you're experiencing, let's say, for example, you experience a GPU failure every day, that's fine.SWYX [00:39:31]: Just restart.JOSH [00:39:31]: If you make your thing 24 times as big, now it's once an hour. Now it stops being quite as easy to just restart, right? So now you have to kind of break, like bake in this sort of redundancy that you didn't have before. So I think as you go up in scale, you end up running into like a lot of really interesting problems that also inform the, the actual like design. Yeah, I mean, as an orchestration guy,SWYX [00:39:52]: this is why I always emphasize like very cheap storage or very fast storage. So you can checkpoint more, but I don't think that's probably not the best solution to for fast, you know, training.JONATHAN [00:40:05]: Which works fine when you're doing language and then you move to vision or video. And then, you know, you have multi petabyte datasetsSWYX [00:40:12]: and getting, you know,JONATHAN [00:40:13]: cheap, fast multi petabyte storage starts to bite. Like I've certainly encountered issues where the literal data center where my GPUs were did not have enough, you know, object store to fit the datasets that people wanted to bring into that data center from whichever users were, were trying to bring them in. And then you get to a wholeSWYX [00:40:31]: different world of hurtJONATHAN [00:40:31]: where you have to keep your data in a different region because the region is just out of storage. So things get fun really fast.SWYX [00:40:39]: Speaking of vision, Josh, actually, you know, Embu is an agents company, but you're only, you're announcing a text-only model. What, where does, where does the vision side come in?JOSH [00:40:49]: I think we've actually done a lot of work in the past and people can see kind of our blog posts about sort of self-supervised learning and some other kind of vision-related stuff in the past as well. So we're very familiar with, with that stuff. But I think our main focus right now is on kind of, as we say, coding and reasoning. And there, there's certainly a visual component to some problems. But, you know, it's not necessarily required for all problems. And actually we found that for most of the kind of like code writing and, and reasoning problems that we care about, the visual part isn't really a huge important part of it. Sometimes if you really need to, you can maybe describeSWYX [00:41:24]: the thing.JOSH [00:41:24]: There are other like, you know, multimodal models that you can use off the shelf to sort of plug in for those particular piecesSWYX [00:41:30]: that you need, right?JOSH [00:41:30]: Like if something is driving a browser or whatever, like you can sometimes get away with not having to have that baked into the original model. So our folk were, you know, in a sense, we kind of do a lot across the stack. We're working on our own infrastructure and pre-training and RL and fine tuning and products and everything. But in another sense, we're very narrowly focused on the application side. So all of the stuff across the stack is kind of going toward a very particular purpose. And so that particular purpose right now doesn't really need vision. So we think that people are going to make all sorts of really cool image modelsSWYX [00:42:00]: like Jonathan, right?JOSH [00:42:00]: And all sorts of interesting multimodal models into the future. We'll let them go do that. That's great. We'll take advantage of that, partner with those people in the future. And right now we're really focused on kind of the core reasoning and coding capabilities and aspects of the model.SWYX [00:42:14]: I wanted to go into carbs since that's kind of the next layer of the stack. We talked about carbs in the first episode with Kanjin because you've actually had a blog post about it like a couple of years ago. Maybe let's introduce it.JONATHAN [00:42:26]: Has that been a couple of years now?JOSH [00:42:28]: No, it must have been at least one year. Hopefully it's not multiple years.SWYX [00:42:32]: Sorry, I'm counting AI time. Yeah, yeah. Yeah, I was going to sayJONATHAN [00:42:35]: you're making me feel really old right now.SWYX [00:42:39]: I count everything before the generally intelligent rename as like, you know, prehistory. Yeah. And now sort of modernity, right? So I actually thought carbs was more about hyperparameter optimization in a sense of like sort of parameters, hyperparameter search. Whereas, you know, when you introduced it, especially in this blog post, it's more about scaling laws and predictability of like, are we sort of in the right ballpark before we scale things up? Maybe sort of recount the history of carbs.JOSH [00:43:10]: Yeah, so it really is a little bit of both. So carbs is, it's maybe a backronym, but it's for cost aware Pareto region Bayesian search. So this is about technically how it works, but carbs is like, you know, we like pastries and stuff.SWYX [00:43:26]: So great, why not? But the point is thatJOSH [00:43:29]: it's a cost aware hyperparameter tuner. So most hyperparameter tuners, you kind of say, OK, here's this objective function. I want you to make this number as big as possible or as small as possible, whichever direction you want to go. So yeah, just go make this number, you know, as small as possible. OK, so it'll try a bunch of differentSWYX [00:43:46]: hyperparameters,JOSH [00:43:46]: a bunch of different configurationsSWYX [00:43:48]: to figure out, like,JOSH [00:43:48]: how do I tweak your network and architecture, et cetera, to get the kind of best performance I possibly can. That's usually saying, like, you know, almost all of these hyperparameter configurations are, let's say they're all going to use the same number of GPUs or the same number of nodes.SWYX [00:44:01]: So it's going to runJOSH [00:44:01]: for the same amount of time.SWYX [00:44:03]: So you can do that.JOSH [00:44:03]: You can get a number out and that's great. But what carbs does is it says,SWYX [00:44:07]: OK, actually,JOSH [00:44:07]: what if we relax that constraint? What if we say each of these different points, we're going to model how expensive it will be to sample this configuration. So if what if we train with just one one hundredth of the data? Like, how well can we do?SWYX [00:44:19]: What if we trainJOSH [00:44:19]: with one tenth of the data? What if we train with all the data? That way you can understand, like, as we get more and more data, as we spend more and more compute,SWYX [00:44:26]: as we make a biggerJOSH [00:44:26]: and bigger network, how does performance change with these things that change? Like how expensive it is to even explore this data point. So by doing that, we can see the scaling laws for not just, you know,SWYX [00:44:36]: the scaling lawsJOSH [00:44:36]: from like the, you know, Chantilla paper, the scaling laws for all parameters. We can see how does how does the number of layers change with this? How does the, you know, the learning rate change? How do the like, you know, various types of regularization change? So you can see these nice scaling laws. And as you're going across costs, like how should this be changing as you're scaling up your model? So that, coupled with the kind of metric that we chose, which is a very precise way of measuring performance, allowed us to really like hone in on parameters that worked really wellSWYX [00:45:05]: and understand, like,JOSH [00:45:05]: how do we want to scale those up, especially as we're changingSWYX [00:45:08]: things about the network?JOSH [00:45:08]: Like one of the things that we did is we used a custom tokenizer. As we change this tokenizer, changes a bunch of other things about the model. So how should we scale up this entirely new tokenizer? Like no one has ever made a model this large with this tokenizer before. And so how do we want toSWYX [00:45:22]: change all these things?JOSH [00:45:22]: Harps kind of shows you, like, look, as you change these parameters, like these other ones are kind of dependent on this.SWYX [00:45:28]: Like this is the, these areJOSH [00:45:28]: the relationships between them. So you can better understand, like, OK, if I'm going to scale this up 10x or 100x, like, where do I want to be? I can only go so far. And so, you know, we did run, like, I think maybe it was like a 14b one or somethingSWYX [00:45:40]: like that to check.JOSH [00:45:41]: But and so we had a bunch of like 1b or 14b and then at 70b. I don't think we had a, I think we just did like one at 14b. So you can, we get to check that like, oh, is this on the curve? Like, is this where we expect? It was like right there. So then great, go on to the next one. Yeah, I mean, that makes a lot of sense.SWYX [00:45:56]: I wonder if, so one of the key questions, and correct me if I'm wrong, but like usually people do search or do their evals just based on loss. But you actually evaluate based on, you know, the sort of end state evals that people might expect, like HellaSwag and Lombata, whatever. What is the norm here? Is there a norm?JOSH [00:46:20]: Yeah, I don't know if there's a hundred percent.SWYX [00:46:21]: I don't know. I only see loss on most people's reports.JOSH [00:46:25]: I think it's easy to, like, loss is very nice because it's very precise. It will tell you, like, very fine grained differences between like really small changes in your hyperparameters or network architecture. Whereas, especially at the smaller scales, if you're looking at like accuracy, it's very noisy. Like it might be zero or a hundred or like, you know, fluctuating by like 10 or 20 percentage points, which makes it really hard to tell, like, did that change actually mean anything? So our loss is sort of a combination of these two. Instead of saying, like, let's just look at perplexity, we say, let's look at perplexity on the tasks that we care about for multiple choice questions effectively.SWYX [00:47:00]: So we're saying like, yes,JOSH [00:47:00]: this is formulated as a multiple choice question, and we're going to look at the, like, you know, the loss of perplexity for this particular answer token. And that ends up being something that's like both targeted to what you actually care about and also very precise. The nice thing about this though is that it's independent of the data that you train on. One thing that's annoying about perplexity or about loss is that as you change your data set, this is really obnoxious because now it fundamentally changes your loss, right? And so you can't tell, like, how do I tweak my data set? But because we have this held out evaluation dat
What's up everyone, today we have the pleasure of sitting down with the acclaimed Britney Muller, Founder and Consultant at Data Sci 101 and former Senior SEO Scientist at Moz. Summary: Britney takes us on a wild ride through the intersection of marketing and AI, emphasizing the importance of adaptability, continuous learning, and ethical considerations. Britney's journey from SEO to AI illustrates the need for data literacy and strategic decision-making in marketing. She delves into the ethical nuances of AI, discussing the limitations of LLMs and the importance of transparency and responsible development. Highlighting the human element in AI, Britney advocates for balancing technological advancements with human creativity and intuition, and underscores the transformative potential of AI across various sectors. This episode is a compelling call to action for professionals to harmoniously blend technical expertise with ethical mindfulness in the rapidly evolving martech landscape.About BritneyBritney started her career when she moved to Breckenridge Colorado chasing fresh snow and snowboard hills. She connected with a local realtor who introduced her to SEO and after discovering search data, she never looked backShe spent 7 months preparing to rank her personal site for the term “Burton US Open” and ended up ranking ahead of Burton.com and received a call from their marketing team who invited her to dinner This spurred her to start her own agency which she ran for several successful years but after being on the cutting edge of SEO and doing the speaking circuit at conferences around the world, Britney started getting hungry for a new challenge: enter Machine LearningShe stumbled upon Harvard's Data Science 109 course after searching Github repos and dived super deep into this new field She was eventually poached by Moz where she spent 4 years as Senior SEO Scientist where she re-wrote the Beginner's Guide to SEO amongst a bunch of other content and continued her SEO researchShe later joined Hugging Face, the fastest-growing Machine Learning community & open-source ML platformToday Britney has returned to her entrepreneurial roots as a Machine Learning & SEO consultant and the Founder of Data Sci 101 with the goal of making LLMs like ChatGPT as accessible as possibleEmbracing Machine Learning: A Journey from SEO to AIBritney's journey from SEO expertise to machine learning is a testament to the power of curiosity and continuous learning. Nearly a decade ago, while most in the martech field were focused solely on traditional methods, Britney's unique passion for learning and experimentation led her to explore machine learning. This shift was fueled by her desire for a new challenge, as she felt she had reached the zenith of her SEO experiments.The pivotal moment came when she took the Harvard CS 109 course on machine learning. This experience opened her eyes to the transformative potential of feeding data to models and letting them learn patterns independently. The tangible results and potential applications she witnessed were not just intellectually stimulating but also professionally inspiring. As machine learning evolved, so did Britney's skills. She recalls the early days of TensorFlow, where complex lines of code were required for basic functions, which have now been simplified drastically.Britney's approach to machine learning is unique. She enjoys taking existing models and reengineering them for different applications, a process she describes as akin to being a 'Frankenstein developer.' This creative tinkering led to practical applications and fun experiments, like her first MNIST model, which could recognize handwritten numbers with high accuracy. Her pride in this achievement underscores her deep connection to her work and the joy it brings her.Key takeaway: Britney's transition from SEO to machine learning highlights the importance of pursuing passions and continuous learning in professional development. Her success stems from her willingness to embrace new challenges and innovate by reapplying existing technologies in novel ways. This story is a reminder that staying curious and adaptable is crucial in the ever-progressing field of martech.Data Literacy: Bridging the Gap in MarketingBritney's endeavor with Data Sci 101 aligns perfectly with her goals of educating the martech community and fostering a well-informed approach to AI and ML. She emphasizes the importance of statistical knowledge in marketing, a skill often overlooked in traditional marketing education. Britney's passion for sharing knowledge is driven by her discovery of the significant gap in data literacy within the marketing industry. This gap, she believes, hinders marketers from making more strategic decisions and finding better insights.Her approach to education in this field is both innovative and practical. Britney focuses on creating content that is engaging and accessible, breaking down complex topics into understandable segments. She draws inspiration from her friend Daisy Quaker's approach, emphasizing the need to repurpose extensive resources into more digestible formats - akin to turning a large turkey into multiple turkey sandwiches. This analogy perfectly encapsulates her method of making complex data science concepts more palatable for the average marketer.Britney's journey in educating others began with her own realization of the lack of statistical training in her marketing career. This led her to delve deeper into data science, allowing her to identify and address the gaps in knowledge within the marketing community. Her efforts are not just about imparting knowledge but also about empowering marketers to leverage data more effectively in their strategies.Key takeaway: Britney's initiative with Data Sci 101 highlights the critical need for data literacy in the marketing world. Her commitment to educating her peers about the importance of statistical knowledge and her innovative approach to content creation serve as a model for making complex subjects accessible and engaging. This endeavor not only enhances the skill set of marketers but also paves the way for more data-informed and strategic decision-making in the industry.Deciphering the Alien Nature of Large Language ModelsBritney's analogy of large language models (LLMs) as aliens provides a unique perspective on the intricacies of AI in the martech world. She recalls one of the more technical textbooks she read on LLMs and how the author compares LLMs to beings in a black cave, fed with the world's texts but lacking a true understanding of human experiences and languages' nuances. This vivid imagery conveys the idea that, while LLMs are proficient in processing and mimicking language patterns, they fall short in grasping the depth and context of real-world experiences and specialized knowledge.Britney's approach to explaining complex concepts through relatable analogies reflects her commitment to making the abstract more accessible. Her use of post-it notes to jot down everyday analogies like baseball references showcases her inventive method of communication. This approach is crucial in a field where the technology is often abstract and difficult for the average person to grasp.“LLMs are essentially aliens from a different universe: while they have access to all our world's text, they lack genuine comprehension of languages, nuances of our reality, and the intricacies of human experience and knowledge.” - Britney Muller, Introduction to LLMs, part 1. This alien analogy underlines a significant limitatio...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Current AIs Provide Nearly No Data Relevant to AGI Alignment, published by Thane Ruthenis on December 15, 2023 on The AI Alignment Forum. Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take). Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical. At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations. I believe this is invalid, and that the two views are much more compatible than might seem. I think the issue lies in the mismatch between their subject matters. It's clearer if you taboo the word "AI": The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems. The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm. It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck-in when people extrapolate findings from LLM studies to superintelligences. It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI. What the Fuss Is All About To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place. Consider humans. Some facts: Humans posses an outstanding ability to steer the world towards their goals, and that ability grows sharply with their "intelligence". Sure, there are specific talents, and "idiot savants". But broadly, there does seem to be a single variable that mediates a human's competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and crucially, be much better at achieving their real-life goals. Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them. Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don't neatly correspond to their innate instincts or basic values. Intricate moral frameworks, weird bullet-biting philosophies, and even essentially-arbitrary ideologies like cults. And when people with different values interact... People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside. People whose cultures evolved in mutual isolation often don't even view each other as human. Consider the history of xenophobia, colonization, culture shocks. So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these system can be strictly more powerfu...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Criticism of Singular Learning Theory, published by Joar Skalse on November 19, 2023 on The AI Alignment Forum. In this post, I will briefly give my criticism of Singular Learning Theory (SLT), and explain why I am skeptical of its significance. I will especially focus on the question of generalisation --- I do not believe that SLT offers any explanation of generalisation in neural networks. I will also briefly mention some of my other criticisms of SLT, describe some alternative solutions to the problems that SLT aims to tackle, and describe some related research problems which I would be more excited about. (I have been meaning to write this for almost 6 months now, since I attended the SLT workshop last June, but things have kept coming in the way.) For an overview of SLT, see this sequence. This post will also refer to the results described in this post, and will also occasionally touch on VC theory. However, I have tried to make it mostly self-contained. The Mystery of Generalisation First of all, what is the mystery of generalisation? The issue is this; neural networks are highly expressive, and typically overparameterised. In particular, when a real-world neural network is trained on a real-world dataset, it is typically the case that this network is able to express many functions which would fit the training data well, but which would generalise poorly. Moreover, among all functions which do fit the training data, there are more functions (by number) that generalise poorly, than functions that generalise well. And yet neural networks will typically find functions that generalise well. To make this point more intuitive, suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse. A simple hypothesis might be that some of the parameters in a neural network are redundant, so that even if it has 500,000 parameters, the dimensionality of the space of all functions which it can express is still less than 500,000. This is true. However, the magnitude of this effect is too small to solve the puzzle. If you get the MNIST training set, and assign random labels to the test data, and then try to fit the network to this function, you will find that this often can be done. This means that while neural networks have redundant parameters, they are still able to express more functions which generalise poorly, than functions which generalise well. Hence the puzzle. The anwer to this puzzle must be that neural networks have an inductive bias towards low-complexity functions. That is, among all functions which fit a given training set, neural networks are more likely to find a low-complexity function (and such functions are more likely to generalise well, as per Occam's Razor). The next question is where this inductive bias comes from, and how it works. Understanding this would let us better understand and predict the behaviour of neural networks, which would be very useful for AI alignment. I should also mention that generalisation only is mysterious when we have an amount of training data that is small relative to the overall expressivity of the learning machine. Classical statistical learning theory already tells us that any sufficiently well-behaved learning machine will generalise well in the limit of infinite training data. For an overview of these results, see this post. Thus, the quest...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Estimating effective dimensionality of MNIST models, published by Arjun Panickssery on November 2, 2023 on LessWrong. The local learning coefficent λ is a measure of a model's "effective dimensionality" that captures its complexity (more background here ). Lau et al recently described a sampling method (SGLD) using noisy gradients to find a stochastic estimate ^ λ in a computationally tractable way (good explanation here ): I present results ( Github repo ) of the "Task Variability" project suggested on the DevInterp project list . To see how degeneracy scales with task difficulty and model class, I trained a fully-connected MLP and a CNN (both with ~120k parameters) on nine MNIST variants with different subsets of the labels (just the labels {0, 1}, then {0, 1, 2}, etc.). All models were trained to convergence using the same number of training data points. I implement Lau et al's algorithm on each of the trained models. The results below are averaged over three runs: The results for the full MNIST dataset are comparable to Lau et al's results while using models ten times smaller trained for ten times fewer epochs. The sampling method is finicky and sensitive to the hyperparameter choices of learning rate and noise factor. It will fail by producing negative or very high ^ λ values if the noise ϵ and the distance penalty γ (see lines 6 and 7 in the pseudocode above) isn't calibrated to the model's level of convergence. The results show a linear scaling law relating number of labels to task complexity. The CNN typically has a lower ^ λ than the MLP, which matches intuitions that some of the complexity is "stored" in the architecture because the convolutions apply a useful prior on functions good at solving image recognition tasks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Estimating effective dimensionality of MNIST models, published by Arjun Panickssery on November 2, 2023 on LessWrong. The local learning coefficent λ is a measure of a model's "effective dimensionality" that captures its complexity (more background here ). Lau et al recently described a sampling method (SGLD) using noisy gradients to find a stochastic estimate ^ λ in a computationally tractable way (good explanation here ): I present results ( Github repo ) of the "Task Variability" project suggested on the DevInterp project list . To see how degeneracy scales with task difficulty and model class, I trained a fully-connected MLP and a CNN (both with ~120k parameters) on nine MNIST variants with different subsets of the labels (just the labels {0, 1}, then {0, 1, 2}, etc.). All models were trained to convergence using the same number of training data points. I implement Lau et al's algorithm on each of the trained models. The results below are averaged over three runs: The results for the full MNIST dataset are comparable to Lau et al's results while using models ten times smaller trained for ten times fewer epochs. The sampling method is finicky and sensitive to the hyperparameter choices of learning rate and noise factor. It will fail by producing negative or very high ^ λ values if the noise ϵ and the distance penalty γ (see lines 6 and 7 in the pseudocode above) isn't calibrated to the model's level of convergence. The results show a linear scaling law relating number of labels to task complexity. The CNN typically has a lower ^ λ than the MLP, which matches intuitions that some of the complexity is "stored" in the architecture because the convolutions apply a useful prior on functions good at solving image recognition tasks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
This is a recap of the top 10 posts on Hacker News on September 20th, 2023.This podcast was generated by wondercraft.ai(00:37): Car allergic to vanilla ice cream (2000)Original post: https://news.ycombinator.com/item?id=37584399&utm_source=wondercraft_ai(02:23): DALL·E 3Original post: https://news.ycombinator.com/item?id=37586900&utm_source=wondercraft_ai(04:14): My uBlock Origin filters to remove distractionsOriginal post: https://news.ycombinator.com/item?id=37584134&utm_source=wondercraft_ai(05:56): OpenTF is now OpenTofuOriginal post: https://news.ycombinator.com/item?id=37581132&utm_source=wondercraft_ai(07:50): We have successfully completed our migration to RAM-only VPN infrastructureOriginal post: https://news.ycombinator.com/item?id=37581485&utm_source=wondercraft_ai(09:40): 78% MNIST accuracy using GZIP in under 10 lines of codeOriginal post: https://news.ycombinator.com/item?id=37583593&utm_source=wondercraft_ai(11:20): Toyota Research claims breakthrough in teaching robots new behaviorsOriginal post: https://news.ycombinator.com/item?id=37586264&utm_source=wondercraft_ai(13:36): British journalist held by police at Luton airport for five hours without arrestOriginal post: https://news.ycombinator.com/item?id=37584038&utm_source=wondercraft_ai(15:31): Gokrazy is coolOriginal post: https://news.ycombinator.com/item?id=37583234&utm_source=wondercraft_ai(17:30): ‘Less than half' fresh produce sold globally makes any profitOriginal post: https://news.ycombinator.com/item?id=37581316&utm_source=wondercraft_aiThis is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai
Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We're collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!If AI is so important, why is its software so bad?This was the motivating question for Chris Lattner as he reconnected with his product counterpart on Tensorflow, Tim Davis, and started working on a modular solution to the problem of sprawling, monolithic, fragmented platforms in AI development. They announced a $30m seed in 2022 and, following their successful double launch of Modular/Mojo
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.25.550525v1?rss=1 Authors: Stuck, M., Naud, R. Abstract: The need for energy-efficient solutions in Deep Neural Network (DNN) applications has led to a growing interest in Spiking Neural Networks (SNNs) implemented in neuromorphic hardware. The Burstprop algorithm enables online and local learning in hierarchical networks, and therefore can potentially be implemented in neuromorphic hardware. This work presents an adaptation of the algorithm for training hierarchical SNNs on MNIST. Our implementation requires an order of magnitude fewer neurons than the previous ones. While Burstprop outperforms Spike-timing dependent plasticity (STDP), it falls short compared to training with backpropagation through time (BPTT). This work establishes a foundation for further improvements in the Burstprop algorithm, developing such algorithms is essential for achieving energy-efficient machine learning in neuromorphic hardware. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.24.550365v1?rss=1 Authors: Galella, S., Ardid, S. Abstract: The ability to switch between tasks effectively in response to external stimuli is a hallmark of cognitive control. Our brain can filter and integrate external information to accomplish goal-directed behavior. Task switching occurs rapidly and efficiently, allowing us to perform multiple tasks with ease. Similarly, artificial neural networks can be tailored to exhibit multi-task capabilities and achieve high performance across domains. In terms of explainability, understanding how neural networks make predictions is crucial for their deployment in many real-world scenarios. In this study, we delve into neural representations learned by task-switching networks, which use task-specific bias for multitasking. Task-specific biases, mediated by context inputs, are learned by alternating the tasks the neural network learns during training. By using the MNIST dataset and binary tasks, we find that task-switching networks produce representations that resemble other multitasking paradigms, namely parallel networks in the early stages of processing and sequential networks in the last stages, respectively. We analyze the importance of inserting task contexts in different stages of processing and its role in aligning the task with relevant features. Moreover, we visualize how networks generalize neural representations during task-switching for different tasks. The use of context inputs improves the interpretability of simple neural networks for multitasking, helping to pave the way for the future study of architectures and tasks of higher complexity. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on LessWrong. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking longer about a...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on The AI Alignment Forum. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking l...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on LessWrong. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking longer about a...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by Stefan Heimersheim on May 9, 2023 on The AI Alignment Forum. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points through...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by StefanHex on May 9, 2023 on LessWrong. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points throughout the analysis. PCA ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by StefanHex on May 9, 2023 on LessWrong. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points throughout the analysis. PCA ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploring the Lottery Ticket Hypothesis, published by Rauno Arike on April 25, 2023 on LessWrong. I have recently been fascinated by the breadth of important mysteries in deep learning, including deep double descent and phase changes, that could be explained by a curious conjectured property of neural networks called the lottery ticket hypothesis. Despite this explanatory potential, however, I haven't seen much discussion about the evidence behind and the implications of this hypothesis in the alignment community. Being confused about these things motivated me to conduct my own survey of the phenomenon, which resulted in this post. The Lottery Ticket Hypothesis, explained in one minute The lottery ticket hypothesis (LTH) was originally proposed in a paper by Frankle and Carbin (2018): A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The authors call such subnetworks "winning lottery tickets". As the simplest example, they train a LeNet-300-100 model on the MNIST dataset and report that a network containing only 21.1% of the weights of the dense version reaches a higher test accuracy on less training iterations, while a network where only 3.6% of the weights remain performs almost identically to the dense network of the network. The lottery ticket hypothesis extends a long line of work in neural network pruning, a technique proposed in as early as 1990 by LeCun et al. Pruning simply means deleting some fraction of unimportant weights from the network after training to make inference more efficient. The vital insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training to make both training and inference more efficient. In practice, the method that Frankle and Carbin used for finding winning tickets didn't yet eliminate the need to train the full network, instead just suggesting the possibility. The technique they used is iterative pruning, a procedure that roughly looks as follows: Train the full dense network on some classification task Prune out some fraction of the weights with the smallest magnitude Reinitialize the remaining weights to their original values Repeat the same procedure for a number of times This is computationally quite expensive, but as we'll see below, alternative approaches have been proposed later on. The paper also defined a stronger version of the hypothesis that they named the lottery ticket conjecture: We extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket. In the next section, I'll argue that making a distinction between the hypothesis and the conjecture appears to be quite important. Relevance to alignment Phase changes In his post about mechanistically interpreting grokking, Neel Nanda argues that the lottery ticket hypothesis may constitute the reason for why neural networks form sophisticated circuits. One may naively think that neural networks are optimized in a similar way to how linear regression classifiers are optimized: each weight slowly changes in a direction that marginally improves performance, and the result of these tiny individual improvements smoothly improve the performance of the entire ensemble. In practice, though, we observe the forming of sophisticated circuits like the induction circuit, which is comprised of a previous token head and an induction head. Either of those heads improves loss only in the case the other ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploring the Lottery Ticket Hypothesis, published by Rauno Arike on April 25, 2023 on LessWrong. I have recently been fascinated by the breadth of important mysteries in deep learning, including deep double descent and phase changes, that could be explained by a curious conjectured property of neural networks called the lottery ticket hypothesis. Despite this explanatory potential, however, I haven't seen much discussion about the evidence behind and the implications of this hypothesis in the alignment community. Being confused about these things motivated me to conduct my own survey of the phenomenon, which resulted in this post. The Lottery Ticket Hypothesis, explained in one minute The lottery ticket hypothesis (LTH) was originally proposed in a paper by Frankle and Carbin (2018): A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The authors call such subnetworks "winning lottery tickets". As the simplest example, they train a LeNet-300-100 model on the MNIST dataset and report that a network containing only 21.1% of the weights of the dense version reaches a higher test accuracy on less training iterations, while a network where only 3.6% of the weights remain performs almost identically to the dense network of the network. The lottery ticket hypothesis extends a long line of work in neural network pruning, a technique proposed in as early as 1990 by LeCun et al. Pruning simply means deleting some fraction of unimportant weights from the network after training to make inference more efficient. The vital insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training to make both training and inference more efficient. In practice, the method that Frankle and Carbin used for finding winning tickets didn't yet eliminate the need to train the full network, instead just suggesting the possibility. The technique they used is iterative pruning, a procedure that roughly looks as follows: Train the full dense network on some classification task Prune out some fraction of the weights with the smallest magnitude Reinitialize the remaining weights to their original values Repeat the same procedure for a number of times This is computationally quite expensive, but as we'll see below, alternative approaches have been proposed later on. The paper also defined a stronger version of the hypothesis that they named the lottery ticket conjecture: We extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket. In the next section, I'll argue that making a distinction between the hypothesis and the conjecture appears to be quite important. Relevance to alignment Phase changes In his post about mechanistically interpreting grokking, Neel Nanda argues that the lottery ticket hypothesis may constitute the reason for why neural networks form sophisticated circuits. One may naively think that neural networks are optimized in a similar way to how linear regression classifiers are optimized: each weight slowly changes in a direction that marginally improves performance, and the result of these tiny individual improvements smoothly improve the performance of the entire ensemble. In practice, though, we observe the forming of sophisticated circuits like the induction circuit, which is comprised of a previous token head and an induction head. Either of those heads improves loss only in the case the other ...
2023 is the year of Multimodal AI, and Latent Space is going multimodal too! * This podcast comes with a video demo at the 1hr mark and it's a good excuse to launch our YouTube - please subscribe! * We are also holding two events in San Francisco — the first AI | UX meetup next week (already full; we'll send a recap here on the newsletter) and Latent Space Liftoff Day on May 4th (signup here; but get in touch if you have a high profile launch you'd like to make). * We also joined the Chroma/OpenAI ChatGPT Plugins Hackathon last week where we won the Turing and Replit awards and met some of you in person!This post featured on Hacker News.Out of the five senses of the human body, I'd put sight at the very top. But weirdly when it comes to AI, Computer Vision has felt left out of the recent wave compared to image generation, text reasoning, and even audio transcription. We got our first taste of it with the OCR capabilities demo in the GPT-4 Developer Livestream, but to date GPT-4's vision capability has not yet been released. Meta AI leapfrogged OpenAI and everyone else by fully open sourcing their Segment Anything Model (SAM) last week, complete with paper, model, weights, data (6x more images and 400x more masks than OpenImages), and a very slick demo website. This is a marked change to their previous LLaMA release, which was not commercially licensed. The response has been ecstatic:SAM was the talk of the town at the ChatGPT Plugins Hackathon and I was fortunate enough to book Joseph Nelson who was frantically integrating SAM into Roboflow this past weekend. As a passionate instructor, hacker, and founder, Joseph is possibly the single best person in the world to bring the rest of us up to speed on the state of Computer Vision and the implications of SAM. I was already a fan of him from his previous pod with (hopefully future guest) Beyang Liu of Sourcegraph, so this served as a personal catchup as well. Enjoy! and let us know what other news/models/guests you'd like to have us discuss! - swyxRecorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold.Show Notes* Joseph's links: Twitter, Linkedin, Personal* Sourcegraph Podcast and Game Theory Story* Represently* Roboflow at Pioneer and YCombinator* Udacity Self Driving Car dataset story* Computer Vision Annotation Formats* SAM recap - top things to know for those living in a cave* https://segment-anything.com/* https://segment-anything.com/demo* https://arxiv.org/pdf/2304.02643.pdf * https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/* https://blog.roboflow.com/segment-anything-breakdown/* https://ai.facebook.com/datasets/segment-anything/* Ask Roboflow https://ask.roboflow.ai/* GPT-4 Multimodal https://blog.roboflow.com/gpt-4-impact-speculation/Cut for time:* WSJ mention* Des Moines Register story* All In Pod: timestamped mention* In Forbes: underrepresented investors in Series A* Roboflow greatest hits* https://blog.roboflow.com/mountain-dew-contest-computer-vision/* https://blog.roboflow.com/self-driving-car-dataset-missing-pedestrians/* https://blog.roboflow.com/nerualhash-collision/ and Apple CSAM issue * https://www.rf100.org/Timestamps* [00:00:19] Introducing Joseph* [00:02:28] Why Iowa* [00:05:52] Origin of Roboflow* [00:16:12] Why Computer Vision* [00:17:50] Computer Vision Use Cases* [00:26:15] The Economics of Annotation/Segmentation* [00:32:17] Computer Vision Annotation Formats* [00:36:41] Intro to Computer Vision & Segmentation* [00:39:08] YOLO* [00:44:44] World Knowledge of Foundation Models* [00:46:21] Segment Anything Model* [00:51:29] SAM: Zero Shot Transfer* [00:51:53] SAM: Promptability* [00:53:24] SAM: Model Assisted Labeling* [00:56:03] SAM doesn't have labels* [00:59:23] Labeling on the Browser* [01:00:28] Roboflow + SAM Video Demo * [01:07:27] Future Predictions* [01:08:04] GPT4 Multimodality* [01:09:27] Remaining Hard Problems* [01:13:57] Ask Roboflow (2019)* [01:15:26] How to keep up in AITranscripts[00:00:00] Hello everyone. It is me swyx and I'm here with Joseph Nelson. Hey, welcome to the studio. It's nice. Thanks so much having me. We, uh, have a professional setup in here.[00:00:19] Introducing Joseph[00:00:19] Joseph, you and I have known each other online for a little bit. I first heard about you on the Source Graph podcast with bian and I highly, highly recommend that there's a really good game theory story that is the best YC application story I've ever heard and I won't tease further cuz they should go listen to that.[00:00:36] What do you think? It's a good story. It's a good story. It's a good story. So you got your Bachelor of Economics from George Washington, by the way. Fun fact. I'm also an econ major as well. You are very politically active, I guess you, you did a lot of, um, interning in political offices and you were responding to, um, the, the, the sheer amount of load that the Congress people have in terms of the, the support.[00:01:00] So you built, representing, which is Zendesk for Congress. And, uh, I liked in your source guide podcast how you talked about how being more responsive to, to constituents is always a good thing no matter what side of the aisle you're on. You also had a sideline as a data science instructor at General Assembly.[00:01:18] As a consultant in your own consultancy, and you also did a bunch of hackathon stuff with Magic Sudoku, which is your transition from N L P into computer vision. And apparently at TechCrunch Disrupt, disrupt in 2019, you tried to add chess and that was your whole villain origin story for, Hey, computer vision's too hard.[00:01:36] That's full, the platform to do that. Uh, and now you're co-founder c e o of RoboFlow. So that's your bio. Um, what's not in there that[00:01:43] people should know about you? One key thing that people realize within maybe five minutes of meeting me, uh, I'm from Iowa. Yes. And it's like a funnily novel thing. I mean, you know, growing up in Iowa, it's like everyone you know is from Iowa.[00:01:56] But then when I left to go to school, there was not that many Iowans at gw and people were like, oh, like you're, you're Iowa Joe. Like, you know, how'd you find out about this school out here? I was like, oh, well the Pony Express was running that day, so I was able to send. So I really like to lean into it.[00:02:11] And so you kind of become a default ambassador for places that. People don't meet a lot of other people from, so I've kind of taken that upon myself to just make it be a, a part of my identity. So, you know, my handle everywhere Joseph of Iowa, like I I, you can probably find my social security number just from knowing that that's my handle.[00:02:25] Cuz I put it plastered everywhere. So that's, that's probably like one thing.[00:02:28] Why Iowa[00:02:28] What's your best pitch for Iowa? Like why is[00:02:30] Iowa awesome? The people Iowa's filled with people that genuinely care. You know, if you're waiting a long line, someone's gonna strike up a conversation, kinda ask how you were Devrel and it's just like a really genuine place.[00:02:40] It was a wonderful place to grow up too at the time, you know, I thought it was like, uh, yeah, I was kind of embarrassed and then be from there. And then I actually kinda looking back it's like, wow, you know, there's good schools, smart people friendly. The, uh, high school that I went to actually Ben Silverman, the CEO and, or I guess former CEO and co-founder of Pinterest and I have the same teachers in high school at different.[00:03:01] The co-founder, or excuse me, the creator of crispr, the gene editing technique, Dr. Jennifer. Doudna. Oh, so that's the patent debate. There's Doudna. Oh, and then there's Fang Zang. Uh, okay. Yeah. Yeah. So Dr. Fang Zang, who I think ultimately won the patent war, uh, but is also from the same high school.[00:03:18] Well, she won the patent, but Jennifer won the[00:03:20] prize.[00:03:21] I think that's probably, I think that's probably, I, I mean I looked into it a little closely. I think it was something like she won the patent for CRISPR first existing and then Feng got it for, uh, first use on humans, which I guess for commercial reasons is the, perhaps more, more interesting one. But I dunno, biolife Sciences, is that my area of expertise?[00:03:38] Yep. Knowing people that came from Iowa that do cool things, certainly is. Yes. So I'll claim it. Um, but yeah, I, I, we, um, at Roble actually, we're, we're bringing the full team to Iowa for the very first time this last week of, of April. And, well, folks from like Scotland all over, that's your company[00:03:54] retreat.[00:03:54] The Iowa,[00:03:55] yeah. Nice. Well, so we do two a year. You know, we've done Miami, we've done. Some of the smaller teams have done like Nashville or Austin or these sorts of places, but we said, you know, let's bring it back to kinda the origin and the roots. Uh, and we'll, we'll bring the full team to, to Des Moines, Iowa.[00:04:13] So, yeah, like I was mentioning, folks from California to Scotland and many places in between are all gonna descend upon Des Moines for a week of, uh, learning and working. So maybe you can check in with those folks. If, what do they, what do they decide and interpret about what's cool. Our state. Well, one thing, are you actually headquartered in Des Moines on paper?[00:04:30] Yes. Yeah.[00:04:30] Isn't that amazing? That's like everyone's Delaware and you're like,[00:04:33] so doing research. Well, we're, we're incorporated in Delaware. Okay. We we're Delaware Sea like, uh, most companies, but our headquarters Yeah. Is in Des Moines. And part of that's a few things. One, it's like, you know, there's this nice Iowa pride.[00:04:43] And second is, uh, Brad and I both grew up in Brad Mc, co-founder and I grew up in, in Des Moines. And we met each other in the year 2000. We looked it up for the, the YC app. So, you know, I think, I guess more of my life I've known Brad than not, uh, which is kind of crazy. Wow. And during yc, we did it during 2020, so it was like the height of Covid.[00:05:01] And so we actually got a house in Des Moines and lived, worked outta there. I mean, more credit to. So I moved back. I was living in DC at the time, I moved back to to Des Moines. Brad was living in Des Moines, but he moved out of a house with his. To move into what we called our hacker house. And then we had one, uh, member of the team as well, Jacob Sorowitz, who moved from Minneapolis down to Des Moines for the summer.[00:05:21] And frankly, uh, code was a great time to, to build a YC company cuz there wasn't much else to do. I mean, it's kinda like wash your groceries and code. It's sort of the, that was the routine[00:05:30] and you can use, uh, computer vision to help with your groceries as well.[00:05:33] That's exactly right. Tell me what to make.[00:05:35] What's in my fridge? What should I cook? Oh, we'll, we'll, we'll cover[00:05:37] that for with the G P T four, uh, stuff. Exactly. Okay. So you have been featured with in a lot of press events. Uh, but maybe we'll just cover the origin story a little bit in a little bit more detail. So we'll, we'll cover robo flow and then we'll cover, we'll go into segment anything.[00:05:52] Origin of Roboflow[00:05:52] But, uh, I think it's important for people to understand. Robo just because it gives people context for what you're about to show us at the end of the podcast. So Magic Sudoku tc, uh, techers Disrupt, and then you go, you join Pioneer, which is Dan Gross's, um, YC before yc.[00:06:07] Yeah. That's how I think about it.[00:06:08] Yeah, that's a good way. That's a good description of it. Yeah. So I mean, robo flow kind of starts as you mentioned with this magic Sudoku thing. So you mentioned one of my prior business was a company called Represent, and you nailed it. I mean, US Congress gets 80 million messages a year. We built tools that auto sorted them.[00:06:23] They didn't use any intelligent auto sorting. And this is somewhat a solved problem in natural language processing of doing topic modeling or grouping together similar sentiment and things like this. And as you mentioned, I'd like, I worked in DC for a bit and been exposed to some of these problems and when I was like, oh, you know, with programming you can build solutions.[00:06:40] And I think the US Congress is, you know, the US kind of United States is a support center, if you will, and the United States is sports center runs on pretty old software, so mm-hmm. We, um, we built a product for that. It was actually at the time when I was working on representing. Brad, his prior business, um, is a social games company called Hatchlings.[00:07:00] Uh, he phoned me in, in 2017, apple had released augmented reality kit AR kit. And Brad and I are both kind of serial hackers, like I like to go to hackathons, don't really understand new technology until he build something with them type folks. And when AR Kit came out, Brad decided he wanted to build a game with it that would solve Sudoku puzzles.[00:07:19] And the idea of the game would be you take your phone, you hover hold it over top of a Sudoku puzzle, it recognizes the state of the board where it is, and then it fills it all in just right before your eyes. And he phoned me and I was like, Brad, this sounds awesome and sounds like you kinda got it figured out.[00:07:34] What, what's, uh, what, what do you think I can do here? It's like, well, the machine learning piece of this is the part that I'm most uncertain about. Uh, doing the digit recognition and, um, filling in some of those results. I was like, well, I mean digit recognition's like the hell of world of, of computer vision.[00:07:48] That's Yeah, yeah, MNIST, right. So I was like, that that part should be the, the easy part. I was like, ah, I'm, he's like, I'm not so super sure, but. You know, the other parts, the mobile ar game mechanics, I've got pretty well figured out. I was like, I, I think you're wrong. I think you're thinking about the hard part is the easy part.[00:08:02] And he is like, no, you're wrong. The hard part is the easy part. And so long story short, we built this thing and released Magic Sudoku and it kind of caught the Internet's attention of what you could do with augmented reality and, and with computer vision. It, you know, made it to the front ofer and some subreddits it run Product Hunt Air app of the year.[00:08:20] And it was really a, a flash in the pan type app, right? Like we were both running separate companies at the time and mostly wanted to toy around with, with new technology. And, um, kind of a fun fact about Magic Sudoku winning product Hunt Air app of the year. That was the same year that I think the model three came out.[00:08:34] And so Elon Musk won a Golden Kitty who we joked that we share an award with, with Elon Musk. Um, the thinking there was that this is gonna set off a, a revolution of if two random engineers can put together something that makes something, makes a game programmable and at interactive, then surely lots of other engineers will.[00:08:53] Do similar of adding programmable layers on top of real world objects around us. Earlier we were joking about objects in your fridge, you know, and automatically generating recipes and these sorts of things. And like I said, that was 2017. Roboflow was actually co-found, or I guess like incorporated in, in 2019.[00:09:09] So we put this out there, nothing really happened. We went back to our day jobs of, of running our respective businesses, I sold Represently and then as you mentioned, kind of did like consulting stuff to figure out the next sort of thing to, to work on, to get exposed to various problems. Brad appointed a new CEO at his prior business and we got together that summer of 2019.[00:09:27] We said, Hey, you know, maybe we should return to that idea that caught a lot of people's attention and shows what's possible. And you know what, what kind of gives, like the future is here. And we have no one's done anything since. No one's done anything. So why is, why are there not these, these apps proliferated everywhere.[00:09:42] Yeah. And so we said, you know, what we'll do is, um, to add this software layer to the real world. Will build, um, kinda like a super app where if you pointed it at anything, it will recognize it and then you can interact with it. We'll release a developer platform and allow people to make their own interfaces, interactivity for whatever object they're looking at.[00:10:04] And we decided to start with board games because one, we had a little bit of history there with, with Sudoku two, there's social by default. So if one person, you know finds it, then they'd probably share it among their friend. Group three. There's actually relatively few barriers to entry aside from like, you know, using someone else's brand name in your, your marketing materials.[00:10:19] Yeah. But other than that, there's no real, uh, inhibitors to getting things going and, and four, it's, it's just fun. It would be something that'd be bring us enjoyment to work on. So we spent that summer making, uh, boggle the four by four word game provable, where, you know, unlike Magic Sudoku, which to be clear, totally ruins the game, uh, you, you have to solve Sudoku puzzle.[00:10:40] You don't need to do anything else. But with Boggle, if you and I are playing, we might not find all of the words that adjacent letter tiles. Unveil. So if we have a, an AI tell us, Hey, here's like the best combination of letters that make high scoring words. And so we, we made boggle and released it and that, and that did okay.[00:10:56] I mean maybe the most interesting story was there's a English as a second language program in, in Canada that picked it up and used it as a part of their curriculum to like build vocabulary, which I thought was kind of inspiring. Example, and what happens just when you put things on the internet and then.[00:11:09] We wanted to build one for chess. So this is where you mentioned we went to 2019. TechCrunch Disrupt TechCrunch. Disrupt holds a Hackathon. And this is actually, you know, when Brad and I say we really became co-founders, because we fly out to San Francisco, we rent a hotel room in the Tenderloin. We, uh, we, we, uh, have one room and there's like one, there's room for one bed, and then we're like, oh, you said there was a cot, you know, on the, on the listing.[00:11:32] So they like give us a little, a little cot, the end of the cot, like bled and over into like the bathroom. So like there I am sleeping on the cot with like my head in the bathroom and the Tenderloin, you know, fortunately we're at a hackathon glamorous. Yeah. There wasn't, there wasn't a ton of sleep to be had.[00:11:46] There is, you know, we're, we're just like making and, and shipping these, these sorts of many[00:11:50] people with this hack. So I've never been to one of these things, but[00:11:52] they're huge. Right? Yeah. The Disrupt Hackathon, um, I don't, I don't know numbers, but few hundreds, you know, classically had been a place where it launched a lot of famous Yeah.[00:12:01] Sort of flare. Yeah. And I think it's, you know, kind of slowed down as a place for true company generation. But for us, Brad and I, who likes just doing hackathons, being, making things in compressed time skills, it seemed like a, a fun thing to do. And like I said, we'd been working on things, but it was only there that like, you're, you're stuck in a maybe not so great glamorous situation together and you're just there to make a, a program and you wanna make it be the best and compete against others.[00:12:26] And so we add support to the app that we were called was called Board Boss. We couldn't call it anything with Boggle cause of IP rights were called. So we called it Board Boss and it supported Boggle and then we were gonna support chess, which, you know, has no IP rights around it. Uh, it's an open game.[00:12:39] And we did so in 48 hours, we built an app that, or added fit capability to. Point your phone at a chess board. It understands the state of the chess board and converts it to um, a known notation. Then it passes that to stock fish, the open source chess engine for making move recommendations and it makes move recommendations to, to players.[00:13:00] So you could either play against like an ammunition to AI or improve your own game. We learn that one of the key ways users like to use this was just to record their games. Cuz it's almost like reviewing game film of what you should have done differently. Game. Yeah, yeah, exactly. And I guess the highlight of, uh, of chess Boss was, you know, we get to the first round of judging, we get to the second round of judging.[00:13:16] And during the second round of judging, that's when like, TechCrunch kind of brings around like some like celebs and stuff. They'll come by. Evan Spiegel drops by Ooh. Oh, and he uh, he comes up to our, our, our booth and um, he's like, oh, so what does, what does this all do? And you know, he takes an interest in it cuz the underpinnings of, of AR interacting with the.[00:13:33] And, uh, he is kinda like, you know, I could use this to like cheat on chess with my friends. And we're like, well, you know, that wasn't exactly the, the thesis of why we made it, but glad that, uh, at least you think it's kind of neat. Um, wait, but he already started Snapchat by then? Oh, yeah. Oh yeah. This, this is 2019, I think.[00:13:49] Oh, okay, okay. Yeah, he was kind of just checking out things that were new and, and judging didn't end up winning any, um, awards within Disrupt, but I think what we won was actually. Maybe more important maybe like the, the quote, like the co-founders medal along the way. Yep. The friends we made along the way there we go to, to play to the meme.[00:14:06] I would've preferred to win, to be clear. Yes. You played a win. So you did win, uh,[00:14:11] $15,000 from some Des Moines, uh, con[00:14:14] contest. Yeah. Yeah. The, uh, that was nice. Yeah. Slightly after that we did, we did win. Um, some, some grants and some other things for some of the work that we've been doing. John Papa John supporting the, uh, the local tech scene.[00:14:24] Yeah. Well, so there's not the one you're thinking of. Okay. Uh, there's a guy whose name is Papa John, like that's his, that's his, that's his last name. His first name is John. So it's not the Papa John's you're thinking of that has some problematic undertones. It's like this guy who's totally different. I feel bad for him.[00:14:38] His press must just be like, oh, uh, all over the place. But yeah, he's this figure in the Iowa entrepreneurial scene who, um, he actually was like doing SPACs before they were cool and these sorts of things, but yeah, he funds like grants that encourage entrepreneurship in the state. And since we'd done YC and in the state, we were eligible for some of the awards that they were providing.[00:14:56] But yeah, it was disrupt that we realized, you know, um, the tools that we made, you know, it took us better part of a summer to add Boggle support and it took us 48 hours to add chest support. So adding the ability for programmable interfaces for any object, we built a lot of those internal tools and our apps were kind of doing like the very famous shark fin where like it picks up really fast, then it kind of like slowly peters off.[00:15:20] Mm-hmm. And so we're like, okay, if we're getting these like shark fin graphs, we gotta try something different. Um, there's something different. I remember like the week before Thanksgiving 2019 sitting down and we wrote this Readme for, actually it's still the Readme at the base repo of Robo Flow today has spent relatively unedited of the manifesto.[00:15:36] Like, we're gonna build tools that enable people to make the world programmable. And there's like six phases and, you know, there's still, uh, many, many, many phases to go into what we wrote even at that time to, to present. But it's largely been, um, right in line with what we thought we would, we would do, which is give engineers the tools to add software to real world objects, which is largely predicated on computer vision. So finding the right images, getting the right sorts of video frames, maybe annotating them, uh, finding the right sort of models to use to do this, monitoring the performance, all these sorts of things. And that from, I mean, we released that in early 2020, and it's kind of, that's what's really started to click.[00:16:12] Why Computer Vision[00:16:12] Awesome. I think we should just kind[00:16:13] of[00:16:14] go right into where you are today and like the, the products that you offer, just just to give people an overview and then we can go into the, the SAM stuff. So what is the clear, concise elevator pitch? I think you mentioned a bunch of things like make the world programmable so you don't ha like computer vision is a means to an end.[00:16:30] Like there's, there's something beyond that. Yeah.[00:16:32] I mean, the, the big picture mission for the business and the company and what we're working on is, is making the world programmable, making it read and write and interactive, kind of more entertaining, more e. More fun and computer vision is the technology by which we can achieve that pretty quickly.[00:16:48] So like the one liner for the, the product in, in the company is providing engineers with the tools for data and models to build programmable interfaces. Um, and that can be workflows, that could be the, uh, data processing, it could be the actual model training. But yeah, Rob helps you use production ready computer vision workflows fast.[00:17:10] And I like that.[00:17:11] In part of your other pitch that I've heard, uh, is that you basically scale from the very smallest scales to the very largest scales, right? Like the sort of microbiology use case all the way to[00:17:20] astronomy. Yeah. Yeah. The, the joke that I like to make is like anything, um, underneath a microscope and, and through a telescope and everything in between needs to, needs to be seen.[00:17:27] I mean, we have people that run models in outer space, uh, underwater remote places under supervision and, and known places. The crazy thing is that like, All parts of, of not just the world, but the universe need to be observed and understood and acted upon. So vision is gonna be, I dunno, I feel like we're in the very, very, very beginnings of all the ways we're gonna see it.[00:17:50] Computer Vision Use Cases[00:17:50] Awesome. Let's go into a lo a few like top use cases, cuz I think that really helps to like highlight the big names that you've, big logos that you've already got. I've got Walmart and Cardinal Health, but I don't, I don't know if you wanna pull out any other names, like, just to illustrate, because the reason by the way, the reason I think that a lot of developers don't get into computer vision is because they think they don't need it.[00:18:11] Um, or they think like, oh, like when I do robotics, I'll do it. But I think if, if you see like the breadth of use cases, then you get a little bit more inspiration as to like, oh, I can use[00:18:19] CVS lfa. Yeah. It's kind of like, um, you know, by giving, by making it be so straightforward to use vision, it becomes almost like a given that it's a set of features that you could power on top of it.[00:18:32] And like you mentioned, there's, yeah, there's Fortune One there over half the Fortune 100. I've used the, the tools that Robel provides just as much as 250,000 developers. And so over a quarter million engineers finding and developing and creating various apps, and I mean, those apps are, are, are far and wide.[00:18:49] Just as you mentioned. I mean everything from say, like, one I like to talk about was like sushi detection of like finding the like right sorts of fish and ingredients that are in a given piece of, of sushi that you're looking at to say like roof estimation of like finding. If there's like, uh, hail damage on, on a given roof, of course, self-driving cars and understanding the scenes around us is sort of the, you know, very early computer vision everywhere.[00:19:13] Use case hardhat detection, like finding out if like a given workplace is, is, is safe, uh, disseminate, have the right p p p on or p p e on, are there the right distance from various machines? A huge place that vision has been used is environmental monitoring. Uh, what's the count of species? Can we verify that the environment's not changing in unexpected ways or like river banks are become, uh, becoming recessed in ways that we anticipate from satellite imagery, plant phenotyping.[00:19:37] I mean, people have used these apps for like understanding their plants and identifying them. And that dataset that's actually largely open, which is what's given a proliferation to the iNaturalist, is, is that whole, uh, hub of, of products. Lots of, um, people that do manufacturing. So, like Rivian for example, is a Rubal customer, and you know, they're trying to scale from 1000 cars to 25,000 cars to a hundred thousand cars in very short order.[00:20:00] And that relies on having the. Ability to visually ensure that every part that they're making is produced correctly and right in time. Medical use cases. You know, there's actually, this morning I was emailing with a user who's accelerating early cancer detection through breaking apart various parts of cells and doing counts of those cells.[00:20:23] And actually a lot of wet lab work that folks that are doing their PhDs or have done their PhDs are deeply familiar with that is often required to do very manually of, of counting, uh, micro plasms or, or things like this. There's. All sorts of, um, like traffic counting and smart cities use cases of understanding curb utilization to which sort of vehicles are, are present.[00:20:44] Uh, ooh. That can be[00:20:46] really good for city planning actually.[00:20:47] Yeah. I mean, one of our customers does exactly this. They, they measure and do they call it like smart curb utilization, where uhhuh, they wanna basically make a curb be almost like a dynamic space where like during these amounts of time, it's zoned for this during these amounts of times.[00:20:59] It's zoned for this based on the flows and e ebbs and flows of traffic throughout the day. So yeah, I mean the, the, the truth is that like, you're right, it's like a developer might be like, oh, how would I use vision? And then all of a sudden it's like, oh man, all these things are at my fingertips. Like I can just, everything you can see.[00:21:13] Yeah. Right. I can just, I can just add functionality for my app to understand and ingest the way, like, and usually the way that someone gets like almost nerd sniped into this is like, they have like a home automation project, so it's like send Yeah. Give us a few. Yeah. So send me a text when, um, a package shows up so I can like prevent package theft so I can like go down and grab it right away or.[00:21:29] We had a, uh, this one's pretty, pretty niche, but it's pretty funny. There was this guy who, during the pandemic wa, wanted to make sure his cat had like the proper, uh, workout. And so I've shared the story where he basically decided that. He'd make a cat workout machine with computer vision, you might be alone.[00:21:43] You're like, what does that look like? Well, what he decided was he would take a robotic arm strap, a laser pointer to it, and then train a machine to recognize his cat and his cat only, and point the laser pointer consistently 10 feet away from the cat. There's actually a video of you if you type an YouTube cat laser turret, you'll find Dave's video.[00:22:01] Uh, and hopefully Dave's cat has, has lost the weight that it needs to, cuz that's just the, that's an intense workout I have to say. But yeah, so like, that's like a, um, you know, these, uh, home automation projects are pretty common places for people to get into smart bird feeders. I've seen people that like are, are logging and understanding what sort of birds are, uh, in their background.[00:22:18] There's a member of our team that was working on actually this as, as a whole company and has open sourced a lot of the data for doing bird species identification. And now there's, I think there's even a company that's, uh, founded to create like a smart bird feeder, like captures photos and tells you which ones you've attracted to your yard.[00:22:32] I met that. Do, you know, get around the, uh, car sharing company that heard it? Them never used them. They did a SPAC last year and they had raised at like, They're unicorn. They raised at like 1.2 billion, I think in the, the prior round and inspected a similar price. I met the CTO of, of Getaround because he was, uh, using Rob Flow to hack into his Tesla cameras to identify other vehicles that are like often nearby him.[00:22:56] So he's basically building his own custom license plate recognition, and he just wanted like, keep, like, keep tabs of like, when he drives by his friends or when he sees like regular sorts of folks. And so he was doing like automated license plate recognition by tapping into his, uh, camera feeds. And by the way, Elliot's like one of the like OG hackers, he was, I think one of the very first people to like, um, she break iPhones and, and these sorts of things.[00:23:14] Mm-hmm. So yeah, the project that I want, uh, that I'm gonna work on right now for my new place in San Francisco is. There's two doors. There's like a gate and then the other door. And sometimes we like forget to close, close the gate. So like, basically if it sees that the gate is open, it'll like send us all a text or something like this to make sure that the gate is, is closed at the front of our house.[00:23:32] That's[00:23:32] really cool. And I'll, I'll call out one thing that readers and listeners can, uh, read out on, on your history. One of your most popular initial, um, viral blog post was about, um, autonomous vehicle data sets and how, uh, the one that Udacity was using was missing like one third of humans. And, uh, it's not, it's pretty problematic for cars to miss humans.[00:23:53] Yeah, yeah, actually, so yeah, the Udacity self-driving car data set, which look to their credit, it was just meant to be used for, for academic use. Um, and like as a part of courses on, on Udacity, right? Yeah. But the, the team that released it, kind of hastily labeled and let it go out there to just start to use and train some models.[00:24:11] I think that likely some, some, uh, maybe commercial use cases maybe may have come and, and used, uh, the dataset, who's to say? But Brad and I discovered this dataset. And when we were working on dataset improvement tools at Rob Flow, we ran through our tools and identified some like pretty, as you mentioned, key issues.[00:24:26] Like for example, a lot of strollers weren't labeled and I hope our self-driving cars do those, these sorts of things. And so we relabeled the whole dataset by hand. I have this very fond memory is February, 2020. Brad and I are in Taiwan. So like Covid is actually just, just getting going. And the reason we were there is we were like, Hey, we can work on this from anywhere for a little bit.[00:24:44] And so we spent like a, uh, let's go closer to Covid. Well, you know, I like to say we uh, we got early indicators of, uh, how bad it was gonna be. I bought a bunch of like N 90 fives before going o I remember going to the, the like buying a bunch of N 95 s and getting this craziest look like this like crazy tin hat guy.[00:25:04] Wow. What is he doing? And then here's how you knew. I, I also got got by how bad it was gonna be. I left all of them in Taiwan cuz it's like, oh, you all need these. We'll be fine over in the us. And then come to find out, of course that Taiwan was a lot better in terms of, um, I think, yeah. Safety. But anyway, we were in Taiwan because we had planned this trip and you know, at the time we weren't super sure about the, uh, covid, these sorts of things.[00:25:22] We always canceled it. We didn't, but I have this, this very specific time. Brad and I were riding on the train from Clay back to Taipei. It's like a four hour ride. And you mentioned Pioneer earlier, we were competing in Pioneer, which is almost like a gamified to-do list. Mm-hmm. Every week you say what you're gonna do and then other people evaluate.[00:25:37] Did you actually do the things you said you were going to do? One of the things we said we were gonna do was like this, I think re-release of this data set. And so it's like late, we'd had a whole week, like, you know, weekend behind us and, uh, we're on this train and it was very unpleasant situation, but we relabeled this, this data set, and one sitting got it submitted before like the Sunday, Sunday countdown clock starts voting for, for.[00:25:57] And, um, once that data got out back out there, just as you mentioned, it kind of picked up and Venture beat, um, noticed and wrote some stories about it. And we really rereleased of course, the data set that we did our best job of labeling. And now if anyone's listening, they can probably go out and like find some errors that we surely still have and maybe call us out and, you know, put us, put us on blast.[00:26:15] The Economics of Annotation (Segmentation)[00:26:15] But,[00:26:16] um, well, well the reason I like this story is because it, it draws attention to the idea that annotation is difficult and basically anyone looking to use computer vision in their business who may not have an off-the-shelf data set is going to have to get involved in annotation. And I don't know what it costs.[00:26:34] And that's probably one of the biggest hurdles for me to estimate how big a task this is. Right? So my question at a higher level is tell the customers, how do you tell customers to estimate the economics of annotation? Like how many images do, do we need? How much, how long is it gonna take? That, that kinda stuff.[00:26:50] How much money and then what are the nuances to doing it well, right? Like, cuz obviously Udacity had a poor quality job, you guys had proved it, and there's errors every everywhere. Like where do[00:26:59] these things go wrong? The really good news about annotation in general is that like annotation of course is a means to an end to have a model be able to recognize a thing.[00:27:08] Increasingly there's models that are coming out that can recognize things zero shot without any annotation, which we're gonna talk about. Yeah. Which, we'll, we'll talk more about that in a moment. But in general, the good news is that like the trend is that annotation is gonna become decreasingly a blocker to starting to use computer vision in meaningful ways.[00:27:24] Now that said, just as you mentioned, there's a lot of places where you still need to do. Annotation. I mean, even with these zero shot models, they might have of blind spots, or maybe you're a business, as you mentioned, that you know, it's proprietary data. Like only Rivian knows what a rivian is supposed to look like, right?[00:27:39] Uh, at the time of, at the time of it being produced, like underneath the hood and, and all these sorts of things. And so, yeah, that's gonna necessarily require annotation. So your question of how long is it gonna take, how do you estimate these sorts of things, it really comes down to the complexity of the problem that you're solving and the amount of variance in the scene.[00:27:57] So let's give some contextual examples. If you're trying to recognize, we'll say a scratch on one specific part and you have very strong lighting. You might need fewer images because you control the lighting, you know the exact part and maybe you're lucky in the scratch. Happens more often than not in similar parts or similar, uh, portions of the given part.[00:28:17] So in that context, you, you, the function of variance, the variance is, is, is lower. So the number of images you need is also lower to start getting up to work. Now the orders of magnitude we're talking about is that like you can have an initial like working model from like 30 to 50 images. Yeah. In this context, which is shockingly low.[00:28:32] Like I feel like there's kind of an open secret in computer vision now, the general heuristic that often. Users, is that like, you know, maybe 200 images per class is when you start to have a model that you can rely[00:28:45] on? Rely meaning like 90, 99, 90, 90%, um,[00:28:50] uh, like what's 85 plus 85? Okay. Um, that's good. Again, these are very, very finger in the wind estimates cuz the variance we're talking about.[00:28:59] But the real question is like, at what point, like the framing is not like at what point do it get to 99, right? The framing is at what point can I use this thing to be better than the alternative, which is humans, which maybe humans or maybe like this problem wasn't possible at all. And so usually the question isn't like, how do I get to 99?[00:29:15] A hundred percent? It's how do I ensure that like the value I am able to get from putting this thing in production is greater than the alternative? In fact, even if you have a model that's less accurate than humans, there might be some circumstances where you can tolerate, uh, a greater amount of inaccuracy.[00:29:32] And if you look at the accuracy relative to the cost, Using a model is extremely cheap. Using a human for the same sort of task can be very expensive. Now, in terms of the actual accuracy of of what you get, there's probably some point at which the cost, but relative accuracy exceeds of a model, exceeds the high cost and hopefully high accuracy of, of a human comparable, like for example, there's like cameras that will track soccer balls or track events happening during sporting matches.[00:30:02] And you can go through and you know, we actually have users that work in sports analytics. You can go through and have a human. Hours and hours of footage. Cuz not just watching their team, they're watching every other team, they're watching scouting teams, they're watching junior teams, they're watching competitors.[00:30:15] And you could have them like, you know, track and follow every single time the ball goes within blank region of the field or every time blank player goes into, uh, this portion of the field. And you could have, you know, exact, like a hundred percent accuracy if that person, maybe, maybe not a hundred, a human may be like 95, 90 7% accuracy of every single time the ball is in this region or this player is on the field.[00:30:36] Truthfully, maybe if you're scouting analytics, you actually don't need 97% accuracy of knowing that that player is on the field. And in fact, if you can just have a model run at a 1000th, a 10000th of the cost and goes through and finds all the times that Messi was present on the field mm-hmm. That the ball was in this region of the.[00:30:54] Then even if that model is slightly less accurate, the cost is just so orders of magnitude different. And the stakes like the stakes of this problem, of knowing like the total number of minutes that Messi played will say are such that we have a higher air tolerance, that it's a no-brainer to start to use Yeah, a computer vision model in this context.[00:31:12] So not every problem requires equivalent or greater human performance. Even when it does, you'd be surprised at how fast models get there. And in the times when you, uh, really look at a problem, the question is, how much accuracy do I need to start to get value from this? This thing, like the package example is a great one, right?[00:31:27] Like I could in theory set up a camera that's constantly watching in front of my porch and I could watch the camera whenever I have a package and then go down. But of course, I'm not gonna do that. I value my time to do other sorts of things instead. And so like there, there's this net new capability of, oh, great, I can have an always on thing that tells me when a package shows up, even if you know the, the thing that's gonna text me.[00:31:46] When a package shows up, let's say a flat pack shows up instead of a box and it doesn't know what a flat pack likes, looks like initially. Doesn't matter. It doesn't matter because I didn't have this capability at all before. And I think that's the true case where a lot of computer vision problems exist is like it.[00:32:00] It's like you didn't even have this capability, this superpower before at all, let alone assigning a given human to do the task. And that's where we see like this explosion of, of value.[00:32:10] Awesome. Awesome. That was a really good overview. I want to leave time for the others, but I, I really want to dive into a couple more things with regards to Robo Flow.[00:32:17] Computer Vision Annotation Formats[00:32:17] So one is, apparently your original pitch for Robo Flow was with regards to conversion tools for computer vision data sets. And I'm sure as, as a result of your job, you have a lot of rants. I've been digging for rants basically on like the best or the worst annotation formats. What do we know? Cause most of us, oh my gosh, we only know, like, you know, I like,[00:32:38] okay, so when we talk about computer vision annotation formats, what we're talking about is if you have an image and you, you picture a boing box around my face on that image.[00:32:46] Yeah. How do you describe where that Monty box is? X, Y, Z X Y coordinates. Okay. X, y coordinates. How, what do you mean from the top lefts.[00:32:52] Okay. You, you, you, you take X and Y and then, and then the. The length and, and the width of the, the[00:32:58] box. Okay. So you got like a top left coordinate and like the bottom right coordinate or like the, the center of the bottom.[00:33:02] Yeah. Yeah. Top, left, bottom right. Yeah. That's one type of format. Okay. But then, um, I come along and I'm like, you know what? I want to do a different format where I wanna just put the center of the box, right. And give the length and width. Right. And by the way, we didn't even talk about what X and Y we're talking about.[00:33:14] Is X a pixel count? Is a relative pixel count? Is it an absolute pixel count? So the point is, the number of ways to describe where a box lives in a freaking image is endless, uh, seemingly and. Everyone decided to kind of create their own different ways of describing the coordinates and positions of where in this context of bounding Box is present.[00:33:39] Uh, so there's some formats, for example, that like use re, so for the x and y, like Y is, uh, like the left, most part of the image is zero. And the right most part of the image is one. So the, the coordinate is like anywhere from zero to one. So 0.6 is, you know, 60% of your way right up the image to describe the coordinate.[00:33:53] I guess that was, that was X instead of Y. But the point is there, of the zero to one is the way that we determined where that was in the position, or we're gonna do an absolute pixel position anyway. We got sick, we got sick of all these different annotation formats. So why do you even have to convert between formats?[00:34:07] Is is another part of this, this story. So different training frameworks, like if you're using TensorFlow, you need like TF Records. If you're using PyTorch, it's probably gonna be, well it depends on like what model you're using, but someone might use Coco JSON with PyTorch. Someone else might use like a, just a YAML file and a text file.[00:34:21] And to describe the cor it's point is everyone that creates a model. Or creates a dataset rather, has created different ways of describing where and how a bounding box is present in the image. And we got sick of all these different formats and doing these in writing all these different converter scripts.[00:34:39] And so we made a tool that just converts from one script, one type of format to another. And the, the key thing is that like if you get that converter script wrong, your model doesn't not work. It just fails silently. Yeah. Because the bounding boxes are now all in the wrong places. And so you need a way to visualize and be sure that your converter script, blah, blah blah.[00:34:54] So that was the very first tool we released of robo. It was just a converter script, you know, like these, like these PDF to word converters that you find. It was basically that for computer vision, like dead simple, really annoying thing. And we put it out there and people found some, some value in, in that.[00:35:08] And you know, to this day that's still like a surprisingly painful[00:35:11] problem. Um, yeah, so you and I met at the Dall-E Hackathon at OpenAI, and we were, I was trying to implement this like face masking thing, and I immediately ran into that problem because, um, you know, the, the parameters that Dall-E expected were different from the one that I got from my face, uh, facial detection thing.[00:35:28] One day it'll go away, but that day is not today. Uh, the worst format that we work with is, is. The mart form, it just makes no sense. And it's like, I think, I think it's a one off annotation format that this university in China started to use to describe where annotations exist in a book mart. I, I don't know, I dunno why that So best[00:35:45] would be TF record or some something similar.[00:35:48] Yeah, I think like, here's your chance to like tell everybody to use one one standard and like, let's, let's, can[00:35:53] I just tell them to use, we have a package that does this for you. I'm just gonna tell you to use the row full package that converts them all, uh, for you. So you don't have to think about this. I mean, Coco JSON is pretty good.[00:36:04] It's like one of the larger industry norms and you know, it's in JS O compared to like V xml, which is an XML format and Coco json is pretty descriptive, but you know, it has, has its own sort of drawbacks and flaws and has random like, attribute, I dunno. Um, yeah, I think the best way to handle this problem is to not have to think about it, which is what we did.[00:36:21] We just created a, uh, library that, that converts and uses things. Uh, for us. We've double checked the heck out of it. There's been hundreds of thousands of people that have used the library and battle tested all these different formats to find those silent errors. So I feel pretty good about no longer having to have a favorite format and instead just rely on.[00:36:38] Dot load in the format that I need. Great[00:36:41] Intro to Computer Vision Segmentation[00:36:41] service to the community. Yeah. Let's go into segmentation because is at the top of everyone's minds, but before we get into segment, anything, I feel like we need a little bit of context on the state-of-the-art prior to Sam, which seems to be YOLO and uh, you are the leading expert as far as I know.[00:36:56] Yeah.[00:36:57] Computer vision, there's various task types. There's classification problems where we just like assign tags to images, like, you know, maybe safe work, not safe work, sort of tagging sort of stuff. Or we have object detection, which are the boing boxes that you see and all the formats I was mentioning in ranting about there's instant segmentation, which is the polygon shapes and produces really, really good looking demos.[00:37:19] So a lot of people like instant segmentation.[00:37:21] This would be like counting pills when you point 'em out on the, on the table. Yeah. So, or[00:37:25] soccer players on the field. So interestingly, um, counting you could do with bounding boxes. Okay. Cause you could just say, you know, a box around a person. Well, I could count, you know, 12 players on the field.[00:37:35] Masks are most useful. Polygons are most useful if you need very precise area measurements. So you have an aerial photo of a home and you want to know, and the home's not a perfect box, and you want to know the rough square footage of that home. Well, if you know the distance between like the drone and, and the ground.[00:37:53] And you have the precise polygon shape of the home, then you can calculate how big that home is from aerial photos. And then insurers can, you know, provide say accurate estimates and that's maybe why this is useful. So polygons and, and instant segmentation are, are those types of tasks? There's a key point detection task and key point is, you know, if you've seen those demos of like all the joints on like a hand kind of, kind of outlined, there's visual question answering tasks, visual q and a.[00:38:21] And that's like, you know, some of the stuff that multi-modality is absolutely crushing for, you know, here's an image, tell me what food is in this image. And then you can pass that and you can make a recipe out of it. But like, um, yeah, the visual question in answering task type is where multi-modality is gonna have and is already having an enormous impact.[00:38:40] So that's not a comprehensive survey, very problem type, but it's enough to, to go into why SAM is significant. So these various task types, you know, which model to use for which given circumstance. Most things is highly dependent on what you're ultimately aiming to do. Like if you need to run a model on the edge, you're gonna need a smaller model, cuz it is gonna run on edge, compute and process in, in, in real time.[00:39:01] If you're gonna run a model on the cloud, then of course you, uh, generally have more compute at your disposal Considerations like this now, uh,[00:39:08] YOLO[00:39:08] just to pause. Yeah. Do you have to explain YOLO first before you go to Sam, or[00:39:11] Yeah, yeah, sure. So, yeah. Yeah, we should. So object detection world. So for a while I talked about various different task types and you can kinda think about a slide scale of like classification, then obvious detection.[00:39:20] And on the right, at most point you have like segmentation tasks. Object detection. The bounding boxes is especially useful for a wide, like it's, it's surprisingly versatile. Whereas like classification is kind of brittle. Like you only have a tag for the whole image. Well, that doesn't, you can't count things with tags.[00:39:35] And on the other hand, like the mask side of things, like drawing masks is painstaking. And so like labeling is just a bit more difficult. Plus like the processing to produce masks requires more compute. And so usually a lot of folks kind of landed for a long time on obvious detection being a really happy medium of affording you with rich capabilities because you can do things like count, track, measure.[00:39:56] In some CAGR context with bounding boxes, you can see how many things are present. You can actually get a sense of how fast something's moving by tracking the object or bounding box across multiple frames and comparing the timestamp of where it was across those frames. So obviously detection is a very common task type that solves lots of things that you want do with a given model.[00:40:15] In obviously detection. There's been various model frameworks over time. So kind of really early on there's like R-CNN uh, then there's faster rc n n and these sorts of family models, which are based on like resnet kind of architectures. And then a big thing happens, and that is single shot detectors. So faster, rc n n despite its name is, is very slow cuz it takes two passes on the image.[00:40:37] Uh, the first pass is, it finds par pixels in the image that are most interesting to, uh, create a bounding box candidate out of. And then it passes that to a, a classifier that then does classification of the bounding box of interest. Right. Yeah. You can see, you can see why that would be slow. Yeah. Cause you have to do two passes.[00:40:53] You know, kind of actually led by, uh, like mobile net was I think the first large, uh, single shot detector. And as its name implies, it was meant to be run on edge devices and mobile devices and Google released mobile net. So it's a popular implementation that you find in TensorFlow. And what single shot detectors did is they said, Hey, instead of looking at the image twice, what if we just kind of have a, a backbone that finds candidate bounding boxes?[00:41:19] And then we, we set loss functions for objectness. We set loss function. That's a real thing. We set loss functions for objectness, like how much obj, how object do this part of the images. We send a loss function for classification, and then we run the image through the model on a single pass. And that saves lots of compute time and you know, it's not necessarily as accurate, but if you have lesser compute, it can be extremely useful.[00:41:42] And then the advances in both modeling techniques in compute and data quality, single shot detectors, SSDs has become, uh, really, really popular. One of the biggest SSDs that has become really popular is the YOLO family models, as you described. And so YOLO stands for you only look once. Yeah, right, of course.[00:42:02] Uh, Drake's, uh, other album, um, so Joseph Redman introduces YOLO at the University of Washington. And Joseph Redman is, uh, kind of a, a fun guy. So for listeners, for an Easter egg, I'm gonna tell you to Google Joseph Redman resume, and you'll find, you'll find My Little Pony. That's all I'll say. And so he introduces the very first YOLO architecture, which is a single shot detector, and he also does it in a framework called Darknet, which is like this, this own framework that compiles the Cs, frankly, kind of tough to work with, but allows you to benefit from the speedups that advance when you operate in a low level language like.[00:42:36] And then he releases, well, what colloquially is known as YOLO V two, but a paper's called YOLO 9,000 cuz Joseph Redmond thought it'd be funny to have something over 9,000. So get a sense for, yeah, some fun. And then he releases, uh, YOLO V three and YOLO V three is kind of like where things really start to click because it goes from being an SSD that's very limited to competitive and, and, and superior to actually mobile That and some of these other single shot detectors, which is awesome because you have this sort of solo, I mean, him and and his advisor, Ali, at University of Washington have these, uh, models that are becoming really, really powerful and capable and competitive with these large research organizations.[00:43:09] Joseph Edmond leaves Computer Vision Research, but there had been Alexia ab, one of the maintainers of Darknet released Yola VI four. And another, uh, researcher, Glenn Yer, uh, jocker had been working on YOLO V three, but in a PyTorch implementation, cuz remember YOLO is in a dark implementation. And so then, you know, YOLO V three and then Glenn continues to make additional improvements to YOLO V three and pretty soon his improvements on Yolov theory, he's like, oh, this is kind of its own things.[00:43:36] Then he releases YOLO V five[00:43:38] with some naming[00:43:39] controversy that we don't have Big naming controversy. The, the too long didn't read on the naming controversy is because Glen was not originally involved with Darknet. How is he allowed to use the YOLO moniker? Roe got in a lot of trouble cuz we wrote a bunch of content about YOLO V five and people were like, ah, why are you naming it that we're not?[00:43:55] Um, but you know,[00:43:56] cool. But anyway, so state-of-the-art goes to v8. Is what I gather.[00:44:00] Yeah, yeah. So yeah. Yeah. You're, you're just like, okay, I got V five. I'll skip to the end. Uh, unless, unless there's something, I mean, I don't want, well, so I mean, there's some interesting things. Um, in the yolo, there's like, there's like a bunch of YOLO variants.[00:44:10] So YOLOs become this, like this, this catchall for various single shot, yeah. For various single shot, basically like runs on the edge, it's quick detection framework. And so there's, um, like YOLO R, there's YOLO S, which is a transformer based, uh, yolo, yet look like you only look at one sequence is what s stands were.[00:44:27] Um, the pp yo, which, uh, is PAT Paddle implementation, which is by, which Chinese Google is, is their implementation of, of TensorFlow, if you will. So basically YOLO has like all these variants. And now, um, yo vii, which is Glen has been working on, is now I think kind of like, uh, one of the choice models to use for single shot detection.[00:44:44] World Knowledge of Foundation Models[00:44:44] Well, I think a lot of those models, you know, Asking the first principal's question, like let's say you wanna find like a bus detector. Do you need to like go find a bunch of photos of buses or maybe like a chair detector? Do you need to go find a bunch of photos of chairs? It's like, oh no. You know, actually those images are present not only in the cocoa data set, but those are objects that exist like kind of broadly on the internet.[00:45:02] And so computer visions kind of been like us included, have been like really pushing for and encouraging models that already possess a lot of context about the world. And so, you know, if GB T's idea and i's idea OpenAI was okay, models can only understand things that are in their corpus. What if we just make their corpus the size of everything on the internet?[00:45:20] The same thing that happened in imagery, what's happening now? And that's kinda what Sam represents, which is kind of a new evolution of, earlier on we were talking about the cost of annotation and I said, well, good news. Annotations then become decreasingly necessary to start to get to value. Now you gotta think about it more, kind of like, you'll probably need to do some annotation because you might want to find a custom object, or Sam might not be perfect, but what's about to happen is a big opportunity where you want the benefits of a yolo, right?[00:45:47] Where it can run really fast, it can run on the edge, it's very cheap. But you want the knowledge of a large foundation model that already knows everything about buses and knows everything about shoes, knows everything about real, if the name is true, anything segment, anything model. And so there's gonna be this novel opportunity to take what these large models know, and I guess it's kind of like a form of distilling, like distill them down into smaller architectures that you can use in versatile ways to run in real time to run on the edge.[00:46:13] And that's now happening. And what we're seeing in actually kind of like pulling that, that future forward with, with, with Robo Flow.[00:46:21] Segment Anything Model[00:46:21] So we could talk a bit about, um, about SAM and what it represents maybe into, in relation to like these, these YOLO models. So Sam is Facebook segment Everything Model. It came out last week, um, the first week of April.[00:46:34] It has 24,000 GitHub stars at the time of, of this recording within its first week. And why, what does it do? Segment? Everything is a zero shot segmentation model. And as we're describing, creating masks is a very arduous task. Creating masks of objects that are not already represented means you have to go label a bunch of masks and then train a model and then hope that it finds those masks in new images.[00:47:00] And the promise of Segment anything is that in fact you just pass at any image and it finds all of the masks of relevant things that you might be curious about finding in a given image. And it works remarkably. Segment anything in credit to Facebook and the fair Facebook research team, they not only released the model permissive license to move things forward, they released the full data set, all 11 million images and 1.1 billion segmentation masks and three model sizes.[00:47:29] The largest ones like 2.5 gigabytes, which is not enormous. Medium ones like 1.2 and the smallest one is like 400, 3 75 megabytes. And for context,[00:47:38] for, for people listening, that's six times more than the previous alternative, which, which is apparently open images, uh, in terms of number images, and then 400 times more masks than open[00:47:47] images as well.[00:47:48] Exactly, yeah. So huge, huge order magnitude gain in terms of dataset accessibility plus like the model and how it works. And so the question becomes, okay, so like segment. What, what do I do with this? Like, what does it allow me to do? And it didn't Rob float well. Yeah, you should. Yeah. Um, it's already there.[00:48:04] You um, that part's done. Uh, but the thing that you can do with segment anything is you can almost, like, I almost think about like this, kinda like this model arbitrage where you can basically like distill down a giant model. So let's say like, like let's return to the package example. Okay. The package problem of, I wanna get a text when a package appears on my front porch before segment anything.[00:48:25] The way that I would go solve this problem is I would go collect some images of packages on my porch and I would label them, uh, with bounding boxes or maybe masks in that part. As you mentioned, it can be a long process and I would train a model. And that model it actually probably worked pretty well cause it's purpose-built.[00:48:44] The camera position, my porch, the packages I'm receiving. But that's gonna take some time, like everything that I just mentioned the
We're trying a new format, inspired by Acquired.fm! No guests, no news, just highly prepared, in-depth conversation on one topic that will level up your understanding. We aren't experts, we are learning in public. Please let us know what we got wrong and what you think of this new format!When you ask someone to break down the basic ingredients of a Large Language Model, you'll often hear a few things: You need lots of data. You need lots of compute. You need models with billions of parameters. Trust the Bitter Lesson, more more more, scale is all you need. Right?Nobody ever mentions the subtle influence of great benchmarking.LLM Benchmarks mark our progress in building artificial intelligences, progressing from * knowing what words go with others (1985 WordNet)* recognizing names and entities (2004 Enron Emails) * and image of numbers, letters, and clothes (1998-2017 MNIST)* language translation (2002 BLEU → 2020 XTREME)* more and more images (2009 ImageNet, CIFAR)* reasoning in sentences (2016 LAMBADA) and paragraphs (2019 AI2RC, DROP)* stringing together whole sentences (2018 GLUE and SuperGLUE)* question answering (2019 CoQA)* having common sense (2018 Swag and HellaSwag, 2019 WinoGrande)* knowledge of all human tasks and professional exams (2021 MMLU)* knowing everything (2022 BIG-Bench)People who make benchmarks are the unsung heroes of LLM research, because they dream up ever harder tests that last ever shorter periods of time.In our first AI Fundamentals episode, we take a trek through history to try to explain what we have learned about LLM Benchmarking, and what issues we have discovered with them. There are way, way too many links and references to include in this email. You can follow along the work we did for our show prep in this podcast's accompanying repo, with all papers and selected tests pulled out.Enjoy and please let us know what other fundamentals topics you'd like us to cover!Timestamps* [00:00:21] Benchmarking Questions* [00:03:08] Why AI Benchmarks matter* [00:06:02] Introducing Benchmark Metrics* [00:08:14] Benchmarking Methodology* [00:09:45] 1985-1989: WordNet and Entailment* [00:12:44] 1998-2004 Enron Emails and MNIST* [00:14:35] 2009-14: ImageNet, CIFAR and the AlexNet Moment for Deep Learning* [00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference* [00:23:21] 2018-19: Swag and HellaSwag - Common Sense Inference* [00:26:07] Aside: How to Design Benchmarks* [00:26:51] 2021: MMLU - Human level Professional Knowledge* [00:29:39] 2021: HumanEval - Code Generation* [00:31:51] 2020: XTREME - Multilingual Benchmarks* [00:35:14] 2022: BIG-Bench - The Biggest of the Benches* [00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results* [00:38:25] Issue: GPT4 vs the mystery of the AMC10/12* [00:40:28] Issue: Data Contamination* [00:42:13] Other Issues: Benchmark Data Quality and the Iris data set* [00:45:44] Tradeoffs of Latency, Inference Cost, Throughput* [00:49:45] ConclusionTranscript[00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO and residence at Decibel Partners, and I'm joined by my co-host, swyx writer and editor of Latent Space.[00:00:21] Benchmarking Questions[00:00:21] Up until today, we never verified that we're actually humans to you guys. So we'd have one good thing to do today would be run ourselves through some AI benchmarks and see if we are humans.[00:00:31] Indeed. So, since I got you here, Sean, I'll start with one of the classic benchmark questions, which is what movie does this emoji describe? The emoji set is little Kid Bluefish yellow, bluefish orange Puffer fish. One movie does that. I think if you added an octopus, it would be slightly easier. But I prepped this question so I know it's finding Nemo.[00:00:57] You are so far a human. Second one of these emoji questions instead, depicts a superhero man, a superwoman, three little kids, one of them, which is a toddler. So you got this one too? Yeah. It's one of my favorite movies ever. It's the Incredibles. Uh, second one was kind of a letdown, but the first is a.[00:01:17] Awesome. Okay, I'm gonna ramp it up a little bit. So let's ask something that involves a little bit of world knowledge. So when you drop a ball from rest, it accelerates downward at 9.8 meters per second if you throw it downward instead, assuming no air resistance, so you're throwing it down instead of dropping it, it's acceleration immediately after leaving your hand is a 9.8 meters per second.[00:01:38] B, more than 9.8 meters per second. C less than 9.8 meters per second. D cannot say unless the speed of the throw is. I would say B, you know, I started as a physics major and then I changed, but I think I, I got enough from my first year. That is B Yeah. Even proven that you're human cuz you got it wrong.[00:01:56] Whereas the AI got it right is 9.8 meters per second. The gravitational constant, uh, because you are no longer accelerating after you leave the hand. The question says if you throw it downward after leaving your hand, what is the. It is, it goes back to the gravitational constant, which is 9.8 meters per, I thought you said you were a physics major.[00:02:17] That's why I changed. So I'm a human. I'm a human. You're human. You're human. But you, you got them all right. So I can't ramp it up. I can't ramp it up. So, Assuming, uh, the AI got all of that right, you would think that AI will get this one wrong. Mm-hmm. Because it's just predicting the next token, right?[00:02:31] Right. In the complex Z plane, the set of points satisfying the equation. Z squared equals modulars. Z squared is A, a pair points B circle, C, a half line D, online D square. The processing is, this is going on in your head. You got minus three. A line. This is hard. Yes, that is. That is a line. Okay. What's funny is that I think if, if an AI was doing this, it would take the same exact amount of time to answer this as it would every single other word.[00:03:05] Cuz it's computationally the same to them. Right.[00:03:08] Why AI Benchmarks matter[00:03:08] Um, so anyway, if you haven't caught on today, we're doing our first, uh, AI fundamentals episode, which just the two of us, no guess because we wanted to go deep on one topic and the topic. AI benchmarks. So why are we focusing on AI benchmarks? So, GPT4 just came out last week and every time a new model comes out, All we hear about is it's so much better than the previous model on benchmark X, on benchmark Y.[00:03:33] It performs better on this, better on that. But most people don't actually know what actually goes on under these benchmarks. So we thought it would be helpful for people to put these things in context. And also benchmarks evolved. Like the more the models improve, the harder the benchmarks get. Like I couldn't even get one of the questions right.[00:03:52] So obviously they're working and you'll see that. From the 1990s where some of the first ones came out to day, the, the difficulty of them is truly skyrocketed. So we wanna give a, a brief history of that and leave you with a mental model on, okay, what does it really mean to do well at X benchmark versus Y benchmark?[00:04:13] Um, so excited to add that in. I would also say when you ask people what are the ingredients going into a large language model, they'll talk to you about the data. They'll talk to you about the neural nets, they'll talk to you about the amount of compute, you know, how many GPUs are getting burned based on this.[00:04:30] They never talk to you about the benchmarks. And it's actually a shame because they're so influential. Like that is the entirety of how we judge whether a language model is better than the other. Cuz a language model can do anything out of. Potentially infinite capabilities. How do you judge one model versus another?[00:04:48] How do you know you're getting better? And so I think it's an area of intense specialization. Also, I think when. Individuals like us, you know, we sort of play with the language models. We are basically doing benchmarks. We're saying, look, it's, it's doing this awesome thing that I found. Guess what? There have been academics studying this for 20 years who have, uh, developed a science to this, and we can actually benefit from studying what they have done.[00:05:10] Yep. And obviously the benchmarks also drive research, you know, in a way whenever you're working on, in a new model. Yeah. The benchmark kind of constraints what you're optimizing for in a way. Because if you've read a paper and it performs worse than all the other models, like you're not gonna publish it.[00:05:27] Yeah. So in a way, there's bias in the benchmark itself. Yeah. Yeah. We'll talk a little bit about that. Right. Are we optimizing for the right things when we over-optimize for a single benchmark over over some others? And also curiously, when GPT4 was released, they emitted some very. Commonplace industry benchmarks.[00:05:44] So the way that you present yourself, it is a form of marketing. It is a form of trying to say you're better than something else. And, and trying to explain where you think you, you do better. But it's very hard to verify as well because there are certain problems with reproducing benchmarks, uh, especially when you come to large language models.[00:06:02] Introducing Benchmark Metrics[00:06:02] So where do we go from here? Should we go over the, the major concept? Yeah. When it comes to benchmark metrics, we get three main measures. Accuracy, precision, recall accuracy is just looking at how many successful prediction the model does. Precision is the ratio of true positives, meaning how many of them are good compared to the overall amount of predictions made Versus recall is what proportion of the positives were identified.[00:06:31] So if you think. Spotify playlist to maybe make it a little more approachable, precision is looking. How many songs in a Spotify playlist did you like versus recall is looking at of all the Spotify songs that you like in the word, how many of them were put in the in the playlist? So it's more looking at how many of the true positives can you actually bring into the model versus like more focusing on just being right.[00:06:57] And the two things are precision and recall are usually in tension.. If you're looking for a higher position, you wanna have a higher percentage of correct results. You're usually bringing recall down because you lead to kind of like lower response sets, you know, so there's always trade offs. And this is a big part of the benchmarking too.[00:07:20] You know, what do you wanna optimize for? And most benchmarks use this, um, F1 score, which is the harmonic mean of precision and recall. Which is, you know, we'll put it in the show notes, but just like two times, like the, you know, precision Times Recall divided by the sum. So that's one. And then you get the Stanford Helm metrics.[00:07:38] Um, yeah, so ultimately I think we have advanced a lot in the, in the past few decades on how we measure language models. And the most interesting one came out January of this year from Percy Lang's research lab at Stanford, and he's got. A few metrics, accuracy, calibration, robustness, fairness, efficiency, general information bias and toxicity, and caring that your language models are not toxic and not biased.[00:08:03] So is is, mm-hmm. Kind of a new thing because we have solved the other stuff, therefore we get to care about the toxic of, uh, the language models yelling at us.[00:08:14] Benchmarking Methodology[00:08:14] But yeah, I mean, maybe we can also talk about the other forms of how their be. Yeah, there's three main modes. You can need a benchmark model in a zero shot fashion, few shot or fine tune models, zero shots.[00:08:27] You do not provide any example and you're just testing how good the model is at generalizing few shots, you have a couple examples that you provide and then. You see from there how good the model is. These are the number of examples usually represented with a K, so you might see few shots, K equal five, it means five examples were passed, and then fine tune is you actually take a bunch of data and fine tune the model for that specific task, and then you test it.[00:08:55] These all go from the least amount of work required to the most amount of work required. If you're doing zero shots benchmarking, you do not need to have any data, so you can just take 'em out and do. If you're fine tuning it, you actually need a lot of data and a lot of compute time. You're expecting to see much better results from there.[00:09:14] Yeah. And sometimes the number of shots can go up to like a hundred, which is pretty surprising for me to see that people are willing to test these language models that far. But why not? You just run the computer a little bit longer. Yeah. Uh, what's next? Should we go into history and then benchmarks? Yeah.[00:09:29] History of Benchmarking since 1985[00:09:29] Okay, so I was up all night yesterday. I was like, this is a fascinating topic. And I was like, all right, I'll just do whatever's in the G PT three paper. And then I read those papers and they all cited previous papers, and I went back and back and back all the way to 1985. The very first benchmark that I can find.[00:09:45] 1985-1989: WordNet and Entailment[00:09:45] Which is WordNet, which is uh, an English benchmark created in at Princeton University by George Miller and Christian Fellbaum. Uh, so fun fact, Chris George Miller also authored the paper, the Magical Number seven plus Minus two, which is the observation that people have a short term memory of about seven for things.[00:10:04] If you have plus or minus two of seven, that's about all you can sort of remember in the short term, and I just wanted. Say like, this was before computers, right? 1985. This was before any of these personal computers were around. I just wanna give people a sense of how much work manual work was being done by these people.[00:10:22] The database, uh, WordNet. Sorry. The WordNet database contains 155,000 words organized in 175,000 sys. These sys are basically just pairings of nouns and verbs and adjectives and adverbs that go together. So in other words, for example, if you have nouns that are hyper names, if every X is a, is a kind of Y.[00:10:44] So a canine is a hyper name of a dog. It's a holo. If X is a part of Y, so a building is a hollow name of a window. The most interesting one for in terms of formal, uh, linguistic logic is entailment, which captures the relationship between two words, where the verb Y is entailed by X. So if by doing X, you must be doing Y.[00:11:02] So in other words, two, sleep is entailed by two snore because you cannot snore without also sleeping and manually mapping 155,000 words like that, the relationships between all of them in a, in a nested tree, which is. Incredible to me. Mm-hmm. And people just did that on faith. They were like, this will be useful somehow.[00:11:21] Right. Uh, and they were interested in cycle linguistics, like understanding how humans thought, but then it turned out that this was a very good dataset for understanding semantic similarity, right? Mm-hmm. Like if you measure the distance between two words by traversing up and down the graph, you can find how similar to two words are, and therefore, Try to figure out like how close they are and trade a model to, to predict that sentiment analysis.[00:11:42] You can, you can see how far something is from something that is considered a good sentiment or a bad sentiment or machine translation from one language to the other. Uh, they're not 200 word languages, which is just amazing. Like people had to do this without computers. Penn Tree Bank, I was in 1989, I went to Penn, so I always give a shout out to my university.[00:12:01] This one expanded to 4.5 million words of text, which is every uh, wall Street Journal. For three years, hand collected, hand labeled by grad students your tuition dollars at work. So I'm gonna skip forward from the eighties to the nineties. Uh, NYS was the most famous data set that came out of this. So this is the, uh, data set of 60,000.[00:12:25] Training images of, uh, of numbers. And this was the first visual dataset where, uh, people were tr tracking like, you know, handwritten numbers and, and mapping them to digital numbers and seeing what the error rate for them was. Uh, these days I think this can be trained in like e every Hello world for machine learning is just train missed in like four lanes of code.[00:12:44] 1998-2004 Enron Emails and MNIST[00:12:44] Then we have the Enron email data set. Enron failed in 2001. Uh, the emails were released in 2004 and they've been upgraded every, uh, every few years since then. That is 600,000 emails by 150 senior employees of Enron, which is really interesting because these are email people emailing each other back and forth in a very natural.[00:13:01] Context not knowing they're being, they're about to be observed, so you can do things like email classification, email summarization, entity recognition and language modeling, which is super cool. Any thoughts about that be before we go into the two thousands? I think like in a way that kind of puts you back to the bias, you know, in some of these benchmarks, in some of these data sets.[00:13:21] You know, like if your main corpus of benchmarking for entity recognition is a public energy company. Mm-hmm. You know, like if you're building something completely different and you're building a model for that, maybe it'll be worse. You know, you start to see how we started. With kind of like, WordNet is just like human linguistics, you know?[00:13:43] Yes. It's not domain related. And then, um, same with, you know, but now we're starting to get into more and more domain-specific benchmarks and you'll see this increase over time. Yeah. NY itself was very biased towards, um, training on handwritten letter. Uh, and handwritten numbers. So, um, in 2017 they actually extended it to Eist, which is an extended to extension to handwritten letters that seems very natural.[00:14:08] And then 2017, they also had fashion ness, which is a very popular data set, which is images of clothing items pulled from Zando. So you can see the capabilities of computer vision growing from single digit, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, to all the letters of the alphabet. To now we can recognize images, uh, of fashion, clothing items.[00:14:28] So it's pretty. So the big one for deep learning, cuz all of that was just, just the appetizers, just getting started.[00:14:35] 2009-2014 : ImageNet, CIFAR and the AlexNet Moment for Deep Learning[00:14:35] The big one for deep learning was ImageNet, which is where Fafa Lee came into the picture and that's why she's super well known. She started working in 2006 and released it in 2009. Fun fact, she actually met with, uh, Christian Feldbaum, who was, uh, one of the co-authors of, uh, war.[00:14:51] To create ImageNet. So there's a direct lineage from Words to Images. Yeah. And uh, they use Amazon Mechanical Turk to help with classification images. No longer grad students. But again, like I think, uh, this goes, kind of goes back to your observation about bias, like when I am a mechanical Turk worker. And I'm being paid by the image to classify an image.[00:15:10] Do you think I'll be very careful at my job? Right? Yeah. Whereas when I'm a, you know, Enron employee, emailing my, my fellow coworker, trying to just communicate something of, of natural language that is a different type of, uh, environment. Mm-hmm. So it's a pretty interesting benchmark. So it was released in 2009 ish and, you know, people were sort of competing to recognize and classify that properly.[00:15:33] The magic moment for ImageNet came in 2012, uh, which is called the AlexNet moment cuz I think that grad student that, um, created this recognition model was, uh, named Alex, I forget his last name, achieved a error rate of 15%, which is, More than 10% lower than the runner up. So it was used just so much better than the second place that everyone else was like, what are you doing?[00:15:54] Uh, and it turned out that he was, he was the first to use, uh, deep learning, uh, c n n 10 percentage points. So like 15 and the other one was 25. Yeah, exactly. So it was just so much, so much better than the others. It was just unbelievable that no one else was, no other approach was even coming close.[00:16:09] Therefore, everyone from there on out for the next, until today we're just learning the lessons of deep learning because, um, it is so much superior to the other approaches. And this was like a big. Images and visual moment because then you had like a sci-fi 10, which is a, another, like a data set that is mostly images.[00:16:27] Mm-hmm. Focused. Mm-hmm. So it took a little bit before we got back to to text. And nowadays it feels like text, you know, text models are kind of eating the word, you know, we're making the text one multi-model. Yeah. So like we're bringing the images to GBT four instead of the opposite. But yeah, in 2009 we had a, another 60,000 images that set.[00:16:46] 32 by 32. Color images with airplanes, automobiles, like, uh, animals, like all kind of stuff. Like I, I think before we had the numbers, then we had the handwritten letters. Then we had clothing, and then we finally made clothing items came after, oh, clothing items. 2009. Yeah, this is 2009. I skipped, I skipped time a little bit.[00:17:08] Yeah, yeah. But yeah, CFR 10 and CFR 100. CFR 10 was for 10 classes. And that that was chosen. And then obviously they optimized that and they were like, all right, we need a new problem now. So in 20 14, 5 years later, they introduced CFAR 100, which was a hundred classes of other items. And I think this is a very general pattern, which is used.[00:17:25] You create a data set for a specific be. You think it's too hard for machines? Mm-hmm. It lasts for five years before it's no longer too hard for machines, and you have to find a new data set and you have to extend it again. So it's Similarly, we are gonna find that in glue, which is another, which is one of more modern data sets.[00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference[00:17:42] This one came out in 2018. Glue stands for general Language Understanding Evaluation. This is one of the most influential, I think, early. Earlier, um, language model benchmarks, and it has nine tasks. Um, so it has single sentence tasks, similarity and paraphrase tasks and inference tasks. So a single sentence task, uh, would be something like, uh, the Stanford Sentiment Tree Bank, which is a.[00:18:05] Uh, sentences from movie reviews and human annotations of the sentiment, whether it's positive or negative, in a sort of like a four point scale. And your job is to predict the task of a single sentence. This similarity task would involve corpuses, like the Microsoft research paraphrase corpus. So it's a corpus of sentence pairs automatically extracted from online news sources with human annotations for whether or not the sentence is in the para semantically equivalent.[00:18:28] So you just predict true or false and again, Just to call back to the math that we did earlier in this episode, the classes here are imbalance. This data set, for example, is 68% positive. So we report both accuracy and F1 scores. F1 is a more balanced approach because it, it adjusts for, uh, imbalanced, um, data sets.[00:18:48] Mm-hmm. Yeah. And then finally, inference. Inference is the one where we really start to have some kind of logic. So for example, the M N L I. Um, actually I'm, I'm gonna focus on squad, the Stanford questioning question answering dataset. It's another data set of pairs, uh, questions, uh, uh, p question paragraphs, pairs.[00:19:04] So where one of the sentences of the paragraph drawn from Wikipedia contains the answer to the corresponding question, we convert the task into a sentence, para classification by forming a pair between each question in each sentence into corresponding context and filtering out pairs of low overlap. So basically annotating whether or not.[00:19:20] Is the answer to the question inside of this paragraph that I pulled. Can you identify that? And again, like Entailment is kind of included inside of each of these inference tasks because it starts to force the language model to understand whether or not one thing implies the other thing. Mm-hmm. Yeah.[00:19:37] And the, the models evolving. This came out in 2018, lasted one year exactly. One year later, people were like, that's too easy. That's too easy. So in 2019, they actually came out with super. I love how you'll see later with like swag and hella swag. It's like they come up with very good names for these things.[00:19:55] Basically what's super glue dead is stick glue and try and move outside of the single sentence evaluation. So most of the tasks that. Sean was talking about focus on one sentence. Yeah, one sentence, one question. It's pretty straightforward in that way. Superglue kind of at the, so one, it went from single sentence to having some multi sentence and kind of like a context driven thing.[00:20:21] So you might have questions where, The answer is not in the last paragraph that you've read. So it starts to test the, the context window on this model. Some of them are more, in order to know the answer, you need to know what's not in the question kind of thing. So like you may say, Hey, this drink is owned by the Coca-Cola company.[00:20:43] Is this a Pepsi product? You know, so you need to make the connection false. Exactly, yeah. Then you have also like, um, embedded clauses. So you have things that are not exactly said, have to be inferred, and like a lot of this stack is very conversational. So some of the example contain a lot of the, um, um, you know, or this question's very hard to read out.[00:21:07] Yeah, I know. It's like, it sounds like you are saying, um, but no, you're actually, you're actually. And yet I hope to see employer base, you know, helping out child, um, care centers at the place of employment, things like that, that will help out. It's kind of hard to even read it. And then the hypothesis is like they're setting a trend.[00:21:27] It's going from something very simple like a big p d extract to something that is more similar to how humans communicate. Transcripts, like audio transcripts. Exactly. Of how people talk. Yeah. And some of them are also, Plausibility. You know, like most of these models have started to get good at understanding like a clear cause, kind of like a.[00:21:48] You know, cause effect things. But some of the plausible ones are like, for example, this one is a copa. They're called choice of plausible alternatives. The premises, my body cast a shadow over the grass. What's the cost for this alternative? One, the sun was rising. Alternative to the grass was cut.[00:22:07] Obviously it's the sun was rising, but nowhere. In the question we're actually mentioning the sun, uh, we are mentioning the grass. So some models, some of the older models might see the grass and make the connection that the grass is part of the reason, but the models start to get better and better and go from simply looking at the single sentence context to a more of a, a word new, uh, word knowledge.[00:22:27] It's just really impressive, like the fact that. We can expect that out of a model. It still blows my mind. I think we should not take it for granted that when we're evaluating models, we're asking questions like this that is not obvious from just the given text itself. Mm-hmm. So it, it is just coming with a memorized view of the world, uh, or, or world knowledge. And it understands the premise on, on some form. It is not just random noise. Yeah, I know. It's really impressive. This one, I actually wanted multi rc I actually wanted to spring on you as a, as a test, but it's just too long to read. It's just like a very long logic question.[00:23:03] And then it'll ask you to do, uh, comprehension. But uh, yeah, we'll just, we'll just kinda skip that. We'll put it, we'll put it in the show notes, and then you have to prove us that you're a human. Send us the answer exactly. Exactly and subscribe to the podcast. So superglue was a lot harder, and I think also was superseded eventually, pretty soon.[00:23:21] 2018-2019: Swag and HellaSwag - Common Sense Inference[00:23:21] And, uh, yeah, then we started coming onto the more recent cohort of tests. I don't know how to introduce the rest. Uh, there, there are just so many tests here that I, I struggle a little bit picking from these. Uh, but perhaps we can talk about swag and heli swyx since you mentioned it. Yeah. So SWAG stands for situations with Adversarial Generations.[00:23:39] Uh, also came out in 2018, but this guy, zes Etal, likes to name his data sets and his benchmarks in a very memorable way. And if you look at the PDF of the paper, he also has a little icon, uh, image icon for swag. And he doesn't just go by, uh, regular language. So he definitely has a little bit of branding to this and it's.[00:24:00] Part. So I'll give you an example of the kind of problems that swyx poses. Uh, it it is focused on common sense inference. So what's common sense inference? So, for example, given a partial description, like she opened the hood of the car, humans can reason about the situation and anticipate what might come next.[00:24:16] Then she examined the engine. So you're supposed to pick based on what happened in the first part. What is most likely to happen in the second part based on the, uh, multiple choice question, right? Another example would be on stage, a woman takes a seat at the piano. She a, sits on a bench as her sister plays at the doll.[00:24:33] B. Smiles with someone as the music play. C is in the crowd watching the dancers. D nervously set her fingers on the keys, so A, B, C, or D. It's not all of them are plausible. When you look at the rules of English, we're we've, we're not even checking for whether or not produces or predicts grammatical English.[00:24:54] We're checking for whether the language model can correctly pick what is most likely given the context. The only information that you're given is on stage. A woman takes a seat at the piano, what is she most likely to do next? And D makes sense. It's arguable obviously. Sometimes it could be a. In common sense, it's D.[00:25:11] Mm-hmm. So we're training these models to have common. Yeah, which most humans don't have. So it's a, it's already a step up. Obviously that only lasted a year. Uh, and hello, SWAG was no longer, was no longer challenging in 2019, and they started extending it quite a lot more, a lot more questions. I, I forget what, how many questions?[00:25:33] Um, so Swag was a, swag was a data set. A hundred thousand multiple choice questions. Um, and, and part of the innovation of swag was really that you're generating these questions rather than manually coming up with them. Mm-hmm. And we're starting to get into not just big data, but big questions and big benchmarks of the, of the questions.[00:25:51] That's where the adversarial generations come in, but how that swag. Starts pulling in from real world questions and, and data sets like, uh, wikiHow and activity net. And it's just really, you know, an extension of that. I couldn't even add examples just cuz there's so many. But just to give you an idea of, uh, the progress over time.[00:26:07] Aside: How to Design Benchmarks[00:26:07] Most of these benchmarks are, when they're released, they set. Benchmark at a level where if you just randomly guessed all of the questions, you'll get a 25%. That's sort of the, the baseline. And then you can run each of the language models on them, and then you can run, uh, human evaluations on them. You can have median evaluations, and then you have, um, expert evaluations of humans.[00:26:28] So the randoms level was, uh, for halla. swyx was 20. GT one, uh, which is the, uh, 2019 version that got a 41 on the, on the Hello Sue X score. Bert from Google, got 47. Grover, also from Google, got 57 to 75. Roberta from Facebook, got 85 G P T, 3.5, got 85, and then GPT4 got 95 essentially solving hello swag. So this is useless too.[00:26:51] 2021 - MMLU - Human level Professional Knowledge[00:26:51] We need, we need super Hell now's use this. Super hell swyx. I think the most challenging one came from 2021. 2021 was a very, very good year in benchmarking. So it's, we had two major benchmarks that came out. Human eval and M M L U, uh, we'll talk about mm. M L U first, cuz that, that's probably the more, more relevant one.[00:27:08] So M M L U. Stands for measuring mul massive multitask language understanding, just by far the biggest and most comprehensive and most human-like, uh, benchmark that we've had for until 2021. We had a better one in 2022, but we'll talk about that. So it is a test that covers 57 tasks, including elementary, math, US history, computer science law, and more.[00:27:29] So to attain high accuracy on this task, models must possess extensive world knowledge and prop problem solving. Its. Includes practice questions for the GRE test and the U United States, um, m l e, the medical exam as. It also includes questions from the undergrad courses from Oxford, from all the way from elementary high school to college and professional.[00:27:49] So actually the opening question that I gave you for this podcast came from the math test from M M L U, which is when you drop a ball from rest, uh, what happens? And then also the question about the Complex Z plane, uh, but it equally is also asking professional medicine question. So asking a question about thyroid cancer and, uh, asking you to diagnose.[00:28:10] Which of these four options is most likely? And asking a question about microeconomics, again, giving you a, a situation about regulation and monopolies and asking you to choose from a list of four questions. Mm-hmm. Again, random baseline is 25 out of 100 G P T two scores, 32, which is actually pretty impressive.[00:28:26] GT three scores between 43 to 60, depending on the the size. Go. Scores 60, chinchilla scores 67.5, GT 3.5 scores, 70 GPT4 jumps, one in 16 points to 86.4. The author of M M L U, Dan Hendrix, uh, was commenting on GPT4 saying this is essentially solved. He's basically says like, GT 4.5, the, the next incremental improvement on GPT4 should be able to reach expert level human perform.[00:28:53] At which point it is passing simultaneously, passing all the law exams, all the medical exams, all the graduate student exams, every single test from AP history to computer science to. Math to physics, to economics. It's very impressive. Yeah. And now you're seeing, I mean, it's probably unrelated, but Ivy League universities starting to drop the a t as a requirement for getting in.[00:29:16] So yeah. That might be unrelated as well, because, uh, there's a little bit of a culture war there with regards to, uh, the, the inherent bias of the SATs. Yeah. Yeah. But I mean, that's kinda, I mean exactly. That's kinda like what we were talking about before, right? It's. If a model can solve all of these, then like how good is it really?[00:29:33] How good is it as a Exactly. Telling us if a person should get in. It captures it. Captures with just the beginning. Yeah. Right.[00:29:39] 2021: HumanEval - Code Generation[00:29:39] Well, so I think another significant. Benchmark in 2021 was human eval, which is, uh, the first like very notable benchmark for code code generation. Obviously there's a, there's a bunch of research preceding this, but this was the one that really caught my eye because it was simultaneously introduced with Open Eyes Codex, which is the code generation model, the version of G P T that was fine tuned for generating code.[00:30:02] Uh, and that is, Premise of, well, there is the origin or the the language model powering GitHub co-pilot and yeah, now we can write code with language models, just with that, with that benchmark. And it's good too. That's the other thing, I think like this is one where the jump from GT 3.5 to GPT4 was probably the biggest, like GT 3.4 is like 48% on. On this benchmark, GPT4 is 67%. So it's pretty big. Yeah. I think coders should rest a little bit. You know, it's not 90 something, it's, it's still at 67, but just wait two years. You know, if you're a lawyer, if you're a lawyer, you're done. If you're a software engineer, you got, you got a couple more years, so save your money.[00:30:41] Yeah. But the way they test it is also super creative, right? Like, I think maybe people don't understand that actually all of the tests that are given here are very intuitive. Like you. 90% of a function, and then you ask the language model to complete it. And if it completes it like any software engineer would, then you give it a win.[00:31:00] If not, you give it a loss, run that model 164 times, and that is human eval. Yeah. Yeah. And since a lot of our listeners are engineers too, I think the big thing here is, and there was a, a link that we had that I missed, but some of, for example, some of. Coding test questions like it can answer older ones very, very well.[00:31:21] Like it doesn't not answer recent ones at all. So like you see some of like the data leakage from the training, like since it's been trained on the issues, massive data, some of it leaks. So if you're a software engineer, You don't have to worry too much. And hopefully, especially if you're not like in the JavaScript board, like a lot of these frameworks are brand new every year.[00:31:41] You get a lot of new technologies. So there's Oh, there's, oh yeah. Job security. Yes, exactly. Of course. Yeah. You got a new, you have new framework every year so that you have job security. Yeah, exactly. I'll sample, uh, data sets.[00:31:51] 2020 - XTREME - Multilingual Benchmarks[00:31:51] So before we get to big bench, I'll mention a couple more things, which is basically multilingual benchmarks.[00:31:57] Uh, those are basically simple extensions of monolingual benchmarks. I feel like basical. If you can. Accurately predicts the conversion of one word or one part of the word to another part of the word. Uh, you get a score. And, and I think it's, it's fairly intuitive over there. Uh, but I think the, the main benchmarks to know are, um, extreme, which is the, uh, x the x lingual transfer evaluation, the multilingual encoders, and much prefer extreme.[00:32:26] I know, right? Uh, that's why, that's why they have all these, uh, honestly, I think they just wanted the acronym and then they just kinda worked backwards. And then the other one, I can't find it in my notes for, uh, what the other multilingual ones are, but I, I just think it's interesting to always keep in mind like what the other.[00:32:43] Language capabilities are like, one language is basically completely equivalent to another. And I think a lot of AI ethicists or armchair AI ethicists are very angry that, you know, most of the time we optimize for English because obviously that has, there's the most, uh, training corpuses. I really like extreme the work that's being done here, because they took a, a huge amount of effort to make sure they cover, uh, sparse languages like the, the less popular ones.[00:33:06] So they had a lot of, uh, the, the, obviously the, the popular. Uh, the world's top languages. But then they also selected to maximize language diversity in terms of the complete diversity in, uh, human languages like Tamil Telugu, maam, and Sohi and Yoruba from Africa. Mm-hmm. So I just thought like that kind of effort is really commendable cuz uh, that means that the rest of the world can keep up in, in this air race.[00:33:28] Right. And especially on a lot of the more human based things. So I think we talked about this before, where. A lot of Israel movies are more[00:33:36] focused on culture and history and like are said in the past versus a lot of like the Western, did we talk about this on the podcast? No, not on the podcast. We talked and some of the Western one are more focused on the future and kind of like what's to come.[00:33:48] So I feel like when you're, some of the benchmarks that we mentioned before, you know, they have movie reviews as like, uh, one of the. One of the testing things. Yeah. But there's obviously a big cultural difference that it's not always captured when you're just looking at English data. Yeah. So if you ask the a motto, it's like, you know, are people gonna like this movie that I'm writing about the future?[00:34:10] Maybe it's gonna say, yeah, that's a really good idea. Or if I wanna do a movie about the past, it's gonna be like maybe people want to hear about robots. But that wouldn't be the case in, in every country. Well, since you and I speak different languages, I speak Chinese, you speak Italian, I'm sure you've tested the Italian capabilities.[00:34:29] What do you think? I think like as. Italy, it's so much more, um, dialect driven. So it can be, it can be really hard. So what kind of Italian does g PT three speak? Actually Italian, but the reality is most people have like their own, their own like dialect. So it would be really hard for a model to fool. An Italian that it's like somebody from where they are, you know?[00:34:49] Yeah. Like you can actually tell if you're speaking to AI bot in Chinese because they would not use any of the things that human with humans would use because, uh, Chinese humans would use all sorts of replacements for regular Chinese words. Also, I tried one of those like language tutor things mm-hmm.[00:35:06] That people are making and they're just not good Chinese. Not colloquial Chinese, not anything that anyone would say. They would understand you, but they were from, right, right.[00:35:14] 2022: BIG-Bench - The Biggest of the Benches[00:35:14] So, 2022, big bench. This was the biggest of the biggest, of the biggest benchmarks. I think the, the main pattern is really just, Bigger benchmarks rising in opposition to bigger and bigger models.[00:35:27] In order to evaluate these things, we just need to combine more and more and way more tasks, right? Like swag had nine tasks, hello swag had nine more tasks, and then you're, you're just adding and adding and adding and, and just running a battery of tasks all over. Every single model and, uh, trying to evaluate how good they are at each of them.[00:35:43] Big bench was 204 tasks contributed by 442 authors across 132 institutions. The task topics are diverse, drawing from linguistics, childhood development, math, common sense reasoning, biology, physics, social bias, software development, and beyond. I also like the fact that these authors also selected tasks that are not solved by current language models, but also not solvable by memorizing the internet, which is mm-hmm.[00:36:07] Tracking back to a little bit of the issues that we're, we're gonna cover later. Right. Yeah. I think that's, that's super interesting. Like one of, some of the examples would include in the following chess position, find a checkmate, which is, some humans cannot do that. What is the name of the element within a topic number of six?[00:36:22] Uh, that one you can look up, right? By consulting a periodic table. We just expect language models to memorize that. I really like this one cuz it's, uh, it's inherent. It's, uh, something that you can solve.[00:36:32] Identify whether this sentence has an anachronism. So, option one. During the Allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his radio.[00:36:41] And in option two, during the allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his iPhone. And you have to use context of like when iPhone, when Ally bombarding. Mm-hmm. And then sort of do math to like compare one versus the other and realize that okay, this one is the one that's out of place.[00:36:57] And that's asking more and more and more of the language model to do in implicitly, which is actually modeling what we do when we listen to language, which is such a big. Gap. It's such a big advancement from 1985 when we were comparing synonyms. Mm-hmm. Yeah, I know. And it's not that long in the grand scheme of like humanity, you know, like it's 40 years.[00:37:17] It's crazy. It's crazy. So this is a big missing gap in terms of research. Big benches seems like the most comprehensive, uh, set of benchmarks that we have. But it is curiously missing from Gypsy four. Mm-hmm. I don't know. On paper, for code, I only see Gopher two 80. Yeah. On it. Yeah. Yeah. It could be a curious emission because it maybe looks.[00:37:39] Like it didn't do so well.[00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results[00:37:40] Hello, this is Swyx from the editing room sometime in the future. I just wanted to interject that. Uh, we now know why the GPT for benchmark results did not include the big bench. Benchmark, even though that was the state-of-the-art benchmark at the time. And that's because the. Uh, GPC four new the Canary G U I D of the big bench.[00:38:02] Benchmark. Uh, so Canary UID is a random string, two, six[00:38:08] eight six B eight, uh, blah, blah, blah. It's a UID. UID, and it should not be knowable by the language model. And in this case it was therefore they had to exclude big bench and that's. And the issue of data contamination, which we're about to go into right now.[00:38:25] Issue: GPT4 vs the mystery of the AMC10/12[00:38:25] And there's some interesting, if you dive into details of GPT4, there's some interesting results in GPT4, which starts to get into the results with benchmarking, right? Like so for example, there was a test that GPT4 published that is very, very bizarre to everyone who is even somewhat knowledgeable.[00:38:41] And this concerns the Ammc 10 and AMC 12. So the mc. Is a measure of the American math 10th grade student and the AMC12 is a, uh, is a measure of the American 12th grade student. So 12 is supposed to be harder than 10. Because the students are supposed to be older, it's, it's covering topics in algebra, geometry number, theory and combinatorics.[00:39:04] GPT4 scored a 30 on AMC10 and scored a 60 on AMC12. So the harder test, it got twice as good, and 30 was really, really bad. So the scoring format of AMC10. It is 25 questions. Each correct answer is worth six points. Each incorrect answer is worth 1.5 points and unanswered questions receive zero points.[00:39:25] So if you answer every single question wrong, you will get more than GPT4 got on AMC10. You just got everything wrong. Yeah, it's definitely better in art medics, you know, but it's clearly still a, a long way from, uh, from being even a high school student. Yeah. There's a little bit of volatility in these results and it, it shows that we, it's not quite like machine intelligence is not the same, or not linearly scaling and not intuitive as human intelligence.[00:39:54] And it's something that I think we should be. Aware of. And when it freaks out in certain ways, we should not be that surprised because Yeah, we're seeing that. Yeah. I feel like part of it is also human learning is so structured, you know, like you learn the new test, you learn the new test, you learn the new test.[00:40:10] But these models, we kind of throw everything at them all at once, you know, when we train them. So when, when the model is strained, are you excusing the model? No, no, no. I'm just saying like, you know, and you see it in everything. It's like some stuff. I wonder what the percentage of. AMC 10 versus AMC 12.[00:40:28] Issue: Data Contamination[00:40:28] Content online is, yes. This comes in a topic of contamination and memorization. Right. Which we can get into if we, if we, if we want. Yeah. Yeah, yeah. So, uh, we're getting into benchmarking issues, right? Like there's all this advancements in benchmarks, uh, language models. Very good. Awesome. Awesome, awesome. Uh, what are the problems?[00:40:44] Uh, the problem is that in order to train these language models, we are scraping the vast majority of the internet. And as time passes, the. Of previous runs of our tests will be pasted on the internet, and they will go into the corpus and the leg model will be memorizing them rather than reasoning them from first principles.[00:41:02] So in, in the machine, classic machine learning parlance, this would be overfitting mm-hmm. Uh, to the test rather than to the generalizing to the, uh, the results that we really want. And so there's an example of, uh, code forces as well also discovered on GPT4. So Code Forces has annual vintages and there was this guy, uh, C H H Halle on Twitter who ran GPT4 on pre 2021 problems, solved all of them and then ran it on 2022 plus problems and solved zero of them.[00:41:31] And we know that the cutoff for GPT4 was 2021. Mm-hmm. So it just memorized the code forces problems as far as we can tell. And it's just really bad at math cuz it also failed the mc 10 stuff. Mm-hmm. It's actually. For some subset of its capabilities. I bet if you tested it with GPT3, it might do better, right?[00:41:50] Yeah. I mean, this is the, you know, when you think about models and benchmarks, you can never take the benchmarks for what the number says, you know, because say, you know, you're focusing on code, like the benchmark might only include the pre 2021 problems and it scores great, but it's actually bad at generalizing and coming up with new solutions.[00:42:10] So, yeah, that, that's a. Big problem.[00:42:13] Other Issues: Benchmark Data Quality and the Iris data set[00:42:13] Yeah. Yeah. So bias, data quality, task specificity, reproducibility, resource requirements, and then calibrating confidence. So bias is, is, is what you might think it is. Basically, there's inherent bias in the data. So for example, when you think about doctor, do you think about a male doctor, a female doctor, in specifically an image net?[00:42:31] Businessmen, white people will be labeled businessmen, whereas Asian businessmen will be labeled Asian businessmen and that can reinforce harmful serotypes. That's the bias issue. Data quality issue. I really love this one. Okay, so there's a famous image data set we haven't talked about called the pedals or iris.[00:42:47] Iris dataset mm-hmm. Contains measurements of, uh, of, uh, length with petal length and petal with, uh, three different species of iris, iris flowers, and they have labeling issues in. So there's a mini, there's a lowest level possible error rate because the error rate exists in the data itself. And if you have a machine learning model that comes out with better error rate than the data, you have a problem cuz your machine learning model is lying to you.[00:43:12] Mm-hmm. Specifically, there's, we know this for a fact because especially for Iris flowers, the length should be longer than the, than the width. Um, but there. Number of instances in the data set where the length was shorter than the, than the width, and that's obviously impossible. So there was, so somebody made an error in the recording process.[00:43:27] Therefore if your machine learning model fits that, then it's doing something wrong cuz it's biologically impossible. Mm-hmm. Task specificity basically if you're overfitting to, to one type of task, for example, answering questions based on a single sentence or you're not, you know, facing something real world reproducibility.[00:43:43] This one is actually, I guess, the fine details of machine learning, which people don't really like to talk about. There's a lot. Pre-processing and post-processing done in I Python notebooks. That is completely un versions untested, ad hoc, sticky, yucky, and everyone does it differently. Therefore, your test results might not be the same as my test results.[00:44:04] Therefore, we don't agree that your scores are. The right scores for your benchmark, whereas you're self reporting it every single time you publish it on a, on a paper. The last two resource requirements, these are, these are more to do with GPTs. The larger and larger these models get, the harder, the more, more expensive it is to run some.[00:44:22] And some of them are not open models. In other words, they're not, uh, readily available, so you cannot tell unless they run it themselves on, on your benchmark. So for example, you can't run your GPT3, you have to kind of run it through the api. If you don't have access to the API like GPT4, then you can't run it at all.[00:44:39] The last one is a new one from GPT4's Paper itself. So you can actually ask the language models to expose their log probabilities and show you how confident they think they are in their answer, which is very important for calibrating whether the language model has the right amount of confidence in itself and in the GPT4 people. It. They were actually very responsible in disclosing that They used to have about linear correspondence between the amount of confidence and the amount of times it was right, but then adding R L H F onto GPT4 actually skewed this prediction such that it was more confident than it should be. It was confidently incorrect as as people say.[00:45:18] In other words, hallucinating. And that is a problem. So yeah, those are the main issues with benchmarking that we have to deal with. Mm-hmm. Yeah, and a lot of our friends, our founders, we work with a lot of founders. If you look at all these benchmarks, all of them just focus on how good of a score they can get.[00:45:38] They don't focus on what's actually feasible to use for my product, you know? So I think.[00:45:44] Tradeoffs of Latency, Inference Cost, Throughput[00:45:44] Production benchmarking is something that doesn't really exist today, but I think we'll see the, the rise off. And I think the main three drivers are one latency. You know, how quickly can I infer the answer cost? You know, if I'm using this model, how much does each call cost me?[00:46:01] Like is that in line with my business model I, and then throughput? I just need to scale these models to a lot of questions on the ones. Again, I just do a benchmark run and you kind of come up. For quadrants. So if on the left side you have model size going from smallest to biggest, and on the X axis you have latency tolerance, which is from, I do not want any delay to, I'll wait as long as I can to get the right answer.[00:46:27] You start to see different type of use cases, for example, I might wanna use a small model that can get me an answer very quickly in a short amount of time, even though the answer is narrow. Because me as a human, maybe I'm in a very iterative flow. And we have Varun before on the podcast, and we were talking about a kind of like a acceleration versus iteration use cases.[00:46:50] Like this is more for acceleration. If I'm using co-pilot, you know, the code doesn't have to be a hundred percent correct, but it needs to happen kind of in my flow of writing. So that's where a model like that would be. But instead, other times I might be willing, like if I'm asking it to create a whole application, I'm willing to wait one hour, you know, for the model to get me a response.[00:47:11] But you don't have, you don't have a way to choose that today with most models. They kind of do just one type of work. So I think we're gonna see more and more of these benchmark. Focus on not only on the research side of it, which is what they really are today when you're developing a new model, like does it meet the usual standard research benchmarks to having more of a performance benchmark for production use cases?[00:47:36] And I wonder who's gonna be the first company that comes up with, with something like this, but I think we're seeing more and more of these models go from a research thing to like a production thing. And especially going from companies like. Google and Facebook that have kinda unlimited budget for a lot of these things to startups, starting to integrate them in the products.[00:48:00] And when you're on a tight budget paying, you know, 1 cent per thousand tokens or 0.10 cent for a thousand tokens, like it's really important. So I think that's, um, that's what's missing to get a lot of these things to productions. But hopefully we, we see them.[00:48:16] Yeah, the software development lifecycle I'm thinking about really is that most people will start with large models and then they will prototype with that because that is the most capable ones.[00:48:25] But then as they put more and more of those things in production, people always want them to run faster and faster and faster and cheaper. So you will distill towards a more domain specific model, and every single company that puts this into production, we'll, we'll want something like that, but I, I think it's, it's a reasonable bet because.[00:48:41] There's another branch of the AI builders that I see out there who are build, who are just banking on large models only. Mm-hmm. And seeing how far they can stretch them. Right. With building on AI agents that can take arbitrarily long amounts of time because they're saving you lots of, lots of time with, uh, searching the web for you and doing research for you.[00:48:59] And I think. I'm happy to wait for Bing for like 10 seconds if it does a bunch of searches for median. Mm-hmm. Just ends with, ends with the right, right result. You know, I was, I was tweeting the other day that I wanted an AI enabled browser because I was seeing this table, uh, there was an image and I just needed to screenshot an image and say, plot this on a chart for me.[00:49:17] And I just wanted to do that, but it would have to take so many steps and I would be willing to wait for a large model to do that for me. Mm-hmm. Yeah. I mean, web development so far has been, Reduce, reduce, reduce the loading times. You know, it's like first we had the, I don't know about that. There, there are people who disagree.[00:49:34] Oh. But I, I think, like if you think about, you know, the CDN and you think about deploying things at the edge, like the focus recently has been on lowering the latency time versus increasing it.[00:49:45] Conclusion[00:49:45] Yeah. So, well that's the, that's Benchmark 1 0 1. Um. Let us know how we, how you think we did. This is something we're trying for the first time.[00:49:52] We're very inspired by other podcasts that we like where we do a bunch of upfront prep, but then it becomes a single topical episode that is hopefully a little bit more timeless. We don't have to keep keeping up with the news. I think there's a lot of history that we can go back on and. Deepen our understanding of the context of all these evolutions in, uh, language models.[00:50:12] Yeah. And if you have ideas for the next, you know, 1 0 1 fundamentals episode, yeah, let us know in the, in the comments and we'll see you all soon. Bye. Get full access to Latent Space at www.latent.space/subscribe
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.04.535544v1?rss=1 Authors: van Rossum, M. C. Abstract: The brain is not only constrained by energy needed to fuel computation, but it is also constrained by energy needed to form memories. Experiments have shown that learning simple conditioning tasks already carries a significant metabolic cost. Yet, learning a task like MNIST to 95% accuracy appears to require at least 10^{8} synaptic updates. Therefore the brain has likely evolved to be able to learn using as little energy as possible. We explored the energy required for learning in feedforward neural networks. Based on a parsimonious energy model, we propose two plasticity restricting algorithms that save energy: 1) only modify synapses with large updates, and 2) restrict plasticity to subsets of synapses that form a path through the network. Combining these two methods leads to substantial energy savings while only incurring a small increase in learning time. In biology networks are often much larger than the task requires. In particular in that case, large savings can be achieved. Thus competitively restricting plasticity helps to save metabolic energy associated to synaptic plasticity. The results might lead to a better understanding of biological plasticity and a better match between artificial and biological learning. Moreover, the algorithms might also benefit hardware because in electronics memory storage is energetically costly as well. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.03.535317v1?rss=1 Authors: Lee, K., Dora, S., Mejias, J., Bohte, S., Pennartz, C. Abstract: Predictive coding (PC) is an influential theory in neuroscience, which suggests the existence of a cortical architecture that is constantly generating and updating predictive representations of sensory inputs. Owing to its hierarchical and generative nature, PC has inspired many computational models of perception in the literature. However, the biological plausibility of existing models has not been sufficiently explored due to their use of artificial neural network features such as a non-linear, continuous, and clock-driven function approximator as basic unit of computation. Therefore, we have developed a spiking neural network for predictive coding (SNN-PC), in which neurons communicate using event-driven and asynchronous spikes. While adopting the hierarchical structure and Hebbian learning algorithms from previous PC neural network models, SNN-PC introduces two novel features: 1) a fast feedforward sweep from the input to higher areas, which generates a spatially reduced and abstract representation of input (i.e., a neural code for the gist of a scene) and provides a neurobiological alternative to an arbitrary choice of priors; and 2) a separation of positive and negative error-computing neurons, which counters the biological implausibility of a bi-directional error neuron with a very high basal firing rate. After training with the MNIST handwritten digit dataset, SNN-PC developed hierarchical internal representations and was able to reconstruct samples it had not seen during training. SNN-PC suggests biologically plausible mechanisms by which the brain may perform perceptual inference and learning in an unsupervised manner. In addition, it may be used in neuromorphic applications that can utilize its energy-efficient, event-driven, local learning, and parallel information processing nature. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VII: A Challenge for Mechanists, published by Stephen Casper on February 18, 2023 on The AI Alignment Forum. Part 7 of 12 in the Engineer's Interpretability Sequence. Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me. MI = “mechanistic interpretability” Given a network, recover its labeling function. In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. The the best of my knowledge: Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network's solution to a task that was not cherrypicked by the researcher(s) doing so. Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can't solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work. This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023). Challenge 1, MNIST CNN I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0's and the other half as 1's. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function. Hint 1: The labels are binary. Hint 2: The network gets 95.58% accuracy on the test set. Hint 3: This image may be helpful. Challenge 2, Transformer I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda's grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function. Hint 1: The labels are binary. Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half. Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary... Prizes If you are the first person to send me the labeling function and a mechanistic explanation for either challenge, I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500. Good luck For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how. Neither of the models perfectly label the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with models don't tend to conveniently grok onto a simple, elegant, programmat...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The effect of horizon length on scaling laws, published by Jacob Hilton on February 1, 2023 on The AI Alignment Forum. The scaling of optimal model size with compute is a key input into the biological anchors framework for forecasting transformative AI. In particular, the "effective horizon length" introduces a multiplier into this scaling law that can have a big effect on forecasts. This paper studies this scaling law for several RL environments: Procgen, Dota 2 and a toy MNIST-based environment. The last of these is used to study the effect of the task horizon length in a toy setting. There are a number of takeaways for the biological anchors framework, which are summarized in Section 5.4. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: Superposition, Memorization, and Double Descent (Anthropic), published by Lawrence Chan on January 5, 2023 on The AI Alignment Forum. (This is a follow up to Anthropic's prior work on Toy Models of Superposition.) The authors study how neural networks interpolate between memorization and generalization in the "ReLU Output" toy model from the first toy model paper: They train models to perform a synthetic regression task with T training points, for models with m hidden dimensions. First, they find that for small training sets, while the features are messy, the training set hidden vectors hi=WXi (the projection of the input datapoints into the hidden space) often show clean structures: They then extend their old definition of feature dimensionality to measure the dimensionalities allocated to each of the training examples: DXi = hi2∑j(^hihj)2and plot this against the data set size (and also test loss): This shows that as you increase the amount of data, you go from a regime with high dimensionality allocated to training vectors and low dimensionality allocated to features, to one where the opposite is true. In between the two, both feature and hidden vector dimensionalities receive low dimensionality, which coincides with an increase in test loss, which they compare to the phenomena of "data double descent" (where as you increase data on overparameterized models with small amounts of regularization, test loss can go up before it goes down). Finally, they visualize how varying T and m affects test loss, and find double descent along both dimensions: They also included some (imo very interesting) experiments from Adam Jermyn, 1) replicating the results, 2) exploring how weight decay interacts with this double descent-like phenomenon, and 3) studying what happens if you repeat particular datapoints. Some limitations of the work, based on my first read through: The authors note that the results seem quite sensitive to hyperparameters, especially for low hidden dimension m∈{2,3}. For example, Adam Jermyn's results differ from the Anthropic Interp team results (though the figures still look qualitatively similar). I'm still not super convinced how much results from the superposition work apply in practice. I'd be interested in seeing more work along the lines of the MNIST preliminary experiment done by Chris Olah at the bottom. (I'll probably have more thoughts as I think for longer.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Hoy vamos a describir un clasificador de imagenes Abra un nuevo cuaderno Colab en su cuenta de Google y asegúrese de tener acceso a Internet y a los paquetes de Python necesarios, como TensorFlow y Keras. Obtenga un conjunto de datos de imágenes etiquetadas que desee utilizar para entrenar y evaluar su modelo. Puede utilizar conjuntos de datos públicos o crear su propio conjunto de datos recopilando imágenes y etiquetándolas manualmente. Un conjunto de datos de imágenes etiquetadas es un conjunto de imágenes que han sido previamente clasificadas y etiquetadas con ciertas categorías o etiquetas. Estos conjuntos de datos se utilizan a menudo para entrenar y evaluar modelos de clasificación de imágenes, ya que proporcionan una forma de evaluar la precisión del modelo al clasificar imágenes nuevas. Por ejemplo, un conjunto de datos de imágenes etiquetadas podría contener imágenes de perros y gatos, etiquetadas como "perro" o "gato", respectivamente. Al entrenar un modelo de clasificación de imágenes en este conjunto de datos, podría evaluar la precisión del modelo al clasificar imágenes de perros y gatos. Existen muchos conjuntos de datos de imágenes etiquetadas disponibles en línea, tanto gratuitos como de pago, que pueden utilizarse para entrenar y evaluar modelos de clasificación de imágenes. Algunos ejemplos populares incluyen el conjunto de datos de imágenes de perros y gatos de Kaggle, el conjunto de datos de imágenes de flores de Oxford y el conjunto de datos de imágenes de dígitos escritos a mano de MNIST. Puede acceder al conjunto de datos de imágenes de perros y gatos de Kaggle de la siguiente manera: Vaya a la página de inicio de Kaggle (https://www.kaggle.com/) y regístrese o inicie sesión en su cuenta. En la barra de búsqueda en la parte superior de la página, escriba "dogs vs. cats" (perros vs. gatos) y presione Enter. Se le llevará a la página del conjunto de datos "Dogs vs. Cats" (Perros vs. Gatos). Haga clic en el botón "Download" (Descargar) para descargar el conjunto de datos a su computadora. Descomprima el archivo descargado para acceder a las imágenes y las etiquetas correspondientes. Las etiquetas se encuentran en el archivo "labels.csv" y las imágenes se encuentran en la carpeta "train" (entrenamiento). Tenga en cuenta que el conjunto de datos de imágenes de perros y gatos de Kaggle consta de 25.000 imágenes de perros y 25.000 imágenes de gatos, divididas en conjuntos de entrenamiento y prueba. Puede utilizar el conjunto de entrenamiento para entrenar un modelo de clasificación de imágenes y el conjunto de prueba para evaluar la precisión del modelo. Calcificación de imagenes Canal telegram Déjame un mensaje de voz
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Re-Examining LayerNorm, published by Eric Winsor on December 1, 2022 on LessWrong. Please check out the colab notebook for interactive figures and more detailed technical explanations. This post is part of the work done at Conjecture. Special thanks to Sid Black, Dan Braun, Carlos Ramón Guevara, Beren Millidge, Chris Scammell, Lee Sharkey, and Lucas Teixeira for feedback on early drafts. There's a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and polytopes are unhelpful, or even harmful, in understanding normalization's impact on a network's representations. After all, normalization doesn't alter the direction of vectors, but it still bends lines and planes (the boundaries of polytopes) out of shape. As it turns out, LayerNorm can be used as a general purpose activation function (you can solve MNIST with a LayerNorm MLP, for example). Concretely, it can do things like this: and this: We will explain what's going on in these animations later, but the point is that to develop a strong, principled theory of mechanistic interpretability, we need to grapple with this non-linearity. In this interactive notebook, we study LayerNorm systematically using math and geometric intuition to characterize the ways in which it can manipulate data. We show that the core non-linearity of LayerNorm can be understood via simple geometric primitives. We explain how these basic primitives may perform semantic operations. For example, folding can be viewed as extracting extremal features (e.g. separating extreme temperatures from normal ones). We leverage these primitives to understand more complex low-dimensional classification tasks with multiple layers of non-linearities. The methods and intuition developed here extend to non-linearities beyond LayerNorm, and we plan to extend them in future work. Below is an interactive summary of the content of the notebook. (If you find the summary interesting, you really should check out the notebook: manipulating the figures helps build geometric intuition.) Summary The formula for LayerNorm is something messy like But it turns out the core non-linear operation is (almost) normalizing a vector: Graphically, this function has the iconic sigmoid shape in one dimension (note that in 1D the norm is simply the absolute value). Interesting things start happening when we precompose this normalization function with affine transformations (such as scaling and shifting). Below, we start with a collection of points, (x,y), distributed uniformly on the sphere. Then we compute uϵ(tx,y) (stretch/shrink by a factor of t in the x direction and then normalize). Since normalization guarantees points end up on the circle, this operation stretches the distribution along the circle. The right hand panel illustrates how this stretching is captured in the shape of a 1D activation function (the x coordinate of the input against the x coordinate of the output). For example, when t is close to 5, we see that any points with x coordinates not near 0 get compressed towards x=−1 or x=1. This matches the picture in the circle where we see the points bunch up at the left and right sides of the circle. If we make the same plots with a shifting operation, uϵ(x+t,y), the operation folds the input data. We can envision possible "semantic" uses for these geometric operations. Stretching can be used to perform an approximate "sign" operation (as in + or - sign). For example, it ...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Re-Examining LayerNorm, published by Eric Winsor on December 1, 2022 on LessWrong. Please check out the colab notebook for interactive figures and more detailed technical explanations. This post is part of the work done at Conjecture. Special thanks to Sid Black, Dan Braun, Carlos Ramón Guevara, Beren Millidge, Chris Scammell, Lee Sharkey, and Lucas Teixeira for feedback on early drafts. There's a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and polytopes are unhelpful, or even harmful, in understanding normalization's impact on a network's representations. After all, normalization doesn't alter the direction of vectors, but it still bends lines and planes (the boundaries of polytopes) out of shape. As it turns out, LayerNorm can be used as a general purpose activation function (you can solve MNIST with a LayerNorm MLP, for example). Concretely, it can do things like this: and this: We will explain what's going on in these animations later, but the point is that to develop a strong, principled theory of mechanistic interpretability, we need to grapple with this non-linearity. In this interactive notebook, we study LayerNorm systematically using math and geometric intuition to characterize the ways in which it can manipulate data. We show that the core non-linearity of LayerNorm can be understood via simple geometric primitives. We explain how these basic primitives may perform semantic operations. For example, folding can be viewed as extracting extremal features (e.g. separating extreme temperatures from normal ones). We leverage these primitives to understand more complex low-dimensional classification tasks with multiple layers of non-linearities. The methods and intuition developed here extend to non-linearities beyond LayerNorm, and we plan to extend them in future work. Below is an interactive summary of the content of the notebook. (If you find the summary interesting, you really should check out the notebook: manipulating the figures helps build geometric intuition.) Summary The formula for LayerNorm is something messy like But it turns out the core non-linear operation is (almost) normalizing a vector: Graphically, this function has the iconic sigmoid shape in one dimension (note that in 1D the norm is simply the absolute value). Interesting things start happening when we precompose this normalization function with affine transformations (such as scaling and shifting). Below, we start with a collection of points, (x,y), distributed uniformly on the sphere. Then we compute uϵ(tx,y) (stretch/shrink by a factor of t in the x direction and then normalize). Since normalization guarantees points end up on the circle, this operation stretches the distribution along the circle. The right hand panel illustrates how this stretching is captured in the shape of a 1D activation function (the x coordinate of the input against the x coordinate of the output). For example, when t is close to 5, we see that any points with x coordinates not near 0 get compressed towards x=−1 or x=1. This matches the picture in the circle where we see the points bunch up at the left and right sides of the circle. If we make the same plots with a shifting operation, uϵ(x+t,y), the operation folds the input data. We can envision possible "semantic" uses for these geometric operations. Stretching can be used to perform an approximate "sign" operation (as in + or - sign). For example, it ...
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.11.30.518534v1?rss=1 Authors: Brodersen, P. J. N., Akerman, C. J. Abstract: In the search for biologically plausible but mathematically precise theories of learning in the brain, recent studies have begun to investigate how key assumptions underlying algorithms for supervised learning in artificial neural networks can be relaxed in biologically plausible ways. Turning to unsupervised learning, we develop biologically more plausible variants of the restricted Boltzmann machine (RBM), and benchmark their performance on MNIST. We show that RBMs with asymmetric connectivity can still be successfully trained with contrastive divergence, even if no two units are reciprocally connected. Furthermore, RBMs are able to learn if the forward, visible-to-hidden layer weights are kept constant and only the backward, hidden-to-visible layer weights are updated. These findings indicate that neural networks with biologically plausible connectivity support contrastive learning. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.10.19.512820v1?rss=1 Authors: Soo, W. W. M., Lengyel, M. Abstract: There continues to be a trade-off between the biological realism and performance of neural networks. Contemporary deep learning techniques allow neural networks to be trained to perform challenging computations at (near) human-level, but these networks typically violate key biological constraints. More detailed models of biological neural networks can incorporate many of these constraints but typically suffer from subpar performance and trainability. Here, we narrow this gap by developing an effective method for training a canonical model of cortical neural circuits, the stabilized supralinear network (SSN), that in previous work had to be constructed manually or trained with undue constraints. SSNs are particularly challenging to train for the same reasons that make them biologically realistic: they are characterized by strongly-connected excitatory cells and expansive firing rate non-linearities that together make them prone to dynamical instabilities unless stabilized by appropriately tuned recurrent inhibition. Our method avoids such instabilities by initializing a small network and gradually increasing network size via the dynamics-neutral addition of neurons during training. We first show how SSNs can be trained to perform typical machine learning tasks by training an SSN on MNIST classification. We then demonstrate the effectiveness of our method by training an SSN on the challenging task of performing amortized Markov chain Monte Carlo-based inference under a Gaussian scale mixture generative model of natural image patches with a rich and diverse set of basis functions - something that was not possible with previous methods. These results open the way to training realistic cortical-like neural networks on challenging tasks at scale. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Timelines via Cumulative Optimization Power: Less Long, More Short, published by jacob cannell on October 6, 2022 on LessWrong. TLDR: We can best predict the future by using simple models which best postdict the past (ala Bayes/Solomonoff). A simple model based on net training compute postdicts the relative performance of successful biological and artificial neural networks. Extrapolation of this model into the future leads to short AI timelines: ~75% chance of AGI by 2032. Cumulative Optimization Power[1]: a Simple Model of Intelligence A simple generalized scaling model predicts the emergence of capabilities in trained ANNs(Artificial Neural Nets) and BNNs(Biological Neural Nets): perf ~= P = CT For sufficiently flexible and efficient NN architectures and learning algorithms, the relative intelligence and capabilities of the best systems are simply proportional to net training compute or intra-lifetime cumulative optimization power P, where P = CT (compute ops/cycle training cycles), assuming efficient allocation of (equivalent uncompressed) model capacity bits N roughly proportional to data size bits D. Intelligence Rankings Imagine ordering some large list of successful BNNs(brains or brain modules) by intelligence (using some committee of experts), and from that deriving a relative intelligence score for each BNN. Obviously such a scoring will be noisy in its least significant bits: is a bottlenose dolphin more intelligent than an american crow? But the most significant bits are fairly clear: C. Elegans is less intelligent than Homo Sapiens. Now imagine performing the same tedious ranking process for various successful ANNs. Here the task is more challenging because ANNs tend to be far more specialized, but the general ordering is still clear: char-RNN is less intelligent than GPT-3. We could then naturally combine the two lists, and make more fine-grained comparisons by including specialized sub-modules of BNNs (vision, linguistic processing, etc). The initial theory is that P - intra-lifetime cumulative optimization power (net training compute) - is a very simple model which explains a large amount of the entropy/variance in a rank order intelligence measure: much more so than any other simple proposed candidates (at least that I'm aware of). Since P follow a predictable temporal trajectory due to Moore's Law style technological progress, we can then extrapolate the trends to predict the arrival of AGI. This simple initial theory has a few potential flaws/objections, which we will then address. Initial Exemplars I've semi-randomly chosen 15 exemplars for more detailed analysis: 8 BNNs, and 9 ANNs. Here are the 8 BNNs (6 whole brains and 2 sub-systems) in randomized order: Honey Bee Human Raven Human Linguistic Cortex Cat C. Elegans Lizard Owl Monkey Visual Cortex The ranking of the 6 full brains in intelligence is rather obvious and likely uncontroversial. Ranking all 8 BNNs in terms of P (net training compute) is still fairly obvious. Here are the 9 ANNs, also initially in randomized order: AlphaGo: First ANN to achieve human pro-level play in Go Deepspeech 2: ANN speech transcription system VPT: Diamond-level minecraft play Alexnet: Early CNN imagenet milestone, subhuman performance 6-L MNIST MLP: Early CNN milestone on MNIST, human level Chinchilla: A 'Foundation' Large Language Model GPT-3: A 'Foundation' Large Language Model DQN Atari: First strong ANN for Atari, human level on some games VIT L/14@336px: OpenAI CLIP 'Foundation' Large Vision Model Most of these systems are specialists in non-overlapping domains, such that direct performance comparison is mostly meaningless, but the ranking of the 3 vision systems should be rather obvious based on the descriptions. The DQN Atari and VPT agents are somewhat comparable to animal brains. How would you ran...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Timelines via Cumulative Optimization Power: Less Long, More Short, published by jacob cannell on October 6, 2022 on LessWrong. TLDR: We can best predict the future by using simple models which best postdict the past (ala Bayes/Solomonoff). A simple model based on net training compute postdicts the relative performance of successful biological and artificial neural networks. Extrapolation of this model into the future leads to short AI timelines: ~75% chance of AGI by 2032. Cumulative Optimization Power[1]: a Simple Model of Intelligence A simple generalized scaling model predicts the emergence of capabilities in trained ANNs(Artificial Neural Nets) and BNNs(Biological Neural Nets): perf ~= P = CT For sufficiently flexible and efficient NN architectures and learning algorithms, the relative intelligence and capabilities of the best systems are simply proportional to net training compute or intra-lifetime cumulative optimization power P, where P = CT (compute ops/cycle training cycles), assuming efficient allocation of (equivalent uncompressed) model capacity bits N roughly proportional to data size bits D. Intelligence Rankings Imagine ordering some large list of successful BNNs(brains or brain modules) by intelligence (using some committee of experts), and from that deriving a relative intelligence score for each BNN. Obviously such a scoring will be noisy in its least significant bits: is a bottlenose dolphin more intelligent than an american crow? But the most significant bits are fairly clear: C. Elegans is less intelligent than Homo Sapiens. Now imagine performing the same tedious ranking process for various successful ANNs. Here the task is more challenging because ANNs tend to be far more specialized, but the general ordering is still clear: char-RNN is less intelligent than GPT-3. We could then naturally combine the two lists, and make more fine-grained comparisons by including specialized sub-modules of BNNs (vision, linguistic processing, etc). The initial theory is that P - intra-lifetime cumulative optimization power (net training compute) - is a very simple model which explains a large amount of the entropy/variance in a rank order intelligence measure: much more so than any other simple proposed candidates (at least that I'm aware of). Since P follow a predictable temporal trajectory due to Moore's Law style technological progress, we can then extrapolate the trends to predict the arrival of AGI. This simple initial theory has a few potential flaws/objections, which we will then address. Initial Exemplars I've semi-randomly chosen 15 exemplars for more detailed analysis: 8 BNNs, and 9 ANNs. Here are the 8 BNNs (6 whole brains and 2 sub-systems) in randomized order: Honey Bee Human Raven Human Linguistic Cortex Cat C. Elegans Lizard Owl Monkey Visual Cortex The ranking of the 6 full brains in intelligence is rather obvious and likely uncontroversial. Ranking all 8 BNNs in terms of P (net training compute) is still fairly obvious. Here are the 9 ANNs, also initially in randomized order: AlphaGo: First ANN to achieve human pro-level play in Go Deepspeech 2: ANN speech transcription system VPT: Diamond-level minecraft play Alexnet: Early CNN imagenet milestone, subhuman performance 6-L MNIST MLP: Early CNN milestone on MNIST, human level Chinchilla: A 'Foundation' Large Language Model GPT-3: A 'Foundation' Large Language Model DQN Atari: First strong ANN for Atari, human level on some games VIT L/14@336px: OpenAI CLIP 'Foundation' Large Vision Model Most of these systems are specialists in non-overlapping domains, such that direct performance comparison is mostly meaningless, but the ranking of the 3 vision systems should be rather obvious based on the descriptions. The DQN Atari and VPT agents are somewhat comparable to animal brains. How would you ran...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Four usages of "loss" in AI, published by Alex Turner on October 2, 2022 on The AI Alignment Forum. Summary: What does it mean for a loss function to be "aligned with" human goals? I perceive four different concepts which involve "loss function" in importantly different ways: Physical-loss: The physical implementation of a loss function and the loss computations, Mathematical-loss: The mathematical idealization of a loss function, A loss function "encoding/representing/aligning with" an intended goal, and Agents which "care about achieving low loss." I advocate retaining physical- and mathematical-loss. I advocate dropping 3 in favor of talking directly about desired AI cognition and how the loss function entrains that cognition. I advocate disambiguating 4, because it can refer to a range of physically grounded preferences about loss (e.g. low value at the loss register versus making perfect future predictions). Related: Towards deconfusing wireheading and reward maximization. I'm going to talk about "loss" instead of "reward", but the lessons apply to both. I think it's important to maintain a sharp distinction between the following four concepts. 1: Physically implemented loss The loss function updated my network. This is a statement about computations embedded in physical reality. This statement involves the physically implemented sequence of loss computations which stream in throughout training. For example, the computations engendered by loss_fn = torch.nn.CrossEntropyLoss(). 2: Mathematical loss The loss function is a smooth function of the prediction distribution. This is a statement about the idealized mathematical loss function. These are the mathematical objects you can prove learning theory results about. The Platonic idealization of the learning problem and the mathematical output-grading rule casts a shadow into your computer via its real-world implementation (concept 1). For example, (D,ℓ) where D:={(x,label(x))∣x∈MNIST} is the mathematical idealization of the MNIST dataset, where the x∈R28×28 are the idealized grayscale MNIST images. And ℓ is the mathematical function of cross-entropy (CE) loss between a label prediction distribution and the ground-truth labels. 3: Loss functions "representing" goals I want a loss function which is aligned with the goal of "write good novels." This is an aspirational statement about achieving some kind of correspondence between the loss function and the goal of writing good novels. But what does this statement mean? Suppose you tell me "I have written down a loss function ℓnovel which is perfectly aligned with the goal of 'write good novels'." What experiences should this claim lead me to anticipate? That an agent can only achieve low physical-loss (concept 1) if it has, in physical fact, written a good novel? That in some mathematical idealization of the learning problem (concept 2), loss-minimization only occurs when the agent outputs text which would be found in what we rightly consider to be "good novels"? (But in which mathematical idealization?) That, as a matter of physical fact, if you train an agent on ℓnovel using learning setup X, then you produce a language model which can be easily prompted to output high-quality novels? The imprecision comes from loss functions not directly encoding goals. Loss signals are physically implemented (concept 1) parts of the AI's training process which (physically) update the AI's cognition in certain ways. While a loss function can be involved in the AI's decision-making structure (see items i and ii above), additional information is needed to understand what motivational cognition is being discussed. I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Q-Networks Explained, published by Jay Bailey on September 13, 2022 on LessWrong. Note: This is a long post. The post is structured in such a way that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on its own is sufficient to gain a high-level understanding of DQN if you already know how reinforcement learning works.This post was made as my final project for the AGI Safety Fundamentals course.This post aims to provide a distillation of Deep Q-Networks, neural networks trained via Deep Q-Learning. The algorithm, usually just referred to as DQN, is the algorithm that first put deep reinforcement learning on the map. The 2013 paper is also the first paper in the Key Papers section of the excellent Spinning Up In Deep RL resource. As a result, anyone who wants to understand reinforcement learning is likely to start here. Since reading and replicating papers can be difficult, this resource aims to make the learning curve a little easier for those who want to improve their ML knowledge past an initial deep learning course. This post will be focused mainly on the 2015 paper, which adds some extra improvements to the 2013 paper, as well as giving more information that's helpful for replication. This is a long post, but it's broken up into sections so that you can only read the sections you're interested in. It starts with background information, then goes over DQN at three levels of detail, each of which builds off the previous one. You should probably have a basic understanding of supervised learning in order to get the most out of this post. You should understand what a loss function is, what gradient descent is conceptually, and be familiar with the concept of a neural network. Supervised Learning With Neural Networks (Section 1) - Background information on deep learning. This can be safely skipped if you already understand how supervised learning problems like image classifiers work. Reinforcement Learning (Section 2) - Background information on RL. This can be safely skipped if you're already familiar with the basics of reinforcement learning, like states, actions, and why reinforcement learning is harder than supervised learning. DQN Overview (Section 3) - This section presents DQN at a high level, without the math. I recommend everyone read this section. You can stop there, or continue on to Section 4 if you want to understand the math behind it. DQN Technical Explanation (Section 4) - This section builds off of Section 3, and includes the algorithms and equations that make DQN's work. You can skip this section if you just want a high-level overview of what DQN is - the what, but not the how. Replicating DQN (Section 5) - This section builds off of Section 4, and includes tips and tricks to make DQN work in code. I recommend only reading this section if you're interested in replicating this or similar papers yourself. Further Improvements (Section 6) - This section talks about improvements that have been made in the DQN line of research since the original DQN papers were released. If this topic interests you, you can skip Sections 4 and 5 and still understand this section. Regardless of how far you get, feedback is certainly welcome - part of the purpose of this post is to test my fit for doing more distilling work, so if you found this helpful (or unhelpful) I'd love to hear about it! Supervised Learning With Neural Networks (Section 1) To understand supervised learning, let's start with a toy problem that's often used to compare algorithms and make sure they work before applying them to more challenging domains. MNIST is a dataset composed of tens of thousands of hand-written digits from 0 to 9. Each piece of data co...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Q-Networks Explained, published by Jay Bailey on September 13, 2022 on LessWrong. Note: This is a long post. The post is structured in such a way that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on its own is sufficient to gain a high-level understanding of DQN if you already know how reinforcement learning works.This post was made as my final project for the AGI Safety Fundamentals course.This post aims to provide a distillation of Deep Q-Networks, neural networks trained via Deep Q-Learning. The algorithm, usually just referred to as DQN, is the algorithm that first put deep reinforcement learning on the map. The 2013 paper is also the first paper in the Key Papers section of the excellent Spinning Up In Deep RL resource. As a result, anyone who wants to understand reinforcement learning is likely to start here. Since reading and replicating papers can be difficult, this resource aims to make the learning curve a little easier for those who want to improve their ML knowledge past an initial deep learning course. This post will be focused mainly on the 2015 paper, which adds some extra improvements to the 2013 paper, as well as giving more information that's helpful for replication. This is a long post, but it's broken up into sections so that you can only read the sections you're interested in. It starts with background information, then goes over DQN at three levels of detail, each of which builds off the previous one. You should probably have a basic understanding of supervised learning in order to get the most out of this post. You should understand what a loss function is, what gradient descent is conceptually, and be familiar with the concept of a neural network. Supervised Learning With Neural Networks (Section 1) - Background information on deep learning. This can be safely skipped if you already understand how supervised learning problems like image classifiers work. Reinforcement Learning (Section 2) - Background information on RL. This can be safely skipped if you're already familiar with the basics of reinforcement learning, like states, actions, and why reinforcement learning is harder than supervised learning. DQN Overview (Section 3) - This section presents DQN at a high level, without the math. I recommend everyone read this section. You can stop there, or continue on to Section 4 if you want to understand the math behind it. DQN Technical Explanation (Section 4) - This section builds off of Section 3, and includes the algorithms and equations that make DQN's work. You can skip this section if you just want a high-level overview of what DQN is - the what, but not the how. Replicating DQN (Section 5) - This section builds off of Section 4, and includes tips and tricks to make DQN work in code. I recommend only reading this section if you're interested in replicating this or similar papers yourself. Further Improvements (Section 6) - This section talks about improvements that have been made in the DQN line of research since the original DQN papers were released. If this topic interests you, you can skip Sections 4 and 5 and still understand this section. Regardless of how far you get, feedback is certainly welcome - part of the purpose of this post is to test my fit for doing more distilling work, so if you found this helpful (or unhelpful) I'd love to hear about it! Supervised Learning With Neural Networks (Section 1) To understand supervised learning, let's start with a toy problem that's often used to compare algorithms and make sure they work before applying them to more challenging domains. MNIST is a dataset composed of tens of thousands of hand-written digits from 0 to 9. Each piece of data co...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taking the parameters which seem to matter and rotating them until they don't, published by Garrett Baker on August 26, 2022 on LessWrong. A big bottleneck in interpretability is neural networks are non-local. That is, given the layer setup if we change a small bit of the original activations, then a large bit of the new activations are affected. This is an impediment to finding the circuit-structure of networks. It is difficult to figure out how something works when changing one thing affects everything. The project I'm currently working on aims to fix this issue, without affecting the training dynamics of networks or the function which the network is implementing. The idea is to find a rotation matrix R and insert it with its inverse like below, then group together the rotation with the original activations, and the inverse with the weights and nonlinear function. We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations. This locality is measured by the average sparsity of the jacobian across all the training inputs. We do this because the jacobian is a representation of how each of the inputs affects each of the outputs. Large entries represent large effects. Small entries represent small effects. So if many entries are zero, this means that fewer inputs have an effect on fewer outputs. I.e. local changes to the input cause local changes to the output. This should find us a representation of the activations and interpretations of matrix multiplies that "make sense" in the context of the rest of the network. Another way of thinking about this is that our goal is to find the basis our network is thinking in. Currently I'm getting this method to work on a simple, 3-layer, fully connected MNIST number classifying network. If this seems to give insight into the mechanics of the network after application, the plan is to adapt it to a more complicated network such as a transformer or resnet. I only have preliminary results right now, but they are looking promising: This is the normalized jacobian the middle layer before a rough version of my method: And here is the normalized jacobian after a rough version of my method (the jacobian's output has been set to a basis which maximizes it's sparsity): Thanks David Udell for feedback on the post. I did not listen to everything you said, and if I did the post would have been better This seems important if we'd like to use interpretability work to produce useful conjectures about agency and selection more generally. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taking the parameters which seem to matter and rotating them until they don't, published by Garrett Baker on August 26, 2022 on LessWrong. A big bottleneck in interpretability is neural networks are non-local. That is, given the layer setup if we change a small bit of the original activations, then a large bit of the new activations are affected. This is an impediment to finding the circuit-structure of networks. It is difficult to figure out how something works when changing one thing affects everything. The project I'm currently working on aims to fix this issue, without affecting the training dynamics of networks or the function which the network is implementing. The idea is to find a rotation matrix R and insert it with its inverse like below, then group together the rotation with the original activations, and the inverse with the weights and nonlinear function. We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations. This locality is measured by the average sparsity of the jacobian across all the training inputs. We do this because the jacobian is a representation of how each of the inputs affects each of the outputs. Large entries represent large effects. Small entries represent small effects. So if many entries are zero, this means that fewer inputs have an effect on fewer outputs. I.e. local changes to the input cause local changes to the output. This should find us a representation of the activations and interpretations of matrix multiplies that "make sense" in the context of the rest of the network. Another way of thinking about this is that our goal is to find the basis our network is thinking in. Currently I'm getting this method to work on a simple, 3-layer, fully connected MNIST number classifying network. If this seems to give insight into the mechanics of the network after application, the plan is to adapt it to a more complicated network such as a transformer or resnet. I only have preliminary results right now, but they are looking promising: This is the normalized jacobian the middle layer before a rough version of my method: And here is the normalized jacobian after a rough version of my method (the jacobian's output has been set to a basis which maximizes it's sparsity): Thanks David Udell for feedback on the post. I did not listen to everything you said, and if I did the post would have been better This seems important if we'd like to use interpretability work to produce useful conjectures about agency and selection more generally. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Data limited future, published by Donald Hobson on August 6, 2022 on The AI Alignment Forum. This is a prediction of what I think the near future might be like. Suppose the trends in scaling laws roughly continue. Deep learning can do anything if it has enough data. But getting that data is hard. So large language models get better, but not that much better. They are using most of the available high quality text, and it is hard to get more text. Image generation, can do. Self driving cars. Took lots of simulations, lots of gathering data and even building a filmset town in a desert and letting 1000 self driving cars crash all over it for a few months. But we got there. Short videos, can do, 30 second clickbait videos are now generated by something like DALLE-2. Social media profiles. Done. Deep learning can learn to do anything, given enough data. It can't learn to make AI breakthroughs. There just aren't enough examples of humans coming up with breakthroughs. You can train a model to set network parameters, but the best parameters are fairly well understood. And you need to run it on small examples, so you get slightly better parameters for Mnist classifiers. The more choice you give the network, the more likely you are to get something that doesn't function at all. If you just create an RL agent producing code, it won't produce anything that compiles. (or at least anything nontrivial that compiles). If you pretrain on code, your model will output code similar to existing code. So it will output existing algorithms, with the minor adjustments any competent programmer could easily do. Often neural networks. Sometimes k-means or linear regression. If you prompt a large language model for a superintelligent output, you usually get a result like this. This isn't a coincidence, the pattern [prompt asking for superintelligence][Answer degenerating into nonsense] appears many times in the training data. (For example it appears here. And if no one give me a reason not to, it can appear in many other places as well). So in this world, we have AI that is limited to things there is large amounts of data for. If many humans do something every day, and that data is reasonably collectable, an AI can be trained to do it. If the AI can play blindly for a million rounds before it figures something out, then it can do it. Any computer games. Short term picking things up with robot hands etc. If you have lots of robots, and a wholesale supply of eggs, and a team of people to clean up the first attempts, you can train an AI to make an omelette. (Especially if you have many human examples to start with). Some startups are doing this. Perhaps formal theorem provers with deep learning selection of the next line of proof become a thing. The world has reached a new equilibrium. The thing humans still hold over the machines is data efficiency. And there it will stay until someone either manages to invent a much more data efficient algorithm, an abstract theory step that will hopefully take a while, or someone stumbles onto a much more data efficient algorithm trying to make some routine AI, or someone manages to cajole code for a superintelligence out of a language model. This is an equilibrium that could plausibly remain for more than a decade. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ten experiments in modularity, which we'd like you to run!, published by TheMcDouglas on June 16, 2022 on The AI Alignment Forum. This is the third post describing our team's work on selection theorems for modularity, as part of a project mentored by John Wentworth (see here for the earlier posts). Although the theoretical and empirical parts of the project have both been going very well, we're currently bottlenecked on the empirical side: we have several theories and ideas for how to test them, but few experimental results. Right now, we only have one empiricist coding up experiments, so this overhang seems likely to persist. The purpose of this post is to outline some of our ideas for experiments. We hope that this will provide concrete steps for people who are interested in engaging with empirical research in AI safety, or on selection theorems in particular, to contribute to this area. Experiments 1. Investigate (randomly) modulary varying goals in modern deep learning architectures. In our previous post, we discussed the idea of modularly varying goals in more detail. We conjectured that rapidly switching between a small set of different goals might fine-tune the network to deal with this specific switch, rather than providing selection pressure for modularity. One possible solution to this is RMVG (randomly-sampled modularly varying goals), where we switch between a large set of random goals, all still with the same subtasks, but have a large enough set that we never show the network the same task twice. For example, one could build a CNN to recognise two MNIST digits M1,M2, then perform some algebraic operation on the recognised digits (e.g. addition), and measure the modularity of the zero-training-loss solutions found by ADAM (e.g. using the graph theory measure of modularity with the matrix norm of the CNN kernels as weights - see these CHAI papers for more ideas and discussion). To introduce RMVG, you might switch the task during training from calculating M1+M2 to a×M1+b×M2, where a,b are real numbers, varied periodically after some fixed number of epochs. The network wouldn't get a and b as direct inputs; it could only access them indirectly via their effect on the loss function. This task is modular in the sense that it can be factored into the separate tasks of “recognising two separate digits” and “performing joint arithmetical operations on them”, and we might hope that varying the goal in this way causes the network to learn a corresponding modular solution. Questions you might ask: Does this way of varying goals lead to more modular solutions than the fixed goal of a×M1+b×M2? Can you think of any other modular goals which you could run a similar test for? Are there some types of modular goals which produce modularity more reliably? Does switching training methods (e.g. ADAM to SGD or GD) change anything about the average modularity score of solutions? 2) Measure the broadness of modular and non-modular optima in deep learning architectures. We have some theories that predict modular solutions for tasks to be on average broader in the loss function landscape than non-modular solutions. One could test this by making a CNN and getting it to produce some zero training loss solutions to a task, some of which were more modular than others (see experiment (1) for more detail on this, and (4) for another proposed method to produce modularity), then vary the parameters of these optima slightly to see how quickly the loss increases. We've recently done experiments with simple networks and the retina problem, and this intuition seems to hold up. But more replications here would be very valuable, to examine the extent to which the hypothesis is borne out in different situations / task / network sizes. Questions you might ask: Is there a relationship between bro...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humans are very reliable agents, published by alyssavance on June 16, 2022 on LessWrong. Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results? Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. The best current image models have ~99.9% reliability on MNIST, an easy handwriting recognition task. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models. By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.999999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.) (The numbers here are only rough Fermi estimates. I'm sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I'm confident that whichever way you do the math, you'll still find that humans are many orders of magnitude more reliable.) Other types of accidents are similarly rare. Eg. pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.99999999%. Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~99.9999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth. The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, playing poker, proving theorems and so on. Some tasks have been solved or have a solution "in sight", but right now, we're nowhere close to an AI that can replace human surgeons; robot-assisted surgeries still have manual control by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like bl...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humans are very reliable agents, published by alyssavance on June 16, 2022 on LessWrong. Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results? Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. The best current image models have ~99.9% reliability on MNIST, an easy handwriting recognition task. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models. By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.999999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.) (The numbers here are only rough Fermi estimates. I'm sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I'm confident that whichever way you do the math, you'll still find that humans are many orders of magnitude more reliable.) Other types of accidents are similarly rare. Eg. pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.99999999%. Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~99.9999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth. The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, playing poker, proving theorems and so on. Some tasks have been solved or have a solution "in sight", but right now, we're nowhere close to an AI that can replace human surgeons; robot-assisted surgeries still have manual control by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like bl...
Open Tech Talks : Technology worth Talking| Blogging |Lifestyle
An open talk on the challenges of having a data pipeline for the images, audio and videos, The Hub enables to have several famous machine learning datasets with just a single command, like CIFAR-10, MNIST or Fashion-MNIST, Google Objection, ImageNet, COCO, and many others. As I came from a Relational Database management system (RDBS) background, this talk gives me a new perspective and helps to think outside of the known areas. Enjoy the talk with the CEO of Active Loop. This session was recorded in October 2021 and is now being published. Today's Guest Davit Buniatyan, CEO at ActiveLoop.ai A great insight talk with the guest speaker on a topic, a product owner focuses on a dataset format to offer API for creating, storing, and collaborating on any size of AI datasets. What were the challenges faced in the unstructured data storage and how the hub is offering a solution to solve the data problem? The question of where you will store the large data sets of images, and videos - you will get all the answers in this talk How the opensource Github repo is helping thousands of people to use datasets to PyTorch or TensorFlow with one line of code. Website: ActiveLoop.ai Twitter: activeloop Resources: ActiveLoop hub - dataset for AI Book: Good to Great
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Who models the models that model models? An exploration of GPT-3's in-context model fitting ability, published by Lovre on June 7, 2022 on The AI Alignment Forum. Introduction Much has been written and much has been observed about the abilities of GPT-3 on many tasks. Most of these capabilities, though not all, pertain to writing convicing text, but, not to undermine GPT-3's impressiveness at performing these tasks, we might call this the predictable part of its oeuvre. It only makes sense that a better language modelling is, well, going to be better at writing text. Deeper and comparatively much less explored is the unpredictable ability of GPT-3 to learn new tasks by just seeing a few examples, without any training/backpropagation – the so called in-context learning (sometimes called metalearning). The original paper announcing GPT-3 contained a handful of examples (perhaps mostly notably examples of GPT-3 learning to perform arithmetic, e.g. accurate addition of up to 5-digit numbers), Gwern has also insightfully written about it, Frieda Rong has performed some interesting experiments, and there have been various other experiments one could chance upon. My curiosity being piqued but not sated by these experiments, and also having had the feeling that, as captivating the arithmetic examples were, they weren't the most natural question one could ask about a stochastic model's quantitative capabilities – I decided to investigate whether GPT-3 could fit numerical models in-context. The Setup What does it mean to fit a model in-context? Well, recall that GPT-3 has a context window (roughly speaking: text it considers before generating additional text) of length 2048 tokens (roughly speaking: (parts of the) words, punctuation, etc.) so the idea is to put feature vectors inside that context window. Of course, this means you cannot fit any larger or higher-dimensional dataset in there.[1] In practice this means prompting GPT-3 on input like: Input: 89, 51, 73, 31, output = 1 [...] Input: 96, 51, 80, 38, output = 2 Input: 90, 37, 76, 27, output = And then taking its output, i.e. the text it generates, as the prediction. A couple more technical details: In all the experiments I performed, all the numbers were integers. The GPT-3's outputs were always sampled with temperature 0, i.e. they were deterministic – only the most probable output was considered. I restricted myself that way for simplicity, but hopefully some future work looks at the full distribution over outputs. (In the rest of this post I share GPT-3's results without too much warning, so if you'd like to make your own predictions about how GPT-3 does at simple classification and regression tasks in-context, now would be the time to pause reading and make your general predictions.) Iris Dataset Just as there is the MNIST dataset for visual recognition, so too there is a small low-dimensional classification dataset which every would-be classifier has to solve or else sink – the Iris dataset, composed of 150 observations of septal/petal height/width of three species of iris (50 observations each). Note that classification of the Iris dataset is in some sense trivial, with simple algorithms achieving near-perfect accuracy, but a model still needs to do some meaningful learning, given that it needs to learn to differentiate between three classes based on just a few dozen four-dimensional vectors. In order to prevent leakage, as the Iris dataset is all over the internet, as well as to get the more-easily-palatable integers out, I transformed every feature with the transformation xnew=round(14xold+6). I also split the dataset into 50% train and 50% test folds – so that the model "trained" on, or rather looked at 75 examples. I hadn't quite been expecting what had followed – I had expected GPT-3 to catch on to some patt...
Episode 54 - Une nouvelle dimension (au sens le plus mathématique du terme) Du détournement, des espaces vectoriels et du google traduction de thaï, c'est l'épisode 54 de Recommandé. En voiture ! Musiques La playlist de l'émission : https://www.youtube.com/playlist?list=PLNjXbZkItxtY6KOqIbTtbpGn-Z1R391On Contact - envoyez-moi des musiques ! Twitter @recommande0 Discord https://discord.gg/4RnA9v7 Illustration C'est une image aplatie en 2D d'un espace vectoriel à 10 dimensions pour le jeu de données MNIST.
David and Randy explore deep neural networks in Julia using Flux.jl by recreating Grant Sanderson's model for predicting handwritten digits in the MNIST data set. We also show how to visualize model results and training performance in TensorBoard using the TensorBoardLogging.jl package.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Future ML Systems Will Be Qualitatively Different, published by jsteinhardt on January 11, 2022 on LessWrong. In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More is Different in other domains as well, including biology, economics, and computer science. Some examples of More is Different include: Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction. DNA. Given only small molecules such as calcium, you can't meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome. Water. Individual water molecules aren't wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material). Traffic. A few cars on the road are fine, but with too many you get a traffic jam. It could be that 10,000 cars could traverse a highway easily in 15 minutes, but 20,000 on the road at once could take over an hour. Specialization. Historically, in small populations, virtually everyone needed to farm or hunt to survive; in contrast, in larger and denser communities, enough food is produced for large fractions of the population to specialize in non-agricultural work. While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I'll use emergence to refer to qualitative changes that arise from quantitative increases in scale, and phase transitions for cases where the change is sharp. In this post, I'll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks. Emergent Shifts in the History of AI There have already been several examples of quantitative differences leading to important qualitative changes in machine learning. Storage and Learning. The emergence of machine learning as a viable approach to AI is itself an example of More Is Different. While learning had been discussed since the 1950s, it wasn't until the 80s-90s that it became a dominant paradigm: for instance, IBM's first statistical translation model was published in 1988, even though the idea was proposed in 1949. Not coincidentally, 1GB of storage cost over $100k in 1981 but only around $9k in 1990 (adjusted to 2021 dollars). The Hansard corpus used to train IBM's model comprised 2.87 million sentences and would have been difficult to use before the 80s. Even the simple MNIST dataset would have required $4000 in hardware just to store in 1981, but that had fallen to a few dollars by 1998 when it was published. Cheaper hardware thus allowed for a qualitatively new approach to AI: in other words, More storage enabled Different approaches. Compute, Data, and Neural Networks. As hardware improved, it became possible to train neural networks that were very deep for the first time. Better compute enabled bigger models trained for longer, and better storage enabled learning from more data; AlexNet-sized models and ImageNet-sized datasets wouldn't have been feasible for researchers to experiment with in 1990. Deep learning performs well with lots of data and compute, but struggles at smaller scales. Without many resources, simpler algorithms tend to outperform it, but with sufficient resources it pulls far ahead of the pack. This reversal of fortune led to qualitative changes in the field. As one example, the field of machine t...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Future ML Systems Will Be Qualitatively Different, published by jsteinhardt on January 11, 2022 on LessWrong. In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More is Different in other domains as well, including biology, economics, and computer science. Some examples of More is Different include: Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction. DNA. Given only small molecules such as calcium, you can't meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome. Water. Individual water molecules aren't wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material). Traffic. A few cars on the road are fine, but with too many you get a traffic jam. It could be that 10,000 cars could traverse a highway easily in 15 minutes, but 20,000 on the road at once could take over an hour. Specialization. Historically, in small populations, virtually everyone needed to farm or hunt to survive; in contrast, in larger and denser communities, enough food is produced for large fractions of the population to specialize in non-agricultural work. While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I'll use emergence to refer to qualitative changes that arise from quantitative increases in scale, and phase transitions for cases where the change is sharp. In this post, I'll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks. Emergent Shifts in the History of AI There have already been several examples of quantitative differences leading to important qualitative changes in machine learning. Storage and Learning. The emergence of machine learning as a viable approach to AI is itself an example of More Is Different. While learning had been discussed since the 1950s, it wasn't until the 80s-90s that it became a dominant paradigm: for instance, IBM's first statistical translation model was published in 1988, even though the idea was proposed in 1949. Not coincidentally, 1GB of storage cost over $100k in 1981 but only around $9k in 1990 (adjusted to 2021 dollars). The Hansard corpus used to train IBM's model comprised 2.87 million sentences and would have been difficult to use before the 80s. Even the simple MNIST dataset would have required $4000 in hardware just to store in 1981, but that had fallen to a few dollars by 1998 when it was published. Cheaper hardware thus allowed for a qualitatively new approach to AI: in other words, More storage enabled Different approaches. Compute, Data, and Neural Networks. As hardware improved, it became possible to train neural networks that were very deep for the first time. Better compute enabled bigger models trained for longer, and better storage enabled learning from more data; AlexNet-sized models and ImageNet-sized datasets wouldn't have been feasible for researchers to experiment with in 1990. Deep learning performs well with lots of data and compute, but struggles at smaller scales. Without many resources, simpler algorithms tend to outperform it, but with sufficient resources it pulls far ahead of the pack. This reversal of fortune led to qualitative changes in the field. As one example, the field of machine t...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D
Hey guys, in this episode I explain how an autonomous vehicle can learn to drive in a specific city and be able to generalize the driving knowledge to any other city. To explain it, I show the concept of of semi-supervised learning and domain adaptation. I explain the the ideia with the first proposed architecture and what we have today as state of the art for this problem. Instagram: https://www.instagram.com/podcast.lifewithai/ Linkedin: https://www.linkedin.com/company/life-with-ai Code: https://github.com/filipelauar/projects/blob/main/domain_adaptation_semi_supervised_learning_MNIST_pytorch.ipynb
Fala galera, nesse episódio eu explico como um carro autônomo pode ser treinado para dirigir em um país, na Alemanha por exemplo, e mesmo assim ser capaz de dirigir no Brasil com tudo tão diferente. Para explicar isso, eu falo sobre o conceito de domain adaptation (adaptação de domínio) e também semi-supervised learning (aprendizado semi supervisionado). Eu explico a ideia com a primeira arquitetura proposta and também falo sobre o estado da arte que temos hoje para resolver o problema. Instagram: https://www.instagram.com/podcast.lifewithai/ Linkedin: https://www.linkedin.com/company/life-with-ai Código: https://github.com/filipelauar/projects/blob/main/domain_adaptation_semi_supervised_learning_MNIST_pytorch.ipynb
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent” , published by evhub on the AI Alignment Forum. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent”, published by Evan Hubinger on the AI Alignment Forum. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could...
Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants. Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below. RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement? 1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc). 2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we'd much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87). 3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights. 4. Soundness and quality: This is a general category for things like “do we know that the signal isn't overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”. What sorts of things might you measure? 1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become? 2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor. 3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data? 4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer? Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don't think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I'm thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.) You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above. Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments. RFP: Techniques for enhancing human feedback (Ajeya Cotra) (summarized by Rohin): Consider a topic previously analyzed in aligning narrowly superhuman models (AN #141): how can we use human feedback to train models to do what we want, in cases where the models are more knowledgeable than the humans providing the feedback? A variety of techniques have been proposed to solve this problem, including iterated amplification (AN #40), debate (AN #5), recursive reward modeling (AN #34), market making (AN #108), and generalizing from short deliberations to long deliberations. This RFP solicits proposals that aim to test these or other mechanisms on existing systems. There are a variety of ways that to set up the experiments so that the models are more knowledgeable than the humans providing the feedback, for example: 1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with. 2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can. 3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know. RFP: Interpretability (Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to “reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models. This RFP is primarily focused on an aspirational “intermediate” goal: to fully reverse engineer some modern neural network, such as an ImageNet classifier. (Despite the ambition, it is only an “intermediate” goal because what we would eventually need is a general method for cheaply reverse engineering any neural network.) The proposed areas of research are primarily inspired by the Circuits line of work (AN #142): 1. Discovering Features and Circuits: This is the most obvious approach to the aspirational goal. We simply “turn the crank” using existing tools to study new features and circuits, and this fairly often results in an interesting result, that makes progress towards reverse engineering a neural network. 2. Scaling Circuits to Larger Models: So far the largest example of reverse engineering is curve circuits, with 50K parameters. Can we find examples of structure in the neural networks that allow us to drastically reduce the amount of effort required per parameter? (As examples, see equivariance and branch specialization.) 3. Resolving Polysemanticity: One of the core building blocks of the circuits approach is to identify a neuron with a concept, so that connections between neurons can be analyzed as connections between concepts. Unfortunately, some neurons are polysemantic, that is, they encode multiple different concepts. This greatly complicates analysis of the connections and circuits between these neurons. How can we deal with this potential obstacle? Rohin's opinion: The full RFP has many, many more points about these topics; it's 8 pages of remarkably information-dense yet readable prose. If you're at all interested in mechanistic interpretability, I recommend reading it in full. This RFP also has the benefit of having the most obvious pathway to impact: if we understand what algorithm neural networks are running, there's a much better chance that we can catch any problems that arise, especially ones in which the neural network is deliberately optimizing against us. It's one of the few areas where nearly everyone agrees that further progress is especially valuable. RFP: Truthful and honest AI (Owain Evans) (summarized by Rohin): This RFP outlines research projects on Truthful AI (summarized below). They fall under three main categories: 1. Increasing clarity about “truthfulness” and “honesty”. While there are some tentative definitions of these concepts, there is still more precision to be had: for example, how do we deal with statements with ambiguous meanings, or ones involving figurative language? What is the appropriate standard for robustly truthful AI? It seems too strong to require the AI system to never generate a false statement; for example it might misunderstand the meaning of a newly coined piece of jargon. 2. Creating benchmarks and tasks for Truthful AI, such as TruthfulQA (AN #165), which checks for imitative falsehoods. This is not just meant to create a metric to improve on; it may also simply perform as a measurement. For example, we could experimentally evaluate whether honesty generalizes (AN #158), or explore how much truthfulness is reduced when adding in a task-specific objective. 3. Improving the truthfulness of models, for example by finetuning models on curated datasets of truthful utterances, finetuning on human feedback, using debate (AN #5), etc. Besides the societal benefits from truthful AI, building truthful AI systems can also help with AI alignment: 1. A truthful AI system can be used to supervise its own actions, by asking it whether its selected action was good. 2. A robustly truthful AI system could continue to do this after deployment, allowing for ongoing monitoring of the AI system. 3. Similarly, we could have a robustly truthful AI system supervise its own actions in hypothetical scenarios, to make it more robustly aligned. Rohin's opinion: While I agree that making AI systems truthful would then enable many alignment strategies, I'm actually more interested in the methods by which we make AI systems truthful. Many of the ideas suggested in the RFP are ones that would apply for alignment more generally, and aren't particularly specific to truthful AI. So it seems like whatever techniques we used to build truthful AI could then be repurposed for alignment. In other words, I expect that the benefit to AI alignment of working on truthful AI is that it serves as a good test case for methods that aim to impose constraints upon an AI system. In this sense, it is a more challenging, larger version of the ”never describe someone getting injured” challenge (AN #166). Note that I am only talking about how this helps AI alignment; there are also beneficial effects on society from pursuing truthful AI that I haven't talked about here. AI GOVERNANCE Truthful AI: Developing and governing AI that does not lie (Owain Evans, Owen Cotton-Barratt et al) (summarized by Rohin): This paper argues that we should develop both the technical capabilities and the governance mechanisms necessary to ensure that AI systems are truthful. We will primarily think about conversational AI systems here (so not, say, AlphaFold). Some key terms: 1. An AI system is honest if it only makes statements that it actually believes. (This requires you to have some way of ascribing beliefs to the system.) In contrast, truthfulness only checks if statements correspond to reality, without making any claims about the AI system's beliefs. 2. An AI system is broadly truthful if it doesn't lie, volunteers all the relevant information it knows, is well-calibrated and knows the limits of its information, etc. 3. An AI system is narrowly truthful if it avoids making negligent suspected-falsehoods. These are statements that can feasibly be determined by the AI system to be unacceptably likely to be false. Importantly, a narrowly truthful AI is not required to make contentful statements, it can express uncertainty or refuse to answer. This paper argues for narrow truthfulness as the appropriate standard. Broad truthfulness is not very precisely defined, making it challenging to coordinate on. Honesty does not give us the guarantees we want: in settings in which it is advantageous to say false things, AI systems might end up being honest but deluded. They would honestly report their beliefs, but those beliefs might be false. Narrow truthfulness is still a much stronger standard that we impose upon standards. This is desirable, because (1) AI systems need not be constrained by social norms, the way humans are; consequently they need stronger standards, and (2) it may be less costly to enforce that AI systems are narrowly truthful than to enforce that humans are narrowly truthful, so a higher standard is more feasible. Evaluating the (narrow) truthfulness of a model is non-trivial. There are two parts: first, determining whether a given statement is unacceptably likely to be false, and second, determining whether the model was negligent in uttering such a statement. The former could be done by having human processes that study a wide range of information and determine whether a given statement is unacceptably likely to be false. In addition to all of the usual concerns about the challenges of evaluating a model that might know more than you, there is also the challenge that it is not clear exactly what counts as “unacceptably likely to be false”. For example, if a model utters a false statement, but expresses low confidence, how should that be rated? The second part, determining negligence, needs to account for the fact that the AI system might not have had all the necessary information, or that it might not have been capable enough to come to the correct conclusion. One way of handling this is to compare the AI system to other AI systems built in a similar fashion. How might narrow truthfulness be useful? One nice thing it enables is truthfulness amplification, in which we can amplify properties of a model by asking a web of related questions and combining the answers appropriately. For example, if we are concerned that the AI system is deceiving us on just this question, we could ask it whether it is deceiving us, or whether an investigation into its statement would conclude that it was deceptive. As another example, if we are worried that the AI system is making a mistake on some question where its statement isn't obviously false, we can ask it about its evidence for its position and how strong the evidence is (where false statements are more likely to be negligently false). Section 3 is devoted to the potential benefits and costs if we successfully ensure that AI systems are narrowly truthful, with the conclusion that the costs are small relative to the benefits, and can be partially mitigated. Section 6 discusses other potential benefits and costs if we attempt to create truthfulness standards to ensure the AI systems are narrowly truthful. (For example, we might try to create a truthfulness standard, but instead create an institution that makes sure that AI systems follow a particular agenda (by only rating as true the statements that are consistent with that agenda). Section 4 talks about the governance mechanisms we might use to implement a truthfulness standard. Section 5 describes potential approaches for building truthful AI systems. As I mentioned in the highlighted post, these techniques are general alignment techniques that have been specialized for truthful AI. NEWS Q&A Panel on Applying for Grad School (summarized by Rohin): In this event run by AI Safety Support on November 7, current PhD students will share their experiences navigating the application process and AI Safety research in academia. RSVP here. SafeAI Workshop 2022 (summarized by Rohin): The SafeAI workshop at AAAI is now accepting paper submissions, with a deadline of Nov 12. FLI's $25M Grants Program for Existential Risk Reduction (summarized by Rohin): This podcast talks about FLI's recent grants program for x-risk reduction. I've previously mentioned the fellowships (AN #165) they are running as part of this program. As a reminder, the application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship.
Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely: 1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events. 2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality. 3. Alignment: Build models that represent and safely optimize hard-to-specify human values. 4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence. Throughout, the paper attempts to clarify problem's motivation and provide concrete project ideas. Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic: 1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI at least as seriously as safety for nuclear power plants. 2. To grow the ML research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain to AGI exclusively will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try winning them over with interesting problems and try to make work towards safer systems rewarded with prestige. 3. The benefits of a larger ML Safety community are numerous. They can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions. 4. The community will fail to grow if we ignore near-term concerns or actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.) 5. We focus on reinforcement learning but also deep learning. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem. 6. To prevent alienating readers, we did not use phrases such as "AGI." AGI-exclusive research will not scale; for most academics and many industry researchers, it's a nonstarter. Likewise, to prevent needless dismissiveness, we kept x-risks implicit, only hinted at them, or used the phrase "permanent catastrophe." I would have personally enjoyed discussing at length how anomaly detection is an indispensable tool for reducing x-risks from Black Balls, engineered microorganisms, and deceptive ML systems. Here are how the problems relate to x-risk: Adversarial Robustness: This is needed for proxy gaming. ML systems encoding proxies must become more robust to optimizers, which is to say they must become more adversarially robust. We make this connection explicit at the bottom of page 9. Black Swans and Tail Risks: It's hard to be safe without high reliability. It's not obvious we'll achieve high reliability even by the time we have systems that are superhuman in important respects. Even though MNIST is solved for typical inputs, we still do not even have an MNIST classifier for atypical inputs that is reliable! Moreover, if optimizing agents become unreliable in the face of novel or extreme events, they could start heavily optimizing the wrong thing. Models accidentally going off the rails poses an x-risk if they are sufficiently powerful (this is related to "competent errors" and "treacherous turns"). If this problem is not solved, optimizers can use these weaknesses; this is a simpler problem on the way to adversarial robustness. Anomaly and Malicious Use Detection: This is an indispensable tool for detecting proxy gaming, Black Balls, engineered microorganisms that present bio x-risks, malicious users who may misalign a model, deceptive ML systems, and rogue ML systems. Representative Outputs: Making models honest is a way to avoid many treacherous turns. Hidden Model Functionality: This also helps avoid treacherous turns. Backdoors is a potentially useful related problem, as it is about detecting latent but potential sharp changes in behavior. Value Learning: Understanding utilities is difficult even for humans. Powerful optimizers will need to achieve a certain, as-of-yet unclear level of superhuman performance at learning our values. Translating Values to Action: Successfully prodding models to optimize our values is necessary for safe outcomes. Proxy Gaming: Obvious. Value Clarification: This is the philosophy bot section. We will need to decide what values to pursue. If we decide poorly, we may lock in or destroy what is of value. It also possible that there is an ongoing moral catastrophe, which we would not want to replicate across the cosmos. Unintended Consequences: This should help models not accidentally work against our values. ML for Cybersecurity: If you believe that AI governance is valuable and that global turbulence risks can increase risks of terrible outcomes, this section is also relevant. Even if some of the components of ML systems are safe, they can become unsafe when traditional software vulnerabilities enable others to control their behavior. Moreover, traditional software vulnerabilities may lead to the proliferation of powerful advanced models, and this may be worse than proliferating nuclear weapons. Informed Decision Making: We want to avoid decision making based on unreliable gut reactions during a time of crisis. This reduces risks of poor governance of advanced systems. Here are some other notes: 1. We use systems theory to motivate inner optimization as we expect motivation will be more convincing to others. 2. Rather than have a broad call for "interpretability," we focus on specific transparency-related problems that are more tractable and neglected. (See the Appendix for a table assessing importance, tractability, and neglectedness.) For example, we include sections on making models honest and detecting emergent functionality. 3. The "External Safety" section can also be thought of as technical research for reducing "Governance" risks. For readers mostly concerned about AI risks from global turbulence, there still is technical research that can be done. Here are some observations while writing the document: 1. Some approaches that were previously very popular are currently neglected, such as inverse reinforcement learning. This may be due to currently low tractability. 2. Five years ago, I started explicitly brainstorming the content for this document. I think it took the whole time for this document to take shape. Moreover, if this were written last fall, the document would be far more confused, since it took around a year after GPT-3 to become reoriented; writing these types of documents shortly after a paradigm shift may be too hasty. 3. When collecting feedback, it was not uncommon for "in-the-know" researchers to make opposite suggestions. Some people thought some of the problems in the Alignment section were unimportant, while others thought they were the most critical. We attempted to include most research directions. [MLSN #1]: ICLR Safety Paper Roundup (Dan Hendrycks) (summarized by Rohin): This is the first issue of the ML Safety Newsletter, which is "a monthly safety newsletter which is designed to cover empirical safety research and be palatable to the broader machine learning research community". Rohin's opinion: I'm very excited to see this newsletter: this is a category of papers that I want to know about and that are relevant to safety, but I don't have the time to read all of these papers given all the other alignment work I read, especially since I don't personally work in these areas and so often find it hard to summarize them or place them in the appropriate context. Dan on the other hand has written many such papers himself and generally knows the area, and so will likely do a much better job than I would. I recommend you subscribe, especially since I'm not going to send a link to each MLSN in this newsletter. TECHNICAL AI ALIGNMENT TECHNICAL AGENDAS AND PRIORITIZATION Selection Theorems: A Program For Understanding Agents (John Wentworth) (summarized by Rohin): This post proposes a research area for understanding agents: selection theorems. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns). As an example, coherence arguments demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don't lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don't waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state. Coherence arguments aren't the only kind of selection theorem. The good(er) regulator theorem (AN #138) provides a set of scenarios under which agents learn an internal “world model”. The Kelly criterion tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in this followup post. The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another followup post describes some useful properties for which the author expects there are useful selections theorems to prove. Rohin's opinion: People sometimes expect me to be against this sort of work, because I wrote Coherence arguments do not imply goal-directed behavior (AN #35). This is not true. My point in that post is that coherence arguments alone are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed. OTHER PROGRESS IN AI MISCELLANEOUS (AI) State of AI Report 2021 (Nathan Benaich and Ian Hogarth) (summarized by Rohin): As with past (AN #15) reports (AN #120), I'm not going to summarize the entire thing, and instead you get the high-level themes that the authors identified: 1. AI is stepping up in more concrete ways, including in mission critical infrastructure. 2. AI-first approaches have taken biology by storm (and we aren't just talking about AlphaFold). 3. Transformers have emerged as a general purpose architecture for machine learning in many domains, not just NLP. 4. Investors have taken notice, with record funding this year into AI startups, and two first ever IPOs for AI-first drug discovery companies, as well as blockbuster IPOs for data infrastructure and cybersecurity companies that help enterprises retool for the AI-first era. 5. The under-resourced AI-alignment efforts from key organisations who are advancing the overall field of AI, as well as concerns about datasets used to train AI models and bias in model evaluation benchmarks, raise important questions about how best to chart the progress of AI systems with rapidly advancing capabilities. 6. AI is now an actual arms race rather than a figurative one, with reports of recent use of autonomous weapons by various militaries. 7. Within the US-China rivalry, China's ascension in research quality and talent training is notable, with Chinese institutions now beating the most prominent Western ones. 8. There is an emergence and nationalisation of large language models. Rohin's opinion: In last year's report (AN #120), I said that their 8 predictions seemed to be going out on a limb, and that even 67% accuracy woud be pretty impressive. This year, they scored their predictions as 5 “Yes”, 1 “Sort of”, and 2 “No”. That being said, they graded “The first 10 trillion parameter dense model” as “Yes”, I believe on the basis that Microsoft had run a couple of steps of training on a 32 trillion parameter dense model. I definitely interpreted the prediction as saying that a 10 trillion parameter model would be trained to completion, which I do not think happened publicly, so I'm inclined to give it a “No”. Still, this does seem like a decent track record for what seemed to me to be non-trivial predictions. This year's predictions seem similarly "out on a limb" as last year's. This year's report included one slide summaries of many papers I've summarized before. I only found one major issue -- the slide on TruthfulQA (AN #165) implies that larger language models are less honest in general, rather than being more likely to imitate human falsehoods. This is actually a pretty good track record, given the number of things they summarized where I would have noticed if there were major issues. NEWS CHAI Internships 2022 (summarized by Rohin): CHAI internships are open once again! Typically, an intern will execute on an AI safety research project proposed by their mentor, resulting in a first-author publication at a workshop. The early deadline is November 23rd and the regular deadline is December 13th.
#tvae #topographic #equivariant Variational Autoencoders model the latent space as a set of independent Gaussian random variables, which the decoder maps to a data distribution. However, this independence is not always desired, for example when dealing with video sequences, we know that successive frames are heavily correlated. Thus, any latent space dealing with such data should reflect this in its structure. Topographic VAEs are a framework for defining correlation structures among the latent variables and induce equivariance within the resulting model. This paper shows how such correlation structures can be built by correctly arranging higher-level variables, which are themselves independent Gaussians. OUTLINE: 0:00 - Intro 1:40 - Architecture Overview 6:30 - Comparison to regular VAEs 8:35 - Generative Mechanism Formulation 11:45 - Non-Gaussian Latent Space 17:30 - Topographic Product of Student-t 21:15 - Introducing Temporal Coherence 24:50 - Topographic VAE 27:50 - Experimental Results 31:15 - Conclusion & Comments Paper: https://arxiv.org/abs/2109.01394 Code: https://github.com/akandykeller/topog... Abstract: In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks. Authors: T. Anderson Keller, Max Welling Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/ykilcher BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
Этот выпуск состоит из двух частей. Сначала Алина - руководитель бригады подбора стажеров по направлению машинного обучения - немного расскажет о том, как с помощью оплачиваемых стажировок можно получить свой первый опыт работы в ML, а потом я расскажу о том как связать теоретические знания о моделях машинного обучения с практикой их написания и использования на примере учебного датасета изображений рукописных цифр MNIST. Надеюсь, выпуск будет полезен тем, кто уже имеет небольшую теоретическую базу по тому как работает машинное обучение, но еще не очень понимает как от теории перейти к практике написания моделей. Ссылки выпуска: Стажировки в Яндексе (https://yandex.ru/yaintern/) Статья о том, как из пиарщиков девушка стала тестировщиком Яндекс.Станции (https://academy.yandex.ru/posts/byvshiy-piarschik-rasskazyvaet-kak-stat-testirovschikom-yandeks-stantsii) Выпуск подкаста про линейную регрессию (https://anchor.fm/kmsrus/episodes/016-ML-eo11mr/a-a45vl81) Плей-лист selfedu "Нейронные сети на Python. Уроки" (https://www.youtube.com/watch?v=nV7cI5zgOpk&list=PLA0M1Bcd0w8yv0XGiF1wjerjSZVSrYbjh) Мануал "TensorFlow, Keras and deep learning, without a PhD" (https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist#0) Буду благодарен за обратную связь! Оставляйте ваши комментарии там, где можно. Например, в Apple Podcasts. Они помогут сделать подкаст лучше! Напишите что вам было понятно, что не очень, какие темы раскрыть, каких гостей пригласить, ну, и вообще в какую сторону катить этот подкаст :) Подписывайтесь на телеграм-канал "Стать специалистом по машинному обучению" (https://t.me/toBeAnMLspecialist) Телеграм автора подкаста (https://t.me/kmsint) Со мной также можно связаться по электронной почте: kms101@yandex.ru Также теперь подкаст можно найти на YouTube (https://www.youtube.com/channel/UCzvfXLNpB2Bbf32dc7a8oDQ?) и Яндекс.Музыке https://music.yandex.ru/album/9781458 Music by Audionautix.com
Andy and Dave discuss the latest in AI news, including a report that systematically examined 62 studies on COVID-19 ML methods (from a pool o 2200+ studies), and found that none of the models were of potential clinical use due to methodological flaws or underlying biases. MIT and Amazon identify pervasive label errors in popular ML datasets (such as MNIST, CIFAR, Imagenet) and demonstrate that models may learn systematic patterns of label error in order to improve their accuracy. DARPA’s Air Combat Evolution program upgrades its virtual program to include new weapons systems and multiple aircraft, with live Phase 2 tests on schedule for later in 2021. Researchers at the University of Waterloo and Northeastern University publish research working toward self-walking robotic exoskeletons. British researchers add a buccinators (cheek) muscle to robotic busts to better synchronize speech and mouth movements. Russian Promobot is developing hyper-realistic skin for humanoid robots. And Anderson Cooper takes a tour of Boston Dynamics. In research, Leverhulme, Cambridge, Imperial College London, and DeepMind UK publish research on the direct human-AI comparison in the animal-AI environment, using human children ages 6-10 and animal-AI agents across 10 levels of task groupings. Josh Bongard and Michael Levin publish Living Things Are Not (20th Century) Machines, a thought piece on updating how we think of machines and what they *could* be. Professors Jason Jones and Steven Skiena are publishing a running AI Dashboard on Public Opinion of AI. The Australian Department of Defence publishes A Method for Ethical AI in Defence. Raghavendra Gadagkar publishes Experiments in Animal Behavior. And Peter Singer and August Cole publish An Eye for a Storm, envisioning a future of professional military education for the Australian Defence Force. Listeners Survey: https://bit.ly/3bqyiHk Click here to visit our website and explore the links mentioned in the episode.
"Christian just do your emails" 00:00 Plumbing problems 01:31 When do you outsource vs. do it yourself? 13:18 Chris's AI research 17:38 Overview of Machine Learning 22:24 PyTorch vs. TensorFlow 26:02 Python, Javascript, and Ruby 29:41 Slogging through email 33:34 Email systems 40:43 Bookmark bankruptcy 41:44 Zettelkasten Google Search browser plugin HOT TAKE 44:34 Just do your emails Christian Timestamps created with https://clips.marketing by @cgenco 9️⃣MNIST number dataset AI practice: https://twitter.com/chrisachard/status/1369313924970143749
丽莎老师讲机器人之隐私保护新突破:高斯差分隐私框架与深度学习结合欢迎收听丽莎老师讲机器人,想要孩子参加机器人竞赛、创意编程、创客竞赛的辅导,找丽莎老师!欢迎添加微信号:153 5359 2068,或搜索钉钉群:31532843。宾夕法尼亚大学的研究组开发了一个新的数据隐私分析框架,可以在多个类型的机器学习问题中有效保护个人隐私。这个框架现已成功和深度学习结合,并在多个需要保障隐私的深度学习任务中达到最高准确率。什么是差分隐私在这个大数据时代,如何妥善获取和使用与真人相关的数据,渐渐成为迫切需要解决的问题。没有人希望自己生个病,上个网,买件衣服都会被人随意知晓,更别提手机里没有修过的自拍了。一种简单的隐私保护方法就是「匿名」:将收集到的数据中涉及个人信息的特征剔除。可惜这种方法并不可靠,曾有研究将 Netflix 匿名处理过的观影记录通过交叉对比 IMDb 数据库解匿成功,这直接导致了第二届 Netflix 数据分析大奖赛的取消。2006 年,隐私算法的研究迎来了新的里程碑。几位科学家定义了「差分隐私」(以下缩写为 DP),来严谨地分析隐私这个概念。差分隐私很快被证明是个强有效的工具,并被谷歌、苹果、微软、阿里巴巴等各大机构使用。而四位发明者于 2017 年获得了被誉为理论计算机科学界诺贝尔奖的 Godel 奖。要理解差分隐私,我们可以看看下面这个简单的假设检验:假设有两个数据集 S, S'S={小明,小刚,小美};S'={小红,小刚,小美}这两个数据集是邻近的,因为它们的差异仅体现在一个人上。目的是检验模型是否是基于 S 训练的,这等价于检验小明是否存在于我们的数据中。如果这个假设检验非常困难,那么想要获取小明信息的攻击者就难以得逞。严谨来说,一个随机算法 M 符合 (epsilon,delta)-DP 意味着对于任何的事件 E,从定义不难看出,epsilon 和 delta 越小,隐私性越好。那么,如何实现能保证算法的隐私性呢?具体做法是衡量算法的中间产物(比如梯度)的敏感性,并根据其大小施加一个成正比的噪音。由于噪音的存在,想要窃取小明信息的攻击者便无法确定小明是否在训练集中。在深度神经网络中,每一次迭代都会牺牲一部分隐私来换取性能的提高。我们可以对每个批(batch)的梯度加噪音,从而达到混淆攻击者的目的。当然,噪音加的越大,隐私就越安全,但是随之性能也自然越差。在有限的隐私预算下,很多时候隐私算法的性能表现会不如人意。深度学习经常需要敏感的个人信息来训练。现存的差分隐私定义以及隐私模型都试图在性能和隐私中找到一个平衡。可惜的是,这些尝试仍不能很好的处理两个重要环节:subsampling 和 composition。这导致了隐私算法的性能通常远逊于非隐私算法。高斯差分隐私Gaussian differential privacy (GDP) 高斯差分隐私是最近被提出的一种隐私表示方法。它可以精确的刻画 optimizer 在每个 epoch 所消耗的隐私。GDP高斯差分隐私 的表达简洁且是广义的(在 SGD, Adam, Adagrad 等多个优化器上的刻画是完全一样的)。高斯差分隐私GDP 的分析被进一步推广到 Poisson subsampling 和新的优化器上。新的推广得到了理论上严谨的证明,尤其证明了它优于此前最先进的 Moments accountant 方法。在《Gaussian Differential Privacy高斯差分隐私》一文中,宾夕法尼亚大学的研究人员创新性地定义了「f-DP」来刻画隐私。如果用 alpha 来表示第一类错误,beta 来表示第二类错误,对于任何一种拒绝规则 (rejection rule) phi,都存在一个抵换函数 (trade-off function) T:降低第一类错误会导致第二类错误增加,反之亦然。我们将两类错误的和的最小值称为最小错误和。一个随机算法 M 在 S 和 S』上的抵换函数 T 如果始终大于函数 f,那么它就满足 f-DP。对比于传统的 eps,delta-DP,f-DP 使用的是一个函数 f,这也使得其刻画更为自由和准确。作为 f-DP 的一个重要案例,研究人员随后介绍了高斯差分隐私(GDP)来区分两个高斯分布。根据中心极限定理(CLT),任何基于假设检验的隐私定义在极限情况下都会收敛于 GDP。事实上,相对于谷歌在 2016 年提出的,适用于计算 epsilon,delta-DP 的 Moments Accountant (MA) 方法,本文提出的 CLT 方法可以更简易地计算 GDP,而且非常准确。GDP 的好处还不止于此。在最新工作《Deep Learning with Gaussian Differential Privacy》中,卜至祺、董金硕,龙琦和苏炜杰等作者指出 GDP 和 eps,delta-DP 可以通过他们设计的 Dual 函数互相转换。也就是说,研究者可以在 f-DP 的框架下分析算法再转成传统的 dp,或者从传统领域中拿来已有的理论和技巧,不必二次开发。这项技术现在已经在 TensorFlow 中实现。在实验中,将 GDP 和深度学习结合,并在多种类型的任务上取得了不俗的成绩。此前谷歌也曾将 epsDP 和深度学习结合,虽然在 MNIST 图像识别上取得了 97% 的正确率(无隐私算法可达到 99% 以上),在 CIFAR10 上却止步于 73% 的正确率(无隐私算法可达 86%)。而利用 GDP 的精确刻画,作者们在 MNIST 上取得了 98% 的准确率。不仅如此,MA 计算的结果表示 MNIST 的 96.6% 正确率对应的是 9.4% 的最小错误和,意味着攻击者有超过九成的概率猜对一张图片是否在数据集中。而 CLT 的计算表明 epsDP 太过于保守:同样的模型同样的表现,实际对应的最小错误和其实是 77.6%,也就是说隐私并没有损失很多。为了全面探讨 GDP 的优越性,作者在 高斯差分隐私GDP 框架下分析了神经网络的表现。实现了 SGD 和 Adam 的隐私版本,并通过让神经网络不断迭代直到 高斯差分隐私GDP 达到了 mu=2。在 IMDb(自然语言处理),MovieLens 1M(推荐系统)和 Adult Income(非图像型分类任务)上,GDP 模型都取得了非常接近无隐私模型的性能。例如在 Adult Income 数据上,隐私神经网络和无隐私神经网络表现几乎一样好,意味着隐私也许并不需要以很大的性能牺牲为代价。更进一步的,作者强调文中的神经网络都相对简单(不超过三层),如果使用更复杂更高级的神经网络可以在同样的隐私保证下更显著地提升性能。而另一方面,使用高效的优化算法(减少迭代次数,即隐私的损失次数)也能让性能变得更好。既然 CLT 可以在同样性能的条件下比 MA 更好地保护隐私,那么反过来想,在同样的隐私预算下,GDP 也能显示出更强的性能。作者构思了一个实验来说明这一点:训练一个加了 sigma 噪音的神经网络若干步,通过 MA 可以算出目前损失了多少隐私,通过 CLT 和 Dual 反解出真正必须的噪音 sigma hat。注意 sigma hat 必然小于 sigma,然后训练同一个神经网络但只加 sigma hat 噪音。由于噪音变小,新的神经网络学习效果会更好,而且在每一次迭代,新神经网络都会更好地保护隐私。将神经网络和 GDP 结合,可以更精准地呈现隐私损失,从而更好地保护隐私以及提升隐私算法的性能。另一方面,已有的 (epsilon,delta)-DP 研究也可以嫁接到 GDP 中,为两个领域带来了新的机遇。这一隐私算法领域的新进展给予了研究者们更大的信心去相信,随着机器学习的进一步发展,也许在不远的未来就能以可忽略不计的代价来保护我们的隐私。同时,它也鼓励人们更愿意分享涉及个人信息的数据,来推动机器学习的发展。
丽莎老师讲机器人之隐私保护新突破:高斯差分隐私框架与深度学习结合欢迎收听丽莎老师讲机器人,想要孩子参加机器人竞赛、创意编程、创客竞赛的辅导,找丽莎老师!欢迎添加微信号:153 5359 2068,或搜索钉钉群:31532843。宾夕法尼亚大学的研究组开发了一个新的数据隐私分析框架,可以在多个类型的机器学习问题中有效保护个人隐私。这个框架现已成功和深度学习结合,并在多个需要保障隐私的深度学习任务中达到最高准确率。什么是差分隐私在这个大数据时代,如何妥善获取和使用与真人相关的数据,渐渐成为迫切需要解决的问题。没有人希望自己生个病,上个网,买件衣服都会被人随意知晓,更别提手机里没有修过的自拍了。一种简单的隐私保护方法就是「匿名」:将收集到的数据中涉及个人信息的特征剔除。可惜这种方法并不可靠,曾有研究将 Netflix 匿名处理过的观影记录通过交叉对比 IMDb 数据库解匿成功,这直接导致了第二届 Netflix 数据分析大奖赛的取消。2006 年,隐私算法的研究迎来了新的里程碑。几位科学家定义了「差分隐私」(以下缩写为 DP),来严谨地分析隐私这个概念。差分隐私很快被证明是个强有效的工具,并被谷歌、苹果、微软、阿里巴巴等各大机构使用。而四位发明者于 2017 年获得了被誉为理论计算机科学界诺贝尔奖的 Godel 奖。要理解差分隐私,我们可以看看下面这个简单的假设检验:假设有两个数据集 S, S'S={小明,小刚,小美};S'={小红,小刚,小美}这两个数据集是邻近的,因为它们的差异仅体现在一个人上。目的是检验模型是否是基于 S 训练的,这等价于检验小明是否存在于我们的数据中。如果这个假设检验非常困难,那么想要获取小明信息的攻击者就难以得逞。严谨来说,一个随机算法 M 符合 (epsilon,delta)-DP 意味着对于任何的事件 E,从定义不难看出,epsilon 和 delta 越小,隐私性越好。那么,如何实现能保证算法的隐私性呢?具体做法是衡量算法的中间产物(比如梯度)的敏感性,并根据其大小施加一个成正比的噪音。由于噪音的存在,想要窃取小明信息的攻击者便无法确定小明是否在训练集中。在深度神经网络中,每一次迭代都会牺牲一部分隐私来换取性能的提高。我们可以对每个批(batch)的梯度加噪音,从而达到混淆攻击者的目的。当然,噪音加的越大,隐私就越安全,但是随之性能也自然越差。在有限的隐私预算下,很多时候隐私算法的性能表现会不如人意。深度学习经常需要敏感的个人信息来训练。现存的差分隐私定义以及隐私模型都试图在性能和隐私中找到一个平衡。可惜的是,这些尝试仍不能很好的处理两个重要环节:subsampling 和 composition。这导致了隐私算法的性能通常远逊于非隐私算法。高斯差分隐私Gaussian differential privacy (GDP) 高斯差分隐私是最近被提出的一种隐私表示方法。它可以精确的刻画 optimizer 在每个 epoch 所消耗的隐私。GDP高斯差分隐私 的表达简洁且是广义的(在 SGD, Adam, Adagrad 等多个优化器上的刻画是完全一样的)。高斯差分隐私GDP 的分析被进一步推广到 Poisson subsampling 和新的优化器上。新的推广得到了理论上严谨的证明,尤其证明了它优于此前最先进的 Moments accountant 方法。在《Gaussian Differential Privacy高斯差分隐私》一文中,宾夕法尼亚大学的研究人员创新性地定义了「f-DP」来刻画隐私。如果用 alpha 来表示第一类错误,beta 来表示第二类错误,对于任何一种拒绝规则 (rejection rule) phi,都存在一个抵换函数 (trade-off function) T:降低第一类错误会导致第二类错误增加,反之亦然。我们将两类错误的和的最小值称为最小错误和。一个随机算法 M 在 S 和 S』上的抵换函数 T 如果始终大于函数 f,那么它就满足 f-DP。对比于传统的 eps,delta-DP,f-DP 使用的是一个函数 f,这也使得其刻画更为自由和准确。作为 f-DP 的一个重要案例,研究人员随后介绍了高斯差分隐私(GDP)来区分两个高斯分布。根据中心极限定理(CLT),任何基于假设检验的隐私定义在极限情况下都会收敛于 GDP。事实上,相对于谷歌在 2016 年提出的,适用于计算 epsilon,delta-DP 的 Moments Accountant (MA) 方法,本文提出的 CLT 方法可以更简易地计算 GDP,而且非常准确。GDP 的好处还不止于此。在最新工作《Deep Learning with Gaussian Differential Privacy》中,卜至祺、董金硕,龙琦和苏炜杰等作者指出 GDP 和 eps,delta-DP 可以通过他们设计的 Dual 函数互相转换。也就是说,研究者可以在 f-DP 的框架下分析算法再转成传统的 dp,或者从传统领域中拿来已有的理论和技巧,不必二次开发。这项技术现在已经在 TensorFlow 中实现。在实验中,将 GDP 和深度学习结合,并在多种类型的任务上取得了不俗的成绩。此前谷歌也曾将 epsDP 和深度学习结合,虽然在 MNIST 图像识别上取得了 97% 的正确率(无隐私算法可达到 99% 以上),在 CIFAR10 上却止步于 73% 的正确率(无隐私算法可达 86%)。而利用 GDP 的精确刻画,作者们在 MNIST 上取得了 98% 的准确率。不仅如此,MA 计算的结果表示 MNIST 的 96.6% 正确率对应的是 9.4% 的最小错误和,意味着攻击者有超过九成的概率猜对一张图片是否在数据集中。而 CLT 的计算表明 epsDP 太过于保守:同样的模型同样的表现,实际对应的最小错误和其实是 77.6%,也就是说隐私并没有损失很多。为了全面探讨 GDP 的优越性,作者在 高斯差分隐私GDP 框架下分析了神经网络的表现。实现了 SGD 和 Adam 的隐私版本,并通过让神经网络不断迭代直到 高斯差分隐私GDP 达到了 mu=2。在 IMDb(自然语言处理),MovieLens 1M(推荐系统)和 Adult Income(非图像型分类任务)上,GDP 模型都取得了非常接近无隐私模型的性能。例如在 Adult Income 数据上,隐私神经网络和无隐私神经网络表现几乎一样好,意味着隐私也许并不需要以很大的性能牺牲为代价。更进一步的,作者强调文中的神经网络都相对简单(不超过三层),如果使用更复杂更高级的神经网络可以在同样的隐私保证下更显著地提升性能。而另一方面,使用高效的优化算法(减少迭代次数,即隐私的损失次数)也能让性能变得更好。既然 CLT 可以在同样性能的条件下比 MA 更好地保护隐私,那么反过来想,在同样的隐私预算下,GDP 也能显示出更强的性能。作者构思了一个实验来说明这一点:训练一个加了 sigma 噪音的神经网络若干步,通过 MA 可以算出目前损失了多少隐私,通过 CLT 和 Dual 反解出真正必须的噪音 sigma hat。注意 sigma hat 必然小于 sigma,然后训练同一个神经网络但只加 sigma hat 噪音。由于噪音变小,新的神经网络学习效果会更好,而且在每一次迭代,新神经网络都会更好地保护隐私。将神经网络和 GDP 结合,可以更精准地呈现隐私损失,从而更好地保护隐私以及提升隐私算法的性能。另一方面,已有的 (epsilon,delta)-DP 研究也可以嫁接到 GDP 中,为两个领域带来了新的机遇。这一隐私算法领域的新进展给予了研究者们更大的信心去相信,随着机器学习的进一步发展,也许在不远的未来就能以可忽略不计的代价来保护我们的隐私。同时,它也鼓励人们更愿意分享涉及个人信息的数据,来推动机器学习的发展。
In this episode we will talk all about the various steps to transition to data science from non computer science backgrounds. One of the main difficulties people face from non-CS backgrounds is how overwhelming it can be to transition to data science field, I talk about my own journey, and share the 6 steps which can help you in your own data science career! 00:00 to 02:10: Introduction 02:11 to 06:00: My Background of moving to data science from electrical engineering 06:01 to 10:56: Steps 1 to 3 covering things like using external APIs, already processed datasets and performing full stack data science work 10:57 to 11:55: Break sponsored by Anchor 11:56: End: Steps 4 to 6 covering things like math and statistics, machine learning pipelines and data structures & algorithms Some useful links: 1) Andrew Ng Deep Learning Specialization Coursera https://www.coursera.org/specializations/deep-learning 2) Intro to Statistics by Sebastien Thrun https://www.udacity.com/course/intro-to-statistics--st101 3) Aurelion Geron's book on machine learning https://www.amazon.com/dp/1491962291/?tag=omnilence-20 4) Pramp for mock algorithm sessions on video https://www.pramp.com/ 5) Leetcode for algorithm question datasets https://leetcode.com/ Some great datasets to get started in machine learning: 6) MNIST for hand written digits https://www.kaggle.com/c/digit-recognizer 7) Iris dataset for flower classification http://archive.ics.uci.edu/ml/datasets/iris 8) IMDB movie reviews https://ai.stanford.edu/~amaas/data/sentiment/ Thanks for listening! --- Send in a voice message: https://anchor.fm/the-data-life-podcast/message Support this podcast: https://anchor.fm/the-data-life-podcast/support
This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Today, we’re joined by Aydogan Ozcan, Professor of Electrical and Computer Engineering at UCLA, where his research group focuses on photonics and its applications to nano- and biotechnology. In our conversation, we explore his group's research into the intersection of deep learning and optics, holography and computational imaging. We specifically look at a really interesting project to create all-optical neural networks which work based on diffraction, where the printed pixels of the network are analogous to neurons. We also explore some of the practical applications for their research and other areas of interest for their group. The complete show notes for this episode can be found at twimlai.com/talk/237. Be sure to subscribe to our weekly newsletter at twimlai.com/newsletter!
Манай энэ удаагийн дугаарт Facebook компанийн xиймэл оюун ухааны судалгааны төвд ажилладаг Сайнбаяр оролцон Machine Learning салбарт, тэр дундаа бататган суралцах (Reinforcement Learning) арга барилд гарч буй сонин сайхан болон тус салбарын давуу сул талуудыг тоймлон ярилцлаа. Робот болон дата центрүүдэд нэвтэрсэн энэ хүчээ авч буй салбарын талаар ярилцахдаа Machine Learning салбарт судлаачаар болон хэрэглэгчээр хэрхэн орж суралцах талаар зочин маань үнэтэй зөвлөгөөгөө өгсөн юм. Түүний сонирхуулсан зүйлсийг доор хүргэж байна. 1) https://www.udemy.com/practical-deep-learning-with-pytorch/ 2) Deep Learning Text book: https://www.deeplearningbook.org/ 3) Thesis: https://cs.nyu.edu/media/publications/sukhbaatar_sainbayar.pdf 4) MNIST: http://yann.lecun.com/exdb/mnist/ 5) PyTorch: https://pytorch.org/get-started/locally/
In shorter news items, Andy and Dave discuss the announcement that the Allen Institute for Artificial Intelligence is partnering with Microsoft Research to connect AI2’s Semantic Scholar academic search engine with Microsoft’s Academic Graph. The University of Pavia in Italy demonstrates an artificial neuron (a perceptron) on an actual quantum processor. Another Tesla on Autopilot has an accident; and Waymo demonstrates that pure imitation learning (with 30 million examples) is not sufficient for teaching a model to drive a car. And Tumblr implements a porn-detecting AI. In research topics, researchers with Facebook AI, MIT, and UC Berkeley demonstrate “dataset distillation,” compressing 60,000 MNIST images into 10 synthetic images. Researchers at University of Maryland demonstrate the ability to hide adversarial attacks from network interpretation; so for networks that visually locate the item identified, that network would locate the “original” item instead of the adversarial item. Adobe and Auburn show that neural networks fail miserably for “out-of-distribution” inputs (or, “strange poses of familiar objects”), and they probe deeper into the parameters that cause the misbehavior. In other news, the AI Narratives Report explores how AI is portrayed and perceived. The AI Index releases its 2018 version. AI researchers have a spirited debate on Twitter about deep learning and symbol manipulation. Quantum Computing: Progress and Prospects provides a deeper look at this nascent technology. And Juergen Schmidhuber gives a TEDx talk on how “true AI” will change everything.
Let's jump to the code! Killian and Rosane work through the Windows ML MNIST tutorial and answer questions developers commonly ask along the way. https://aka.ms/OverviewOfWinML
Looking at Lumina Desktop 2.0, 2 months of KPTI development in SmartOS, OpenBSD email service, an interview with Ryan Zezeski, NomadBSD released, and John Carmack's programming retreat with OpenBSD. This episode was brought to you by Headlines Looking at Lumina Desktop 2.0 (https://www.trueos.org/blog/looking-lumina-desktop-2-0/) A few weeks ago I sat down with Lead Developer Ken Moore of the TrueOS Project to get answers to some of the most frequently asked questions about Lumina Desktop from the open source community. Here is what he said on Lumina Desktop 2.0. Do you have a question for Ken and the rest of the team over at the TrueOS Project? Make sure to read the interview and comment below. We are glad to answer your questions! Ken: Lumina Desktop 2.0 is a significant overhaul compared to Lumina 1.x. Almost every single subsystem of the desktop has been streamlined, resulting in a nearly-total conversion in many important areas. With Lumina Desktop 2.0 we will finally achieve our long-term goal of turning Lumina into a complete, end-to-end management system for the graphical session and removing all the current runtime dependencies from Lumina 1.x (Fluxbox, xscreensaver, compton/xcompmgr). The functionality from those utilities is now provided by Lumina Desktop itself. Going along with the session management changes, we have compressed the entire desktop into a single, multi-threaded binary. This means that if any rogue script or tool starts trying to muck about with the memory used by the desktop (probably even more relevant now than when we started working on this), the entire desktop session will close/crash rather than allowing targeted application crashes to bypass the session security mechanisms. By the same token, this also prevents “man-in-the-middle” type of attacks because the desktop does not use any sort of external messaging system to communicate (looking at you dbus). This also gives a large performance boost to Lumina Desktop The entire system for how a user's settings get saved and loaded has been completely redone, making it a “layered” settings system which allows the default settings (Lumina) to get transparently replaced by system settings (OS/Distributor/SysAdmin) which can get replaced by individual user settings. This results in the actual changes in the user setting files to be kept to a minimum and allows for a smooth transition between updates to the OS or Desktop. This also provides the ability to “restrict” a user's desktop session (based on a system config file) to the default system settings and read-only user sessions for certain business applications. The entire graphical interface has been written in QML in order to fully-utilize hardware-based GPU acceleration with OpenGL while the backend logic and management systems are still written entirely in C++. This results in blazing fast performance on the backend systems (myriad multi-threaded C++ objects) as well as a smooth and responsive graphical interface with all the bells and whistles (drag and drop, compositing, shading, etc). Q: Are there future plans to implement something like Lumina in a MAC Jail? While I have never tried out Lumina in a MAC jail, I do not see anything on that page which should stop it from running in one right now. Lumina is already designed to be run as an unpriviledged user and is very smart about probing the system to find out what is/not available before showing anything to the user. The only thing that comes to mind is that you might need to open up some other system devices so that X11 itself can draw to the display (graphical environment setup is a bit different than CLI environment). Q: I look forward to these changes. I know the last time I used it when I would scroll I would get flashes like the refresh rate was not high enough. It will be nice to have a fast system as well as I know with the more changes Linux is becoming slower. Not once it has loaded but in the loading process. I will do another download when these changes come out and install again and maybe stay this time. If I recall correctly, one of the very first versions of Lumina (pre-1.0) would occasionally flicker. If that is still happening, you might want to verify that you are using the proper video driver for your hardware and/or enable the compositor within the Lumina settings. Q: Why was enlightenment project not considered for TrueOS? It is BSD licensed and is written in C. This was a common question about 4(?) years ago with the first release of the Lumina desktop and it basically boiled down to long-term support and reliability of the underlying toolkit. Some of the things we had to consider were: cross-platform/cross-architecture support, dependency reliability and support framework (Qt5 > EFL), and runtime requirements and dependency tracking (Qt5 is lighter than the EFL). That plus the fact that the EFL specifically states that it is linux-focused and the BSD's are just an afterthought (especially at the time we were doing the evaluation). Q: I have two questions. 1) The default layout of Unity(menu bar with actual menu entries on top and icon dock on the side) is one of the few things I liked about my first voyage into non-Windows systems, and have been missing since moving on to other distros(and now also other non-Linux systems). However in 1.4.0 screenshots on Lumina's site, the OSX-like layout has the menu attached to the window. Will 2.0 be able to have the menus on the bar? 2) Is there any timeline for a public release, or are you taking a “when it's ready” approach? In Lumina you can already put panels on the left/right side of the screen and give you something like the layout of the Unity desktop. The embedded menu system is not available in Lumina because that is not a specification supported by X11 and the window manager standards at the present time. The way that functionality is currently run on Linux is a hacky-bypass of the display system which only really works with the GTK3 and Qt5 toolkits, resulting in very odd overall desktop behavior in mixed environments where some apps use other graphical toolkits. We are targetting the 18.06 STABLE release of TrueOS for Lumina 2, but that is just a guideline and if necessary we will push back the release date to allow for additional testing/fixing as needed. A long two months (https://blog.cooperi.net/a-long-two-months) IllumOS/SmartOS developer Alex Wilson describes the journey of developing KPTI for IllumOS > On Monday (January 1st) I had the day off work for New Year's day, as is usual in most of the western world, so I slept in late. Lou and her friend decided to go to the wax museum and see several tourist attractions around SF, and I decided to pass the day at home reading. That afternoon, work chat started talking about a Tumblr post by pythonsweetness about an Intel hardware security bug. At the time I definitely did not suspect that this was going to occupy most of my working life for the next (almost) two months. Like many people who work on system security, I had read Anders Fogh's post about a "Negative Result" in speculative execution research in July of 2017. At the time I thought it was an interesting writeup and I remember being glad that researchers were looking into this area. I sent the post to Bryan and asked him about his thoughts on it at the time, to which he replied saying that "it would be shocking if they left a way to directly leak out memory in the speculative execution". None of us seriously thought that there would be low-hanging fruit down that research path, but we also felt it was important that there was someone doing work in the area who was committed to public disclosure. At first, after reading the blog post on Monday, we thought (or hoped) that the bug might "just" be a KASLR bypass and wouldn't require a lot of urgency. We tried to reach out to Intel at work to get more information but were met with silence. (We wouldn't hear back from them until after the disclosure was already made public.) The speculation on Tuesday intensified, until finally on Wednesday morning I arrived at the office to find links to late Tuesday night tweets revealing exploits that allowed arbitrary kernel memory reads. Wednesday was not a happy day. Intel finally responded to our emails -- after they had already initiated public disclosure. We all spent a lot of time reading. An arbitrary kernel memory read (an info leak) is not that uncommon as far as bugs go, but for the most part they tend to be fairly easy to fix. The thing that makes the Meltdown and Spectre bugs particularly notable is that in order to mitigate them, a large amount of change is required in very deep low-level parts of the kernel. The kind of deep parts of the kernel where there are 20-year old errata workarounds that were single-line changes that you have to be very careful to not accidentally undo; the kind of parts where, as they say, mortals fear to tread. On Friday we saw the patches Matthew Dillon put together for DragonFlyBSD for the first time. These were the first patches for KPTI that were very straightforward to read and understand, and applied to a BSD-derived kernel that was similar to those I'm accustomed to working on. To mitigate Meltdown (and partially one of the Spectre variants), you have to make sure that speculative execution cannot reach any sensitive data from a user context. This basically means that the pages the kernel uses for anything potentially sensitive have to be unmapped when we are running user code. Traditionally, CPUs that were built to run a multi-user, UNIX-like OS did this by default (SPARC is an example of such a CPU which has completely separate address spaces for the kernel and userland). However, x86 descends from a single-address-space microcontroller that has grown up avoiding backwards-incompatible changes, and has never really introduced a clean notion of multiple address spaces (segmentation is the closest feature really, and it was thrown out for 64-bit AMD64). Instead, operating systems for x86 have generally wound up (at least in the post-AMD64 era) with flat address space models where the kernel text and data is always present in the page table no matter whether you're in user or kernel mode. The kernel mappings simply have the "supervisor" bit set on them so that user code can't directly access them. The mitigation is basically to stop doing this: to stop mapping the kernel text, data and other memory into the page table while we're running in userland. Unfortunately, the x86 design does not make this easy. In order to be able to take interrupts or traps, the CPU has to have a number of structures mapped in the current page table at all times. There is also no ability to tell an x86 CPU that you want it to switch page tables when an interrupt occurs. So, the code that we jump to when we take an interrupt, as well as space for a stack to push context onto have to be available in both page tables. And finally, of course, we need to be able to figure out somehow what the other page table we should switch to is when we enter the kernel. When we looked at the patches for Linux (and also the DragonFlyBSD patches at the time) on Friday and started asking questions, it became pretty evident that the initial work done by both was done under time constraints. Both had left the full kernel text mapped in both page tables, and the Linux trampoline design seemed over-complex. I started talking over some ideas with Robert Mustacchi about ways to fix these and who we should talk to, and reached out to some of my old workmates from the University of Queensland who were involved with OpenBSD. It seemed to me that the OpenBSD developers would care about these issues even more than we did, and would want to work out how to do the mitigation right. I ended up sending an email to Philip Guenther on Friday afternoon, and on Saturday morning I drove an hour or so to meet up with him for coffee to talk page tables and interrupt trampolines. We wound up spending a good 6 hours at the coffee shop, and I came back with several pages of notes and a half-decent idea of the shape of the work to come. One detail we missed that day was the interaction of per-CPU structures with per-process page tables. Much of the interrupt trampoline work is most easily done by using per-CPU structures in memory (and you definitely want a per-CPU stack!). If you combine that with per-process page tables, however, you have a problem: if you leave all the per-CPU areas mapped in all the processes, you will leak information (via Meltdown) about the state of one process to a different one when taking interrupts. In particular, you will leak things like %rip, which ruins all the work being done with PIE and ASLR pretty quickly. So, there are two options: you can either allocate the per-CPU structures per-process (so you end up with $NCPUS * $NPROCS of them); or you can make the page tables per-CPU. OpenBSD, like Linux and the other implementations so far, decided to go down the road of per-CPU per-process pages to solve this issue. For illumos, we took the other route. In illumos, it turned out that we already had per-CPU page tables. Robert and I re-discovered this on the Sunday of that week. We use them for 32-bit processes due to having full P>V PAE support in our kernel (which is, as it turns out, relatively uncommon amongst open-source OS). The logic to deal with creating and managing them and updating them was all already written, and after reading the code we concluded we could basically make a few small changes and re-use all of it. So we did. By the end of that second week, we had a prototype that could get to userland. But, when working on this kind of kernel change we have a rule of thumb we use: after the first 70% of the patch is done and we can boot again, now it's time for the second 70%. In fact it turned out to be more like the second 200% for us -- a tedious long tail of bugs to solve that ended up necessitating some changes in the design as well. At first we borrowed the method that Matt Dillon used for DragonFlyBSD, by putting the temporary "stack" space and state data for the interrupt trampolines into an extra page tacked onto the end of *%gs (in illumos the structure that lives there is the cpu_t). If you read the existing logic in interrupt handlers for dealing with %gs though, you will quickly notice that the corner cases start to build up. There are a bunch of situations where the kernel temporarily alters %gs, and some of the ways to mess it up have security consequences that end up being worse than the bug we're trying to fix. As it turns out, there are no less than 3 different ways that ISRs use to try to get to having the right cpu_t in %gs on illumos, as it turns out, and they are all subtly different. Trying to tell which you should use when requires a bunch of test logic that in turn requires branches and changes to the CPU state, which is difficult to do in a trampoline where you're trying to avoid altering that state as much as possible until you've got the real stack online to push things into. I kept in touch with Philip Guenther and Mike Larkin from the OpenBSD project throughout the weeks that followed. In one of the discussions we had, we talked about the NMI/MCE handlers and the fact that their handling currently on OpenBSD neglected some nasty corner-cases around interrupting an existing trap handler. A big part of the solution to those issues was to use a feature called IST, which allows you to unconditionally change stacks when you take an interrupt. Traditionally, x86 only changes the stack pointer (%rsp on AMD64) while taking an interrupt when there is a privilege level change. If you take an interrupt while already in the kernel, the CPU does not change the stack pointer, and simply pushes the interrupt stack frame onto the stack you're already using. IST makes the change of stack pointer unconditional. If used unwisely, this is a bad idea: if you stay on that stack and turn interrupts back on, you could take another interrupt and clobber the frame you're already in. However, in it I saw a possible way to simplify the KPTI trampoline logic and avoid having to deal with %gs. A few weeks into the project, John Levon joined us at work. He had previously worked on a bunch of Xen-related stuff as well as other parts of the kernel very close to where we were, so he quickly got up to speed with the KPTI work as well. He and I drafted out a "crazy idea" on the whiteboard one afternoon where we would use IST for all interrupts on the system, and put the "stack" they used in the KPTI page on the end of the cpu_t. Then, they could easily use stack-relative addresses to get the page table to change to, then pivot their stack to the real kernel stack memory, and throw away (almost) all the conditional logic. A few days later, we had convinced each other that this was the way to go. Two of the most annoying x86 issues we had to work around were related to the SYSENTER instruction. This instruction is used to make "fast" system calls in 32-bit userland. It has a couple of unfortunate properties: firstly, it doesn't save or restore RFLAGS, so the kernel code has to take care of this (and be very careful not to clobber any of it before saving or after restoring it). Secondly, if you execute SYSENTER with the TF ("trap"/single-step flag) set by a debugger, the resulting debug trap's frame points at kernel code instead of the user code where it actually happened. The first one requires some careful gymnastics on the entry and return trampolines specifically for SYSENTER, while the second is a nasty case that is incidentally made easier by using IST. With IST, we can simply make the debug trap trampoline check for whether we took the trap in another trampoline's code, and reset %cr3 and the destination stack. This works for single-stepping into any of the handlers, not just the one for SYSENTER. To make debugging easier, we decided that traps like the debug/single-step trap (as well as faults like page faults, #GP, etc.) would push their interrupt frame in a different part of the KPTI state page to normal interrupts. We applied this change to all the traps that can interrupt another trampoline (based on the instructions we used). These "paranoid" traps also set a flag in the KPTI struct to mark it busy (and jump to the double-fault handler if it is), to work around some bugs where double-faults are not correctly generated. It's been a long and busy two months, with lots of time spent building, testing, and validating the code. We've run it on as many kinds of machines as we could get our hands on, to try to make sure we catch issues. The time we've spent on this has been validated several times in the process by finding bugs that could have been nasty in production. One great example: our patches on Westmere-EP Xeons were causing busy machines to throw a lot of L0 I-cache parity errors. This seemed very mysterious at first, and it took us a few times seeing it to believe that it was actually our fault. This was actually caused by the accidental activation of a CPU errata for Westmere (B52, "Memory Aliasing of Code Pages May Cause Unpredictable System Behaviour") -- it turned out we had made a typo and put the "cacheable" flag into a variable named flags instead of attrs where it belonged when setting up the page tables. This was causing performance degradation on other machines, but on Westmere it causes cache parity errors as well. This is a great example of the surprising consequences that small mistakes in this kind of code can end up having. In the end, I'm glad that that erratum existed, otherwise it may have been a long time before we caught that bug. As of this week, Mike and Philip have committed the OpenBSD patches for KPTI to their repository, and the patches for illumos are out for review. It's a nice kind of symmetry that the two projects who started on the work together after the public disclosure at the same time are both almost ready to ship at the same time at the other end. I'm feeling hopeful, and looking forward to further future collaborations like this with our cousins, the BSDs. The IllumOS work has since landed, on March 12th (https://github.com/joyent/illumos-joyent/commit/d85fbfe15cf9925f83722b6d62da49d549af615c) *** OpenBSD Email Service (https://github.com/vedetta-com/caesonia) Features Efficient: configured to run on min. 512MB RAM and 20GB SSD, a KVM (cloud) VPS for around $2.50/mo 15GB+ uncompressed Maildir, rivals top free-email providers (grow by upgrading SSD) Email messages are gzip compressed, at least 1/3 more space with level 6 default Server side full text search (headers and body) can be enabled (to use the extra space) Mobile data friendly: IMAPS connections are compressed Subaddress (+tag) support, to filter and monitor email addresses Virtual domains, aliases, and credentials in files, Berkeley DB, or SQLite3 Naive Bayes rspamd filtering with supervised learning: the lowest false positive spam detection rates Carefree automated Spam/ and Trash/ cleaning service (default: older than 30 days) Automated quota management, gently assists when over quota Easy backup MX setup: using the same configuration, install in minutes on a different host Worry-free automated master/master replication with backup MX, prevents accidental loss of email messages Resilient: the backup MX can be used as primary, even when the primary is not down, both perfect replicas Flexible: switching roles is easy, making the process of changing VPS hosts a breeze (no downtime) DMARC (with DKIM and SPF) email-validation system, to detect and prevent email spoofing Daily (spartan) stats, to keep track of things Your sieve scripts and managesieve configuration, let's get started Considerations By design, email message headers need to be public, for exchanges to happen. The body of the message can be encrypted by the user, if desired. Moreover, there is no way to prevent the host from having access to the virtual machine. Therefore, full disk encryption (at rest) may not be necessary. Given our low memory requirements, and the single-purpose concept of email service, Roundcube or other web-based IMAP email clients should be on a different VPS. Antivirus software users (usually) have the service running on their devices. ClamAV can easily be incorporated into this configuration, if affected by the types of malware it protects against, but will require around 1GB additional RAM (or another VPS). Every email message is important, if properly delivered, for Bayes classification. At least 200 ham and 200 spam messages are required to learn what one considers junk. By default (change to use case), a rspamd score above 50% will send the message to Spam/. Moving messages in and out of Spam/ changes this score. After 95%, the message is flagged as "seen" and can be safely ignored. Spamd is effective at greylisting and stopping high volume spam, if it becomes a problem. It will be an option when IPv6 is supported, along with bgp-spamd. System mail is delivered to an alias mapped to a virtual user served by the service. This way, messages are guaranteed to be delivered via encrypted connection. It is not possible for real users to alias, nor mail an external mail address with the default configuration. e.g. puffy@mercury.example.com is wheel, with an alias mapped to (virtual) puffy@example.com, and user (puffy) can be different for each. Interview - Ryan Zezeski - rpz@joyent.com (mailto:rpz@joyent.com) / @rzezeski (https://twitter.com/rzezeski) News Roundup John Carmack's programming retreat to hermit coding with OpenBSD (https://www.facebook.com/permalink.php?story_fbid=2110408722526967&id=100006735798590) After a several year gap, I finally took another week-long programming retreat, where I could work in hermit mode, away from the normal press of work. My wife has been generously offering it to me the last few years, but I'm generally bad at taking vacations from work. As a change of pace from my current Oculus work, I wanted to write some from-scratch-in-C++ neural network implementations, and I wanted to do it with a strictly base OpenBSD system. Someone remarked that is a pretty random pairing, but it worked out ok. Despite not having actually used it, I have always been fond of the idea of OpenBSD — a relatively minimal and opinionated system with a cohesive vision and an emphasis on quality and craftsmanship. Linux is a lot of things, but cohesive isn't one of them. I'm not a Unix geek. I get around ok, but I am most comfortable developing in Visual Studio on Windows. I thought a week of full immersion work in the old school Unix style would be interesting, even if it meant working at a slower pace. It was sort of an adventure in retro computing — this was fvwm and vi. Not vim, actual BSD vi. In the end, I didn't really explore the system all that much, with 95% of my time in just the basic vi / make / gdb operations. I appreciated the good man pages, as I tried to do everything within the self contained system, without resorting to internet searches. Seeing references to 30+ year old things like Tektronix terminals was amusing. I was a little surprised that the C++ support wasn't very good. G++ didn't support C++11, and LLVM C++ didn't play nicely with gdb. Gdb crashed on me a lot as well, I suspect due to C++ issues. I know you can get more recent versions through ports, but I stuck with using the base system. In hindsight, I should have just gone full retro and done everything in ANSI C. I do have plenty of days where, like many older programmers, I think “Maybe C++ isn't as much of a net positive as we assume...”. There is still much that I like, but it isn't a hardship for me to build small projects in plain C. Maybe next time I do this I will try to go full emacs, another major culture that I don't have much exposure to. I have a decent overview understanding of most machine learning algorithms, and I have done some linear classifier and decision tree work, but for some reason I have avoided neural networks. On some level, I suspect that Deep Learning being so trendy tweaked a little bit of contrarian in me, and I still have a little bit of a reflexive bias against “throw everything at the NN and let it sort it out!” In the spirit of my retro theme, I had printed out several of Yann LeCun's old papers and was considering doing everything completely off line, as if I was actually in a mountain cabin somewhere, but I wound up watching a lot of the Stanford CS231N lectures on YouTube, and found them really valuable. Watching lecture videos is something that I very rarely do — it is normally hard for me to feel the time is justified, but on retreat it was great! I don't think I have anything particularly insightful to add about neural networks, but it was a very productive week for me, solidifying “book knowledge” into real experience. I used a common pattern for me: get first results with hacky code, then write a brand new and clean implementation with the lessons learned, so they both exist and can be cross checked. I initially got backprop wrong both times, comparison with numerical differentiation was critical! It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made. I was pretty happy with my multi-layer neural net code; it wound up in a form that I can just drop it into future efforts. Yes, for anything serious I should use an established library, but there are a lot of times when just having a single .cpp and .h file that you wrote ever line of is convenient. My conv net code just got to the hacky but working phase, I could have used another day or two to make a clean and flexible implementation. One thing I found interesting was that when testing on MNIST with my initial NN before adding any convolutions, I was getting significantly better results than the non-convolutional NN reported for comparison in LeCun ‘98 — right around 2% error on the test set with a single 100 node hidden layer, versus 3% for both wider and deeper nets back then. I attribute this to the modern best practices —ReLU, Softmax, and better initialization. This is one of the most fascinating things about NN work — it is all so simple, and the breakthrough advances are often things that can be expressed with just a few lines of code. It feels like there are some similarities with ray tracing in the graphics world, where you can implement a physically based light transport ray tracer quite quickly, and produce state of the art images if you have the data and enough runtime patience. I got a much better gut-level understanding of overtraining / generalization / regularization by exploring a bunch of training parameters. On the last night before I had to head home, I froze the architecture and just played with hyperparameters. “Training!” Is definitely worse than “Compiling!” for staying focused. Now I get to keep my eyes open for a work opportunity to use the new skills! I am dreading what my email and workspace are going to look like when I get into the office tomorrow. Stack-register Checking (https://undeadly.org/cgi?action=article;sid=20180310000858) Recently, Theo de Raadt (deraadt@) described a new type of mitigation he has been working on together with Stefan Kempf (stefan@): How about we add another new permission! This is not a hardware permission, but a software permission. It is opportunistically enforced by the kernel. The permission is MAP_STACK. If you want to use memory as a stack, you must mmap it with that flag bit. The kernel does so automatically for the stack region of a process's stack. Two other types of stack occur: thread stacks, and alternate signal stacks. Those are handled in clever ways. When a system call happens, we check if the stack-pointer register points to such a page. If it doesn't, the program is killed. We have tightened the ABI. You may no longer point your stack register at non-stack memory. You'll be killed. This checking code is MI, so it works for all platforms. For more detail, see Theo's original message (https://marc.info/?l=openbsd-tech&m=152035796722258&w=2). This is now available in snapshots, and people are finding the first problems in the ports tree already. So far, few issues have been uncovered, but as Theo points out, more testing is necessary: Fairly good results. A total of 4 problems have been found so far. go, SBCL, and two cases in src/regress which failed the new page-alignment requirement. The SBCL and go ones were found at buildtime, since they use themselves to complete build. But more page-alignment violations may be found in ports at runtime. This is something I worry about a bit. So please everyone out there can help: Use snapshots which contain the stack-check diff, update to new packages, and test all possible packages. Really need a lot of testing for this, so please help out. So, everybody, install the latest snapshot and try all your favorite ports. This is the time to report issues you find, so there is a good chance this additional security feature is present in 6.3 (and works with third party software from packages). NomadBSD 1.0 has been released (https://freeshell.de/~mk/projects/nomadbsd.html) NomadBSD is a live system for flash drives, based on FreeBSD® 11.1 (amd64) Change Log The setup process has been improved. Support for optional geli encryption of the home partition has been added Auto-detection of NVIDIA graphics cards and their corresponding driver has been added. (Thanks to holgerw and lme from BSDForen.de) An rc script to start the GEOM disk scheduler on the root device has been added. More software has been added: accessibility/redshift (starts automatically) audio/cantata audio/musicpd audio/ncmpc ftp/filezilla games/bsdtris mail/neomutt math/galculator net-p2p/transmission-qt5 security/fpm2 sysutils/bsdstats x11/metalock x11/xbindkeys Several smaller improvements and bugfixes. Screenshots https://freeshell.de/~mk/projects/nomadbsd-ss1.png https://freeshell.de/~mk/projects/nomadbsd-ss2.png https://freeshell.de/~mk/projects/nomadbsd-ss3.png https://freeshell.de/~mk/projects/nomadbsd-ss4.png https://freeshell.de/~mk/projects/nomadbsd-ss5.png https://freeshell.de/~mk/projects/nomadbsd-ss6.png Beastie Bits KnoxBug - Nagios (http://knoxbug.org/2018-03-27) vBSDcon videos landing (https://www.youtube.com/playlist?list=PLfJr0tWo35bc9FG_reSki2S5S0G8imqB4) AsiaBSDCon 2017 videos (https://www.youtube.com/playlist?list=PLnTFqpZk5ebBTyXedudGm6CwedJGsE2Py) DragonFlyBSD Adds New "Ptr_Restrict" Security Option (https://www.phoronix.com/scan.php?page=news_item&px=DragonFlyBSD-Ptr-Restrict) A Dexter needs your help (https://twitter.com/michaeldexter/status/975603855407788032) Mike Larkin at bhyvecon 2018: OpenBSD vmm(4) update (https://undeadly.org/cgi?action=article;sid=20180309064801) [HEADS UP] - OFED/RDMA stack update (https://lists.freebsd.org/pipermail/freebsd-arch/2018-March/018900.html) *** Feedback/Questions Ron - Interview someone using DragonflyBSD (http://dpaste.com/3BM6GSW#wrap) Brad - Gaming and all (http://dpaste.com/3X4ZZK2#wrap) Mohammad - Sockets vs TCP (http://dpaste.com/0PJMKRD#wrap) Paul - All or at least most of Bryan Cantrill's Talks (http://dpaste.com/2WXVR1X#wrap) ***
안녕하세요. 오늘은 지난 시간에 이어, 몇 년사이에 큰 이슈가 되고 있는 인공지능 딥러닝에 대한 이야기를 간략히 소개해 보겠습니다. 이 내용은 BIM학회에 '인공지능 딥러닝 기술 동향 및 구현 사례'로 소개된 글의 일부 내용입니다. CNN, RNN, LSTM, GAN과 같은 대표적인 딥러닝 신경망 종류 및 3차원 스캔과 관련된 CNN활용 사례 등을 소개합니다.No.70 Podcast 방송 - 인공지능 딥러닝 기술 소개 및 사례소개된 부분에 대한 좀 더 상세한 내용은 다음 레퍼런스에서 확인할 수 있습니다. 1. Daddy maker, 텐서플로우 최신버전 설치 및 개념2. Daddy maker, 텐서플로우 MNIST 딥러닝 및 텐서보드 그래프 가시화3. BIM principle, 2017, 페이스북 딥러닝 CNN 기반 오픈소스 번역 기술 CSSL4. 딥러닝 Hello World - MNIST, CIFAR-10 데이터베이스 구조와 이미지넷5. Konstantin Lackner, 2016, Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks, Institute for DataProcessing Technische Universitat Munchen6. Patrick Hebron, 2016, Unsupervised Learning, patrickhebron.com7. Alec Radford & Luke Metz, Soumith Chintala, 2016, Unsupervised Representation Learning with Deep Convolutional GenerativeAdversarial Network8. Adam Santoro외, 2017, A simple neural network module for relational reasoning, DeepMind9. Microsoft, Going Deep: Convolutional Neural Networks10. The Asimov Institute, 2016, THE NEURAL NETWORK ZOO11. Adit Deshpande, 2016, The 9 Deep Learning Papers You Need To Know About CNN12. Adrian Rosebrock, 2016, My Top 9 Favorite Python Deep Learning Libraries13. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, 2017.5, A novel approach to neural machine translation,FAIR(Facebook AI Research Group)14. Adit Deshpande, A Beginner's Guide To Understanding Convolutional Neural Networks, UCLA15. Rob Verger, 2017.5, Facebook created a faster, more accurate translation system using artificial intelligence, Popular Science16. James Vincent, 2017.5.9, Facebook says its prototype translation technique is nine times faster than rivals, THE VERGE17. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, 2017.5, Convolutional Sequence to Sequence Learning,FAIR(Facebook AI Research Group)18. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, 2016, PointNet: Deep Learning on Point Sets for 3D Classification andSegmentation19. Sepp H., Jurgen S, 1997, Long short-term memory, Neural Computation. 9 (8) ,pp.1735–178020. Goodfellow. Ian J, Pouget-Abadie. J, Mirza. M, Xu. Bing, Warde-Farley. David, Ozair. Sherjil, Courville. Aaron, Bengio. Yoshua, 2014,Generative Adversarial Networks
Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2020.04.18.046235v1?rss=1 Authors: Kori, A. Abstract: This paper is concerned with the theoretical investigation of game theory concepts in analyzing the behavior of dynamically coupled oscillators. Here, we claim that the coupling strength in any neuronal oscillators can be modeled as a game. We formulate the game to describe the effect of pure-strategy Nash equilibrium on two neuron systems of Hopf-oscillator and later demonstrate the application of the same assumptions and methods to N x N neuronal sheet. We also demonstrate the effect of the proposed method on MNIST data to show the equilibrium behavior of neurons in a N x N neuronal grid for all different digits. A significant outcome of the paper is a modified Hebbian algorithm, which adapts the coupling weights to neural potential resulting in a stable phase difference. Which in turn, makes it possible for an individual neuron to encode input information. Copy rights belong to original authors. Visit the link for more info