Podcasts about mnist

Play Episode Listen Later Dec 15, 2023 12:55

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Current AIs Provide Nearly No Data Relevant to AGI Alignment, published by Thane Ruthenis on December 15, 2023 on The AI Alignment Forum. Recently, there's been a fair amount of pushback on the "canonical" views towards the difficulty of AGI Alignment (the views I call the "least forgiving" take). Said pushback is based on empirical studies of how the most powerful AIs at our disposal currently work, and is supported by fairly convincing theoretical basis of its own. By comparison, the "canonical" takes are almost purely theoretical. At a glance, not updating away from them in the face of ground-truth empirical evidence is a failure of rationality: entrenched beliefs fortified by rationalizations. I believe this is invalid, and that the two views are much more compatible than might seem. I think the issue lies in the mismatch between their subject matters. It's clearer if you taboo the word "AI": The "canonical" views are concerned with scarily powerful artificial agents: with systems that are human-like in their ability to model the world and take consequentialist actions in it, but inhuman in their processing power and in their value systems. The novel views are concerned with the systems generated by any process broadly encompassed by the current ML training paradigm. It is not at all obvious that they're one and the same. Indeed, I would say that to claim that the two classes of systems overlap is to make a very strong statement regarding how cognition and intelligence work. A statement we do not have much empirical evidence on, but which often gets unknowingly, implicitly snuck-in when people extrapolate findings from LLM studies to superintelligences. It's an easy mistake to make: both things are called "AI", after all. But you wouldn't study manually-written FPS bots circa 2000s, or MNIST-classifier CNNs circa 2010s, and claim that your findings generalize to how LLMs circa 2020s work. By the same token, LLM findings do not necessarily generalize to AGI. What the Fuss Is All About To start off, let's consider where all the concerns about the AGI Omnicide Risk came from in the first place. Consider humans. Some facts: Humans posses an outstanding ability to steer the world towards their goals, and that ability grows sharply with their "intelligence". Sure, there are specific talents, and "idiot savants". But broadly, there does seem to be a single variable that mediates a human's competence in all domains. An IQ 140 human would dramatically outperform an IQ 90 human at basically any cognitive task, and crucially, be much better at achieving their real-life goals. Humans have the ability to plot against and deceive others. That ability grows fast with their g-factor. A brilliant social manipulator can quickly maneuver their way into having power over millions of people, out-plotting and dispatching even those that are actively trying to stop them or compete with them. Human values are complex and fragile, and the process of moral philosophy is more complex still. Humans often arrive at weird conclusions that don't neatly correspond to their innate instincts or basic values. Intricate moral frameworks, weird bullet-biting philosophies, and even essentially-arbitrary ideologies like cults. And when people with different values interact... People who differ in their values even just a bit are often vicious, bitter enemies. Consider the history of heresies, or of long-standing political rifts between factions that are essentially indistinguishable from the outside. People whose cultures evolved in mutual isolation often don't even view each other as human. Consider the history of xenophobia, colonization, culture shocks. So, we have an existence proof of systems able to powerfully steer the world towards their goals. Some of these system can be strictly more powerfu...

ai data current cnn human humans speech alignment relevant iq ea ml llm fps agi thane intricate rationalist mnist

AF - My Criticism of Singular Learning Theory by Joar Skalse

Play Episode Listen Later Nov 19, 2023 20:01

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Criticism of Singular Learning Theory, published by Joar Skalse on November 19, 2023 on The AI Alignment Forum. In this post, I will briefly give my criticism of Singular Learning Theory (SLT), and explain why I am skeptical of its significance. I will especially focus on the question of generalisation --- I do not believe that SLT offers any explanation of generalisation in neural networks. I will also briefly mention some of my other criticisms of SLT, describe some alternative solutions to the problems that SLT aims to tackle, and describe some related research problems which I would be more excited about. (I have been meaning to write this for almost 6 months now, since I attended the SLT workshop last June, but things have kept coming in the way.) For an overview of SLT, see this sequence. This post will also refer to the results described in this post, and will also occasionally touch on VC theory. However, I have tried to make it mostly self-contained. The Mystery of Generalisation First of all, what is the mystery of generalisation? The issue is this; neural networks are highly expressive, and typically overparameterised. In particular, when a real-world neural network is trained on a real-world dataset, it is typically the case that this network is able to express many functions which would fit the training data well, but which would generalise poorly. Moreover, among all functions which do fit the training data, there are more functions (by number) that generalise poorly, than functions that generalise well. And yet neural networks will typically find functions that generalise well. To make this point more intuitive, suppose we have a 500,000-degree polynomial, and that we fit this to 50,000 data points. In this case, we have 450,000 degrees of freedom, and we should by default expect to end up with a function which generalises very poorly. But when we train a neural network with 500,000 parameters on 50,000 MNIST images, we end up with a neural network that generalises well. Moreover, adding more parameters to the neural network will typically make generalisation better, whereas adding more parameters to the polynomial is likely to make generalisation worse. A simple hypothesis might be that some of the parameters in a neural network are redundant, so that even if it has 500,000 parameters, the dimensionality of the space of all functions which it can express is still less than 500,000. This is true. However, the magnitude of this effect is too small to solve the puzzle. If you get the MNIST training set, and assign random labels to the test data, and then try to fit the network to this function, you will find that this often can be done. This means that while neural networks have redundant parameters, they are still able to express more functions which generalise poorly, than functions which generalise well. Hence the puzzle. The anwer to this puzzle must be that neural networks have an inductive bias towards low-complexity functions. That is, among all functions which fit a given training set, neural networks are more likely to find a low-complexity function (and such functions are more likely to generalise well, as per Occam's Razor). The next question is where this inductive bias comes from, and how it works. Understanding this would let us better understand and predict the behaviour of neural networks, which would be very useful for AI alignment. I should also mention that generalisation only is mysterious when we have an amount of training data that is small relative to the overall expressivity of the learning machine. Classical statistical learning theory already tells us that any sufficiently well-behaved learning machine will generalise well in the limit of infinite training data. For an overview of these results, see this post. Thus, the quest...

ai mystery theory speech criticism vc ea classical razor singular occam slt learning theories rationalist joar mnist

LW - Estimating effective dimensionality of MNIST models by Arjun Panickssery

Play Episode Listen Later Nov 2, 2023 2:09

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Estimating effective dimensionality of MNIST models, published by Arjun Panickssery on November 2, 2023 on LessWrong. The local learning coefficent λ is a measure of a model's "effective dimensionality" that captures its complexity (more background here ). Lau et al recently described a sampling method (SGLD) using noisy gradients to find a stochastic estimate ^ λ in a computationally tractable way (good explanation here ): I present results ( Github repo ) of the "Task Variability" project suggested on the DevInterp project list . To see how degeneracy scales with task difficulty and model class, I trained a fully-connected MLP and a CNN (both with ~120k parameters) on nine MNIST variants with different subsets of the labels (just the labels {0, 1}, then {0, 1, 2}, etc.). All models were trained to convergence using the same number of training data points. I implement Lau et al's algorithm on each of the trained models. The results below are averaged over three runs: The results for the full MNIST dataset are comparable to Lau et al's results while using models ten times smaller trained for ten times fewer epochs. The sampling method is finicky and sensitive to the hyperparameter choices of learning rate and noise factor. It will fail by producing negative or very high ^ λ values if the noise ϵ and the distance penalty γ (see lines 6 and 7 in the pseudocode above) isn't calibrated to the model's level of convergence. The results show a linear scaling law relating number of labels to task complexity. The CNN typically has a lower ^ λ than the MLP, which matches intuitions that some of the complexity is "stored" in the architecture because the convolutions apply a useful prior on functions good at solving image recognition tasks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

cnn speech models ea github estimating lau mlp rationalist dimensionality lesswrong mnist

LW - Estimating effective dimensionality of MNIST models by Arjun Panickssery

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Nov 2, 2023 2:09

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Estimating effective dimensionality of MNIST models, published by Arjun Panickssery on November 2, 2023 on LessWrong. The local learning coefficent λ is a measure of a model's "effective dimensionality" that captures its complexity (more background here ). Lau et al recently described a sampling method (SGLD) using noisy gradients to find a stochastic estimate ^ λ in a computationally tractable way (good explanation here ): I present results ( Github repo ) of the "Task Variability" project suggested on the DevInterp project list . To see how degeneracy scales with task difficulty and model class, I trained a fully-connected MLP and a CNN (both with ~120k parameters) on nine MNIST variants with different subsets of the labels (just the labels {0, 1}, then {0, 1, 2}, etc.). All models were trained to convergence using the same number of training data points. I implement Lau et al's algorithm on each of the trained models. The results below are averaged over three runs: The results for the full MNIST dataset are comparable to Lau et al's results while using models ten times smaller trained for ten times fewer epochs. The sampling method is finicky and sensitive to the hyperparameter choices of learning rate and noise factor. It will fail by producing negative or very high ^ λ values if the noise ϵ and the distance penalty γ (see lines 6 and 7 in the pseudocode above) isn't calibrated to the model's level of convergence. The results show a linear scaling law relating number of labels to task complexity. The CNN typically has a lower ^ λ than the MLP, which matches intuitions that some of the complexity is "stored" in the architecture because the convolutions apply a useful prior on functions good at solving image recognition tasks. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

cnn speech models ea github estimating lau mlp rationalist dimensionality lesswrong mnist

September 20th, 2023 | DALL·E 3

Hacker News Recap

Play Episode Listen Later Sep 21, 2023 20:01

This is a recap of the top 10 posts on Hacker News on September 20th, 2023.This podcast was generated by wondercraft.ai(00:37): Car allergic to vanilla ice cream (2000)Original post: https://news.ycombinator.com/item?id=37584399&utm_source=wondercraft_ai(02:23): DALL·E 3Original post: https://news.ycombinator.com/item?id=37586900&utm_source=wondercraft_ai(04:14): My uBlock Origin filters to remove distractionsOriginal post: https://news.ycombinator.com/item?id=37584134&utm_source=wondercraft_ai(05:56): OpenTF is now OpenTofuOriginal post: https://news.ycombinator.com/item?id=37581132&utm_source=wondercraft_ai(07:50): We have successfully completed our migration to RAM-only VPN infrastructureOriginal post: https://news.ycombinator.com/item?id=37581485&utm_source=wondercraft_ai(09:40): 78% MNIST accuracy using GZIP in under 10 lines of codeOriginal post: https://news.ycombinator.com/item?id=37583593&utm_source=wondercraft_ai(11:20): Toyota Research claims breakthrough in teaching robots new behaviorsOriginal post: https://news.ycombinator.com/item?id=37586264&utm_source=wondercraft_ai(13:36): British journalist held by police at Luton airport for five hours without arrestOriginal post: https://news.ycombinator.com/item?id=37584038&utm_source=wondercraft_ai(15:31): Gokrazy is coolOriginal post: https://news.ycombinator.com/item?id=37583234&utm_source=wondercraft_ai(17:30): ‘Less than half' fresh produce sold globally makes any profitOriginal post: https://news.ycombinator.com/item?id=37581316&utm_source=wondercraft_aiThis is a third-party project, independent from HN and YC. Text and audio generated using AI, by wondercraft.ai. Create your own studio quality podcast with text as the only input in seconds at app.wondercraft.ai. Issues or feedback? We'd love to hear from you: team@wondercraft.ai

ai british original car ram dall luton yc hn hacker news mnist

Doing it the Hard Way: Making the AI engine and language

Play Episode Listen Later Sep 14, 2023 89:22

Want to help define the AI Engineer stack? Have opinions on the top tools, communities and builders? We're collaborating with friends at Amplify to launch the first State of AI Engineering survey! Please fill it out (and tell your friends)!If AI is so important, why is its software so bad?This was the motivating question for Chris Lattner as he reconnected with his product counterpart on Tensorflow, Tim Davis, and started working on a modular solution to the problem of sprawling, monolithic, fragmented platforms in AI development. They announced a $30m seed in 2022 and, following their successful double launch of Modular/Mojo

ceo president ai google starting apple future state design phd goals performance dna microsoft iphone language cnn tesla attention fall in love tree humans matrix discord android origins nerds switzerland mac lego ios ipads windows intel senior director cto actors swift siri openai load residence nvidia rust hardware api engine generally learnings cs ads prom mojo gpt python ui ml linux enabling llama amplify autopilot automatic amd macbook guido macos underneath llm hard way macs cpu gpu flute google cloud modular docker walrus gpus tf simplest alessio speculative satya rl playgrounds cpus gcp proms clippy dsl cpp hpc tensorflow cuda keras xcode chris do caffe compiler risc v smol pytorch distinguished engineer risc tim davis objective c google brain clang intel cpus product engineering tpu jupyter notebooks cutlass jeremy howard swift playgrounds a100 imagenet pypi llvm andrej karpathy graviton halide amx chris lattner tvm alerted numpy ai engineer george hotz scott forstall chris ray christ here chris yeah chris you sagemaker tabular cisc resnet chris no chris well chris oh chris so chris one xla pypy mnist chris right chris they wkb mlir cython

Burstprop for Learning in Spiking Neuromorphic Hardware

Play Episode Listen Later Jul 28, 2023

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.25.550525v1?rss=1 Authors: Stuck, M., Naud, R. Abstract: The need for energy-efficient solutions in Deep Neural Network (DNN) applications has led to a growing interest in Spiking Neural Networks (SNNs) implemented in neuromorphic hardware. The Burstprop algorithm enables online and local learning in hierarchical networks, and therefore can potentially be implemented in neuromorphic hardware. This work presents an adaptation of the algorithm for training hierarchical SNNs on MNIST. Our implementation requires an order of magnitude fewer neurons than the previous ones. While Burstprop outperforms Spike-timing dependent plasticity (STDP), it falls short compared to training with backpropagation through time (BPTT). This work establishes a foundation for further improvements in the Burstprop algorithm, developing such algorithms is essential for achieving energy-efficient machine learning in neuromorphic hardware. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

learning llc copy spike hardware spiking naud biorxiv neuromorphic mnist

Learning Neural Representations in Task-Switching Guided by Context Biases

Play Episode Listen Later Jul 26, 2023

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.24.550365v1?rss=1 Authors: Galella, S., Ardid, S. Abstract: The ability to switch between tasks effectively in response to external stimuli is a hallmark of cognitive control. Our brain can filter and integrate external information to accomplish goal-directed behavior. Task switching occurs rapidly and efficiently, allowing us to perform multiple tasks with ease. Similarly, artificial neural networks can be tailored to exhibit multi-task capabilities and achieve high performance across domains. In terms of explainability, understanding how neural networks make predictions is crucial for their deployment in many real-world scenarios. In this study, we delve into neural representations learned by task-switching networks, which use task-specific bias for multitasking. Task-specific biases, mediated by context inputs, are learned by alternating the tasks the neural network learns during training. By using the MNIST dataset and binary tasks, we find that task-switching networks produce representations that resemble other multitasking paradigms, namely parallel networks in the early stages of processing and sequential networks in the last stages, respectively. We analyze the importance of inserting task contexts in different stages of processing and its role in aligning the task with relevant features. Moreover, we visualize how networks generalize neural representations during task-switching for different tasks. The use of context inputs improves the interpretability of simple neural networks for multitasking, helping to pave the way for the future study of architectures and tasks of higher complexity. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

learning llc context tasks guided switching copy biases neural representations biorxiv mnist

AF - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope

Play Episode Listen Later Jul 23, 2023 16:29

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on The AI Alignment Forum. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking l...

power training humans speech pope studying ea ml hypothesis lightspeed quintin evocative rationalist grokking mnist

LW - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope

Play Episode Listen Later Jul 23, 2023 16:28

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on LessWrong. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking longer about a...

power training humans speech pope studying ea ml hypothesis lightspeed quintin evocative rationalist lesswrong grokking mnist

LW - QAPR 5: grokking is maybe not that big a deal? by Quintin Pope

Play Episode Listen Later Jul 23, 2023 16:28

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on LessWrong. [Thanks to support from Cavendish Labs and a Lightspeed grant, .I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "Grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5% - 20% of the training process, though it appears much more abrupt due to the use of a base 10 logarithmic scale on the x-axis. "Grokking" also brings connotations of insight, realization or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world. I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking longer about a...

power training humans speech pope studying ea ml hypothesis lightspeed quintin evocative rationalist lesswrong grokking mnist

AF - Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 by Stefan Heimersheim

spoilers challenges cnn speech ea intervention transformer marius tl gpu hackathons pca mechanistic rationalist interpretability gb ram convolutional mnist

Play Episode Listen Later May 9, 2023 17:20

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by Stefan Heimersheim on May 9, 2023 on The AI Alignment Forum. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points through...

LW - Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 by StefanHex

spoilers challenges cnn speech ea intervention transformer marius tl gpu hackathons pca mechanistic rationalist interpretability gb ram lesswrong convolutional mnist

Play Episode Listen Later May 9, 2023 17:19

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by StefanHex on May 9, 2023 on LessWrong. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points throughout the analysis. PCA ...

LW - Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1 by StefanHex

spoilers challenges cnn speech ea intervention transformer marius tl gpu hackathons pca mechanistic rationalist interpretability gb ram lesswrong convolutional mnist

Play Episode Listen Later May 9, 2023 17:19

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1, published by StefanHex on May 9, 2023 on LessWrong. We solved the first Mechanistic Interpretability challenge that Stephen Casper posed in EIS VII. We spent the last Alignment Jam hackathon attempting to solve the two challenges presented there, and present our (confirmed) solution to the CNN challenge here. We will present a write-up of our work on the Transformer challenge in a future post.Stefan and Marius submitted an early version of this work at the end of the Hackathon, and Stefan added Intervention and Causal Scrubbing tests to the final write-up. A notebook reproducing all results is provided here (requires no GPU but ~13 GB RAM). The challenges each provide a pre-trained network, and the task is to reverse engineer the network as well as to infer the labeling function used for training. The first challenge network is a MNIST CNN that takes MNIST images and outputs labels. The hints given are that [1] the labels are binary, [2] the test set accuracy is 95.58%, [3] that the (secret) labeling function is simple, and [4] this image: The MNIST network consists of 2 Convolutional layers (Conv -> ReLU -> Dropout -> Pool)x2 2 Fully connected layers (fc1[400,200] -> ReLU -> fc2[200,2]) and we can access the data (torchvision.datasets.MNIST) but not the ground truth labels. Spoilers ahead! Summary of our solution (TL,DR) The inputs are labelled based on similarity with a 1 versus similarity with an inverted 1 ("anti-1"). If the difference is large (either clearly 1 or clearly anti-1) the image is labeled as class 0, otherwise the image is labeled as class 0. Specifically, the template for 1 seems to be is the given hint (clue_image), and the "anti-1" is 1-clue_image. The similarity is measured as the sum over the element-wise product of the image matrices (or equivalently dot product of flattened image arrays).Then the the ~17k most similar images to “1” and ~14k most similar images to “anti 1” are labelled class 1, the remaining ~29k images are labelled class 0. We can also phrase this as a band filter for similarity with (clue_image - 0.5), defining class 0 as where -17305 < (image (clue_image - 0.5)).sum() < -7762.We can observe this internally by looking at the embedding in the 200-neurons space, a PCA decomposition colored by label shows how the model judges this similarity: The model internally implements this via two groups of feature detectors (in the 200-dimensional neuron layer). These are “1-detectors” (detecting clue_image) and “anti-1-detectors” (detecting 1-clue_image). If either group fires sufficiently strongly, the model is classified as class 1, otherwise class 0. This classification has 96.2% overlap with model labels. Since the model itself has only 95.6% accuracy on the test set we think the difference is plausibly due to model error. We test our hypothesis with Causal Scrubbing. Specifically we test our hypothesis that of the 200 neurons there are 48 neurons detecting "1" similarity, 31 neurons detecting "anti-1" similarity, and 121 dead (useless) neurons. We resample-ablate all neurons by replacing each neurons activation with its activation on a different dataset example where our hypothesis predicts it to have a similar activation. Part 1: How we found this solution Since the model prediction depends on the logit (final neuron) difference we can directly say that in the final layer we only care about the logit of label 1 minus the logit of label 0, the logit diff direction. Then, using the [200,2] fc2 weight matrix (biases irrelevant) we can directly translate this logit diff direction into the 200-dim neuron space (by taking the difference of the respective weight vectors). We will make use of the logit diff directions at multiple points throughout the analysis. PCA ...

LW - Exploring the Lottery Ticket Hypothesis by Rauno Arike

Play Episode Listen Later Apr 26, 2023 18:56

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploring the Lottery Ticket Hypothesis, published by Rauno Arike on April 25, 2023 on LessWrong. I have recently been fascinated by the breadth of important mysteries in deep learning, including deep double descent and phase changes, that could be explained by a curious conjectured property of neural networks called the lottery ticket hypothesis. Despite this explanatory potential, however, I haven't seen much discussion about the evidence behind and the implications of this hypothesis in the alignment community. Being confused about these things motivated me to conduct my own survey of the phenomenon, which resulted in this post. The Lottery Ticket Hypothesis, explained in one minute The lottery ticket hypothesis (LTH) was originally proposed in a paper by Frankle and Carbin (2018): A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The authors call such subnetworks "winning lottery tickets". As the simplest example, they train a LeNet-300-100 model on the MNIST dataset and report that a network containing only 21.1% of the weights of the dense version reaches a higher test accuracy on less training iterations, while a network where only 3.6% of the weights remain performs almost identically to the dense network of the network. The lottery ticket hypothesis extends a long line of work in neural network pruning, a technique proposed in as early as 1990 by LeCun et al. Pruning simply means deleting some fraction of unimportant weights from the network after training to make inference more efficient. The vital insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training to make both training and inference more efficient. In practice, the method that Frankle and Carbin used for finding winning tickets didn't yet eliminate the need to train the full network, instead just suggesting the possibility. The technique they used is iterative pruning, a procedure that roughly looks as follows: Train the full dense network on some classification task Prune out some fraction of the weights with the smallest magnitude Reinitialize the remaining weights to their original values Repeat the same procedure for a number of times This is computationally quite expensive, but as we'll see below, alternative approaches have been proposed later on. The paper also defined a stronger version of the hypothesis that they named the lottery ticket conjecture: We extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket. In the next section, I'll argue that making a distinction between the hypothesis and the conjecture appears to be quite important. Relevance to alignment Phase changes In his post about mechanistically interpreting grokking, Neel Nanda argues that the lottery ticket hypothesis may constitute the reason for why neural networks form sophisticated circuits. One may naively think that neural networks are optimized in a similar way to how linear regression classifiers are optimized: each weight slowly changes in a direction that marginally improves performance, and the result of these tiny individual improvements smoothly improve the performance of the entire ensemble. In practice, though, we observe the forming of sophisticated circuits like the induction circuit, which is comprised of a previous token head and an induction head. Either of those heads improves loss only in the case the other ...

train speech exploring ea relevance pruning hypothesis dense lottery tickets l th rationalist sgd neel nanda lesswrong carbin arike mnist

LW - Exploring the Lottery Ticket Hypothesis by Rauno Arike

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 26, 2023 18:56

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploring the Lottery Ticket Hypothesis, published by Rauno Arike on April 25, 2023 on LessWrong. I have recently been fascinated by the breadth of important mysteries in deep learning, including deep double descent and phase changes, that could be explained by a curious conjectured property of neural networks called the lottery ticket hypothesis. Despite this explanatory potential, however, I haven't seen much discussion about the evidence behind and the implications of this hypothesis in the alignment community. Being confused about these things motivated me to conduct my own survey of the phenomenon, which resulted in this post. The Lottery Ticket Hypothesis, explained in one minute The lottery ticket hypothesis (LTH) was originally proposed in a paper by Frankle and Carbin (2018): A randomly-initialized, dense neural network contains a subnetwork that is initialized such that—when trained in isolation—it can match the test accuracy of the original network after training for at most the same number of iterations. The authors call such subnetworks "winning lottery tickets". As the simplest example, they train a LeNet-300-100 model on the MNIST dataset and report that a network containing only 21.1% of the weights of the dense version reaches a higher test accuracy on less training iterations, while a network where only 3.6% of the weights remain performs almost identically to the dense network of the network. The lottery ticket hypothesis extends a long line of work in neural network pruning, a technique proposed in as early as 1990 by LeCun et al. Pruning simply means deleting some fraction of unimportant weights from the network after training to make inference more efficient. The vital insight of the lottery ticket hypothesis paper is that it may also be possible to prune the network before training to make both training and inference more efficient. In practice, the method that Frankle and Carbin used for finding winning tickets didn't yet eliminate the need to train the full network, instead just suggesting the possibility. The technique they used is iterative pruning, a procedure that roughly looks as follows: Train the full dense network on some classification task Prune out some fraction of the weights with the smallest magnitude Reinitialize the remaining weights to their original values Repeat the same procedure for a number of times This is computationally quite expensive, but as we'll see below, alternative approaches have been proposed later on. The paper also defined a stronger version of the hypothesis that they named the lottery ticket conjecture: We extend our hypothesis into an untested conjecture that SGD seeks out and trains a subset of well-initialized weights. Dense, randomly-initialized networks are easier to train than the sparse networks that result from pruning because there are more possible subnetworks from which training might recover a winning ticket. In the next section, I'll argue that making a distinction between the hypothesis and the conjecture appears to be quite important. Relevance to alignment Phase changes In his post about mechanistically interpreting grokking, Neel Nanda argues that the lottery ticket hypothesis may constitute the reason for why neural networks form sophisticated circuits. One may naively think that neural networks are optimized in a similar way to how linear regression classifiers are optimized: each weight slowly changes in a direction that marginally improves performance, and the result of these tiny individual improvements smoothly improve the performance of the entire ensemble. In practice, though, we observe the forming of sophisticated circuits like the induction circuit, which is comprised of a previous token head and an induction head. Either of those heads improves loss only in the case the other ...

train speech exploring ea relevance pruning hypothesis dense lottery tickets l th rationalist sgd neel nanda lesswrong carbin arike mnist

Segment Anything Model and the Hard Problems of Computer Vision — with Joseph Nelson of Roboflow

Play Episode Listen Later Apr 13, 2023 79:35

2023 is the year of Multimodal AI, and Latent Space is going multimodal too! * This podcast comes with a video demo at the 1hr mark and it's a good excuse to launch our YouTube - please subscribe! * We are also holding two events in San Francisco — the first AI | UX meetup next week (already full; we'll send a recap here on the newsletter) and Latent Space Liftoff Day on May 4th (signup here; but get in touch if you have a high profile launch you'd like to make). * We also joined the Chroma/OpenAI ChatGPT Plugins Hackathon last week where we won the Turing and Replit awards and met some of you in person!This post featured on Hacker News.Out of the five senses of the human body, I'd put sight at the very top. But weirdly when it comes to AI, Computer Vision has felt left out of the recent wave compared to image generation, text reasoning, and even audio transcription. We got our first taste of it with the OCR capabilities demo in the GPT-4 Developer Livestream, but to date GPT-4's vision capability has not yet been released. Meta AI leapfrogged OpenAI and everyone else by fully open sourcing their Segment Anything Model (SAM) last week, complete with paper, model, weights, data (6x more images and 400x more masks than OpenImages), and a very slick demo website. This is a marked change to their previous LLaMA release, which was not commercially licensed. The response has been ecstatic:SAM was the talk of the town at the ChatGPT Plugins Hackathon and I was fortunate enough to book Joseph Nelson who was frantically integrating SAM into Roboflow this past weekend. As a passionate instructor, hacker, and founder, Joseph is possibly the single best person in the world to bring the rest of us up to speed on the state of Computer Vision and the implications of SAM. I was already a fan of him from his previous pod with (hopefully future guest) Beyang Liu of Sourcegraph, so this served as a personal catchup as well. Enjoy! and let us know what other news/models/guests you'd like to have us discuss! - swyxRecorded in-person at the beautiful StudioPod studios in San Francisco.Full transcript is below the fold.Show Notes* Joseph's links: Twitter, Linkedin, Personal* Sourcegraph Podcast and Game Theory Story* Represently* Roboflow at Pioneer and YCombinator* Udacity Self Driving Car dataset story* Computer Vision Annotation Formats* SAM recap - top things to know for those living in a cave* https://segment-anything.com/* https://segment-anything.com/demo* https://arxiv.org/pdf/2304.02643.pdf * https://ai.facebook.com/blog/segment-anything-foundation-model-image-segmentation/* https://blog.roboflow.com/segment-anything-breakdown/* https://ai.facebook.com/datasets/segment-anything/* Ask Roboflow https://ask.roboflow.ai/* GPT-4 Multimodal https://blog.roboflow.com/gpt-4-impact-speculation/Cut for time:* WSJ mention* Des Moines Register story* All In Pod: timestamped mention* In Forbes: underrepresented investors in Series A* Roboflow greatest hits* https://blog.roboflow.com/mountain-dew-contest-computer-vision/* https://blog.roboflow.com/self-driving-car-dataset-missing-pedestrians/* https://blog.roboflow.com/nerualhash-collision/ and Apple CSAM issue * https://www.rf100.org/Timestamps* [00:00:19] Introducing Joseph* [00:02:28] Why Iowa* [00:05:52] Origin of Roboflow* [00:16:12] Why Computer Vision* [00:17:50] Computer Vision Use Cases* [00:26:15] The Economics of Annotation/Segmentation* [00:32:17] Computer Vision Annotation Formats* [00:36:41] Intro to Computer Vision & Segmentation* [00:39:08] YOLO* [00:44:44] World Knowledge of Foundation Models* [00:46:21] Segment Anything Model* [00:51:29] SAM: Zero Shot Transfer* [00:51:53] SAM: Promptability* [00:53:24] SAM: Model Assisted Labeling* [00:56:03] SAM doesn't have labels* [00:59:23] Labeling on the Browser* [01:00:28] Roboflow + SAM Video Demo * [01:07:27] Future Predictions* [01:08:04] GPT4 Multimodality* [01:09:27] Remaining Hard Problems* [01:13:57] Ask Roboflow (2019)* [01:15:26] How to keep up in AITranscripts[00:00:00] Hello everyone. It is me swyx and I'm here with Joseph Nelson. Hey, welcome to the studio. It's nice. Thanks so much having me. We, uh, have a professional setup in here.[00:00:19] Introducing Joseph[00:00:19] Joseph, you and I have known each other online for a little bit. I first heard about you on the Source Graph podcast with bian and I highly, highly recommend that there's a really good game theory story that is the best YC application story I've ever heard and I won't tease further cuz they should go listen to that.[00:00:36] What do you think? It's a good story. It's a good story. It's a good story. So you got your Bachelor of Economics from George Washington, by the way. Fun fact. I'm also an econ major as well. You are very politically active, I guess you, you did a lot of, um, interning in political offices and you were responding to, um, the, the, the sheer amount of load that the Congress people have in terms of the, the support.[00:01:00] So you built, representing, which is Zendesk for Congress. And, uh, I liked in your source guide podcast how you talked about how being more responsive to, to constituents is always a good thing no matter what side of the aisle you're on. You also had a sideline as a data science instructor at General Assembly.[00:01:18] As a consultant in your own consultancy, and you also did a bunch of hackathon stuff with Magic Sudoku, which is your transition from N L P into computer vision. And apparently at TechCrunch Disrupt, disrupt in 2019, you tried to add chess and that was your whole villain origin story for, Hey, computer vision's too hard.[00:01:36] That's full, the platform to do that. Uh, and now you're co-founder c e o of RoboFlow. So that's your bio. Um, what's not in there that[00:01:43] people should know about you? One key thing that people realize within maybe five minutes of meeting me, uh, I'm from Iowa. Yes. And it's like a funnily novel thing. I mean, you know, growing up in Iowa, it's like everyone you know is from Iowa.[00:01:56] But then when I left to go to school, there was not that many Iowans at gw and people were like, oh, like you're, you're Iowa Joe. Like, you know, how'd you find out about this school out here? I was like, oh, well the Pony Express was running that day, so I was able to send. So I really like to lean into it.[00:02:11] And so you kind of become a default ambassador for places that. People don't meet a lot of other people from, so I've kind of taken that upon myself to just make it be a, a part of my identity. So, you know, my handle everywhere Joseph of Iowa, like I I, you can probably find my social security number just from knowing that that's my handle.[00:02:25] Cuz I put it plastered everywhere. So that's, that's probably like one thing.[00:02:28] Why Iowa[00:02:28] What's your best pitch for Iowa? Like why is[00:02:30] Iowa awesome? The people Iowa's filled with people that genuinely care. You know, if you're waiting a long line, someone's gonna strike up a conversation, kinda ask how you were Devrel and it's just like a really genuine place.[00:02:40] It was a wonderful place to grow up too at the time, you know, I thought it was like, uh, yeah, I was kind of embarrassed and then be from there. And then I actually kinda looking back it's like, wow, you know, there's good schools, smart people friendly. The, uh, high school that I went to actually Ben Silverman, the CEO and, or I guess former CEO and co-founder of Pinterest and I have the same teachers in high school at different.[00:03:01] The co-founder, or excuse me, the creator of crispr, the gene editing technique, Dr. Jennifer. Doudna. Oh, so that's the patent debate. There's Doudna. Oh, and then there's Fang Zang. Uh, okay. Yeah. Yeah. So Dr. Fang Zang, who I think ultimately won the patent war, uh, but is also from the same high school.[00:03:18] Well, she won the patent, but Jennifer won the[00:03:20] prize.[00:03:21] I think that's probably, I think that's probably, I, I mean I looked into it a little closely. I think it was something like she won the patent for CRISPR first existing and then Feng got it for, uh, first use on humans, which I guess for commercial reasons is the, perhaps more, more interesting one. But I dunno, biolife Sciences, is that my area of expertise?[00:03:38] Yep. Knowing people that came from Iowa that do cool things, certainly is. Yes. So I'll claim it. Um, but yeah, I, I, we, um, at Roble actually, we're, we're bringing the full team to Iowa for the very first time this last week of, of April. And, well, folks from like Scotland all over, that's your company[00:03:54] retreat.[00:03:54] The Iowa,[00:03:55] yeah. Nice. Well, so we do two a year. You know, we've done Miami, we've done. Some of the smaller teams have done like Nashville or Austin or these sorts of places, but we said, you know, let's bring it back to kinda the origin and the roots. Uh, and we'll, we'll bring the full team to, to Des Moines, Iowa.[00:04:13] So, yeah, like I was mentioning, folks from California to Scotland and many places in between are all gonna descend upon Des Moines for a week of, uh, learning and working. So maybe you can check in with those folks. If, what do they, what do they decide and interpret about what's cool. Our state. Well, one thing, are you actually headquartered in Des Moines on paper?[00:04:30] Yes. Yeah.[00:04:30] Isn't that amazing? That's like everyone's Delaware and you're like,[00:04:33] so doing research. Well, we're, we're incorporated in Delaware. Okay. We we're Delaware Sea like, uh, most companies, but our headquarters Yeah. Is in Des Moines. And part of that's a few things. One, it's like, you know, there's this nice Iowa pride.[00:04:43] And second is, uh, Brad and I both grew up in Brad Mc, co-founder and I grew up in, in Des Moines. And we met each other in the year 2000. We looked it up for the, the YC app. So, you know, I think, I guess more of my life I've known Brad than not, uh, which is kind of crazy. Wow. And during yc, we did it during 2020, so it was like the height of Covid.[00:05:01] And so we actually got a house in Des Moines and lived, worked outta there. I mean, more credit to. So I moved back. I was living in DC at the time, I moved back to to Des Moines. Brad was living in Des Moines, but he moved out of a house with his. To move into what we called our hacker house. And then we had one, uh, member of the team as well, Jacob Sorowitz, who moved from Minneapolis down to Des Moines for the summer.[00:05:21] And frankly, uh, code was a great time to, to build a YC company cuz there wasn't much else to do. I mean, it's kinda like wash your groceries and code. It's sort of the, that was the routine[00:05:30] and you can use, uh, computer vision to help with your groceries as well.[00:05:33] That's exactly right. Tell me what to make.[00:05:35] What's in my fridge? What should I cook? Oh, we'll, we'll, we'll cover[00:05:37] that for with the G P T four, uh, stuff. Exactly. Okay. So you have been featured with in a lot of press events. Uh, but maybe we'll just cover the origin story a little bit in a little bit more detail. So we'll, we'll cover robo flow and then we'll cover, we'll go into segment anything.[00:05:52] Origin of Roboflow[00:05:52] But, uh, I think it's important for people to understand. Robo just because it gives people context for what you're about to show us at the end of the podcast. So Magic Sudoku tc, uh, techers Disrupt, and then you go, you join Pioneer, which is Dan Gross's, um, YC before yc.[00:06:07] Yeah. That's how I think about it.[00:06:08] Yeah, that's a good way. That's a good description of it. Yeah. So I mean, robo flow kind of starts as you mentioned with this magic Sudoku thing. So you mentioned one of my prior business was a company called Represent, and you nailed it. I mean, US Congress gets 80 million messages a year. We built tools that auto sorted them.[00:06:23] They didn't use any intelligent auto sorting. And this is somewhat a solved problem in natural language processing of doing topic modeling or grouping together similar sentiment and things like this. And as you mentioned, I'd like, I worked in DC for a bit and been exposed to some of these problems and when I was like, oh, you know, with programming you can build solutions.[00:06:40] And I think the US Congress is, you know, the US kind of United States is a support center, if you will, and the United States is sports center runs on pretty old software, so mm-hmm. We, um, we built a product for that. It was actually at the time when I was working on representing. Brad, his prior business, um, is a social games company called Hatchlings.[00:07:00] Uh, he phoned me in, in 2017, apple had released augmented reality kit AR kit. And Brad and I are both kind of serial hackers, like I like to go to hackathons, don't really understand new technology until he build something with them type folks. And when AR Kit came out, Brad decided he wanted to build a game with it that would solve Sudoku puzzles.[00:07:19] And the idea of the game would be you take your phone, you hover hold it over top of a Sudoku puzzle, it recognizes the state of the board where it is, and then it fills it all in just right before your eyes. And he phoned me and I was like, Brad, this sounds awesome and sounds like you kinda got it figured out.[00:07:34] What, what's, uh, what, what do you think I can do here? It's like, well, the machine learning piece of this is the part that I'm most uncertain about. Uh, doing the digit recognition and, um, filling in some of those results. I was like, well, I mean digit recognition's like the hell of world of, of computer vision.[00:07:48] That's Yeah, yeah, MNIST, right. So I was like, that that part should be the, the easy part. I was like, ah, I'm, he's like, I'm not so super sure, but. You know, the other parts, the mobile ar game mechanics, I've got pretty well figured out. I was like, I, I think you're wrong. I think you're thinking about the hard part is the easy part.[00:08:02] And he is like, no, you're wrong. The hard part is the easy part. And so long story short, we built this thing and released Magic Sudoku and it kind of caught the Internet's attention of what you could do with augmented reality and, and with computer vision. It, you know, made it to the front ofer and some subreddits it run Product Hunt Air app of the year.[00:08:20] And it was really a, a flash in the pan type app, right? Like we were both running separate companies at the time and mostly wanted to toy around with, with new technology. And, um, kind of a fun fact about Magic Sudoku winning product Hunt Air app of the year. That was the same year that I think the model three came out.[00:08:34] And so Elon Musk won a Golden Kitty who we joked that we share an award with, with Elon Musk. Um, the thinking there was that this is gonna set off a, a revolution of if two random engineers can put together something that makes something, makes a game programmable and at interactive, then surely lots of other engineers will.[00:08:53] Do similar of adding programmable layers on top of real world objects around us. Earlier we were joking about objects in your fridge, you know, and automatically generating recipes and these sorts of things. And like I said, that was 2017. Roboflow was actually co-found, or I guess like incorporated in, in 2019.[00:09:09] So we put this out there, nothing really happened. We went back to our day jobs of, of running our respective businesses, I sold Represently and then as you mentioned, kind of did like consulting stuff to figure out the next sort of thing to, to work on, to get exposed to various problems. Brad appointed a new CEO at his prior business and we got together that summer of 2019.[00:09:27] We said, Hey, you know, maybe we should return to that idea that caught a lot of people's attention and shows what's possible. And you know what, what kind of gives, like the future is here. And we have no one's done anything since. No one's done anything. So why is, why are there not these, these apps proliferated everywhere.[00:09:42] Yeah. And so we said, you know, what we'll do is, um, to add this software layer to the real world. Will build, um, kinda like a super app where if you pointed it at anything, it will recognize it and then you can interact with it. We'll release a developer platform and allow people to make their own interfaces, interactivity for whatever object they're looking at.[00:10:04] And we decided to start with board games because one, we had a little bit of history there with, with Sudoku two, there's social by default. So if one person, you know finds it, then they'd probably share it among their friend. Group three. There's actually relatively few barriers to entry aside from like, you know, using someone else's brand name in your, your marketing materials.[00:10:19] Yeah. But other than that, there's no real, uh, inhibitors to getting things going and, and four, it's, it's just fun. It would be something that'd be bring us enjoyment to work on. So we spent that summer making, uh, boggle the four by four word game provable, where, you know, unlike Magic Sudoku, which to be clear, totally ruins the game, uh, you, you have to solve Sudoku puzzle.[00:10:40] You don't need to do anything else. But with Boggle, if you and I are playing, we might not find all of the words that adjacent letter tiles. Unveil. So if we have a, an AI tell us, Hey, here's like the best combination of letters that make high scoring words. And so we, we made boggle and released it and that, and that did okay.[00:10:56] I mean maybe the most interesting story was there's a English as a second language program in, in Canada that picked it up and used it as a part of their curriculum to like build vocabulary, which I thought was kind of inspiring. Example, and what happens just when you put things on the internet and then.[00:11:09] We wanted to build one for chess. So this is where you mentioned we went to 2019. TechCrunch Disrupt TechCrunch. Disrupt holds a Hackathon. And this is actually, you know, when Brad and I say we really became co-founders, because we fly out to San Francisco, we rent a hotel room in the Tenderloin. We, uh, we, we, uh, have one room and there's like one, there's room for one bed, and then we're like, oh, you said there was a cot, you know, on the, on the listing.[00:11:32] So they like give us a little, a little cot, the end of the cot, like bled and over into like the bathroom. So like there I am sleeping on the cot with like my head in the bathroom and the Tenderloin, you know, fortunately we're at a hackathon glamorous. Yeah. There wasn't, there wasn't a ton of sleep to be had.[00:11:46] There is, you know, we're, we're just like making and, and shipping these, these sorts of many[00:11:50] people with this hack. So I've never been to one of these things, but[00:11:52] they're huge. Right? Yeah. The Disrupt Hackathon, um, I don't, I don't know numbers, but few hundreds, you know, classically had been a place where it launched a lot of famous Yeah.[00:12:01] Sort of flare. Yeah. And I think it's, you know, kind of slowed down as a place for true company generation. But for us, Brad and I, who likes just doing hackathons, being, making things in compressed time skills, it seemed like a, a fun thing to do. And like I said, we'd been working on things, but it was only there that like, you're, you're stuck in a maybe not so great glamorous situation together and you're just there to make a, a program and you wanna make it be the best and compete against others.[00:12:26] And so we add support to the app that we were called was called Board Boss. We couldn't call it anything with Boggle cause of IP rights were called. So we called it Board Boss and it supported Boggle and then we were gonna support chess, which, you know, has no IP rights around it. Uh, it's an open game.[00:12:39] And we did so in 48 hours, we built an app that, or added fit capability to. Point your phone at a chess board. It understands the state of the chess board and converts it to um, a known notation. Then it passes that to stock fish, the open source chess engine for making move recommendations and it makes move recommendations to, to players.[00:13:00] So you could either play against like an ammunition to AI or improve your own game. We learn that one of the key ways users like to use this was just to record their games. Cuz it's almost like reviewing game film of what you should have done differently. Game. Yeah, yeah, exactly. And I guess the highlight of, uh, of chess Boss was, you know, we get to the first round of judging, we get to the second round of judging.[00:13:16] And during the second round of judging, that's when like, TechCrunch kind of brings around like some like celebs and stuff. They'll come by. Evan Spiegel drops by Ooh. Oh, and he uh, he comes up to our, our, our booth and um, he's like, oh, so what does, what does this all do? And you know, he takes an interest in it cuz the underpinnings of, of AR interacting with the.[00:13:33] And, uh, he is kinda like, you know, I could use this to like cheat on chess with my friends. And we're like, well, you know, that wasn't exactly the, the thesis of why we made it, but glad that, uh, at least you think it's kind of neat. Um, wait, but he already started Snapchat by then? Oh, yeah. Oh yeah. This, this is 2019, I think.[00:13:49] Oh, okay, okay. Yeah, he was kind of just checking out things that were new and, and judging didn't end up winning any, um, awards within Disrupt, but I think what we won was actually. Maybe more important maybe like the, the quote, like the co-founders medal along the way. Yep. The friends we made along the way there we go to, to play to the meme.[00:14:06] I would've preferred to win, to be clear. Yes. You played a win. So you did win, uh,[00:14:11] $15,000 from some Des Moines, uh, con[00:14:14] contest. Yeah. Yeah. The, uh, that was nice. Yeah. Slightly after that we did, we did win. Um, some, some grants and some other things for some of the work that we've been doing. John Papa John supporting the, uh, the local tech scene.[00:14:24] Yeah. Well, so there's not the one you're thinking of. Okay. Uh, there's a guy whose name is Papa John, like that's his, that's his, that's his last name. His first name is John. So it's not the Papa John's you're thinking of that has some problematic undertones. It's like this guy who's totally different. I feel bad for him.[00:14:38] His press must just be like, oh, uh, all over the place. But yeah, he's this figure in the Iowa entrepreneurial scene who, um, he actually was like doing SPACs before they were cool and these sorts of things, but yeah, he funds like grants that encourage entrepreneurship in the state. And since we'd done YC and in the state, we were eligible for some of the awards that they were providing.[00:14:56] But yeah, it was disrupt that we realized, you know, um, the tools that we made, you know, it took us better part of a summer to add Boggle support and it took us 48 hours to add chest support. So adding the ability for programmable interfaces for any object, we built a lot of those internal tools and our apps were kind of doing like the very famous shark fin where like it picks up really fast, then it kind of like slowly peters off.[00:15:20] Mm-hmm. And so we're like, okay, if we're getting these like shark fin graphs, we gotta try something different. Um, there's something different. I remember like the week before Thanksgiving 2019 sitting down and we wrote this Readme for, actually it's still the Readme at the base repo of Robo Flow today has spent relatively unedited of the manifesto.[00:15:36] Like, we're gonna build tools that enable people to make the world programmable. And there's like six phases and, you know, there's still, uh, many, many, many phases to go into what we wrote even at that time to, to present. But it's largely been, um, right in line with what we thought we would, we would do, which is give engineers the tools to add software to real world objects, which is largely predicated on computer vision. So finding the right images, getting the right sorts of video frames, maybe annotating them, uh, finding the right sort of models to use to do this, monitoring the performance, all these sorts of things. And that from, I mean, we released that in early 2020, and it's kind of, that's what's really started to click.[00:16:12] Why Computer Vision[00:16:12] Awesome. I think we should just kind[00:16:13] of[00:16:14] go right into where you are today and like the, the products that you offer, just just to give people an overview and then we can go into the, the SAM stuff. So what is the clear, concise elevator pitch? I think you mentioned a bunch of things like make the world programmable so you don't ha like computer vision is a means to an end.[00:16:30] Like there's, there's something beyond that. Yeah.[00:16:32] I mean, the, the big picture mission for the business and the company and what we're working on is, is making the world programmable, making it read and write and interactive, kind of more entertaining, more e. More fun and computer vision is the technology by which we can achieve that pretty quickly.[00:16:48] So like the one liner for the, the product in, in the company is providing engineers with the tools for data and models to build programmable interfaces. Um, and that can be workflows, that could be the, uh, data processing, it could be the actual model training. But yeah, Rob helps you use production ready computer vision workflows fast.[00:17:10] And I like that.[00:17:11] In part of your other pitch that I've heard, uh, is that you basically scale from the very smallest scales to the very largest scales, right? Like the sort of microbiology use case all the way to[00:17:20] astronomy. Yeah. Yeah. The, the joke that I like to make is like anything, um, underneath a microscope and, and through a telescope and everything in between needs to, needs to be seen.[00:17:27] I mean, we have people that run models in outer space, uh, underwater remote places under supervision and, and known places. The crazy thing is that like, All parts of, of not just the world, but the universe need to be observed and understood and acted upon. So vision is gonna be, I dunno, I feel like we're in the very, very, very beginnings of all the ways we're gonna see it.[00:17:50] Computer Vision Use Cases[00:17:50] Awesome. Let's go into a lo a few like top use cases, cuz I think that really helps to like highlight the big names that you've, big logos that you've already got. I've got Walmart and Cardinal Health, but I don't, I don't know if you wanna pull out any other names, like, just to illustrate, because the reason by the way, the reason I think that a lot of developers don't get into computer vision is because they think they don't need it.[00:18:11] Um, or they think like, oh, like when I do robotics, I'll do it. But I think if, if you see like the breadth of use cases, then you get a little bit more inspiration as to like, oh, I can use[00:18:19] CVS lfa. Yeah. It's kind of like, um, you know, by giving, by making it be so straightforward to use vision, it becomes almost like a given that it's a set of features that you could power on top of it.[00:18:32] And like you mentioned, there's, yeah, there's Fortune One there over half the Fortune 100. I've used the, the tools that Robel provides just as much as 250,000 developers. And so over a quarter million engineers finding and developing and creating various apps, and I mean, those apps are, are, are far and wide.[00:18:49] Just as you mentioned. I mean everything from say, like, one I like to talk about was like sushi detection of like finding the like right sorts of fish and ingredients that are in a given piece of, of sushi that you're looking at to say like roof estimation of like finding. If there's like, uh, hail damage on, on a given roof, of course, self-driving cars and understanding the scenes around us is sort of the, you know, very early computer vision everywhere.[00:19:13] Use case hardhat detection, like finding out if like a given workplace is, is, is safe, uh, disseminate, have the right p p p on or p p e on, are there the right distance from various machines? A huge place that vision has been used is environmental monitoring. Uh, what's the count of species? Can we verify that the environment's not changing in unexpected ways or like river banks are become, uh, becoming recessed in ways that we anticipate from satellite imagery, plant phenotyping.[00:19:37] I mean, people have used these apps for like understanding their plants and identifying them. And that dataset that's actually largely open, which is what's given a proliferation to the iNaturalist, is, is that whole, uh, hub of, of products. Lots of, um, people that do manufacturing. So, like Rivian for example, is a Rubal customer, and you know, they're trying to scale from 1000 cars to 25,000 cars to a hundred thousand cars in very short order.[00:20:00] And that relies on having the. Ability to visually ensure that every part that they're making is produced correctly and right in time. Medical use cases. You know, there's actually, this morning I was emailing with a user who's accelerating early cancer detection through breaking apart various parts of cells and doing counts of those cells.[00:20:23] And actually a lot of wet lab work that folks that are doing their PhDs or have done their PhDs are deeply familiar with that is often required to do very manually of, of counting, uh, micro plasms or, or things like this. There's. All sorts of, um, like traffic counting and smart cities use cases of understanding curb utilization to which sort of vehicles are, are present.[00:20:44] Uh, ooh. That can be[00:20:46] really good for city planning actually.[00:20:47] Yeah. I mean, one of our customers does exactly this. They, they measure and do they call it like smart curb utilization, where uhhuh, they wanna basically make a curb be almost like a dynamic space where like during these amounts of time, it's zoned for this during these amounts of times.[00:20:59] It's zoned for this based on the flows and e ebbs and flows of traffic throughout the day. So yeah, I mean the, the, the truth is that like, you're right, it's like a developer might be like, oh, how would I use vision? And then all of a sudden it's like, oh man, all these things are at my fingertips. Like I can just, everything you can see.[00:21:13] Yeah. Right. I can just, I can just add functionality for my app to understand and ingest the way, like, and usually the way that someone gets like almost nerd sniped into this is like, they have like a home automation project, so it's like send Yeah. Give us a few. Yeah. So send me a text when, um, a package shows up so I can like prevent package theft so I can like go down and grab it right away or.[00:21:29] We had a, uh, this one's pretty, pretty niche, but it's pretty funny. There was this guy who, during the pandemic wa, wanted to make sure his cat had like the proper, uh, workout. And so I've shared the story where he basically decided that. He'd make a cat workout machine with computer vision, you might be alone.[00:21:43] You're like, what does that look like? Well, what he decided was he would take a robotic arm strap, a laser pointer to it, and then train a machine to recognize his cat and his cat only, and point the laser pointer consistently 10 feet away from the cat. There's actually a video of you if you type an YouTube cat laser turret, you'll find Dave's video.[00:22:01] Uh, and hopefully Dave's cat has, has lost the weight that it needs to, cuz that's just the, that's an intense workout I have to say. But yeah, so like, that's like a, um, you know, these, uh, home automation projects are pretty common places for people to get into smart bird feeders. I've seen people that like are, are logging and understanding what sort of birds are, uh, in their background.[00:22:18] There's a member of our team that was working on actually this as, as a whole company and has open sourced a lot of the data for doing bird species identification. And now there's, I think there's even a company that's, uh, founded to create like a smart bird feeder, like captures photos and tells you which ones you've attracted to your yard.[00:22:32] I met that. Do, you know, get around the, uh, car sharing company that heard it? Them never used them. They did a SPAC last year and they had raised at like, They're unicorn. They raised at like 1.2 billion, I think in the, the prior round and inspected a similar price. I met the CTO of, of Getaround because he was, uh, using Rob Flow to hack into his Tesla cameras to identify other vehicles that are like often nearby him.[00:22:56] So he's basically building his own custom license plate recognition, and he just wanted like, keep, like, keep tabs of like, when he drives by his friends or when he sees like regular sorts of folks. And so he was doing like automated license plate recognition by tapping into his, uh, camera feeds. And by the way, Elliot's like one of the like OG hackers, he was, I think one of the very first people to like, um, she break iPhones and, and these sorts of things.[00:23:14] Mm-hmm. So yeah, the project that I want, uh, that I'm gonna work on right now for my new place in San Francisco is. There's two doors. There's like a gate and then the other door. And sometimes we like forget to close, close the gate. So like, basically if it sees that the gate is open, it'll like send us all a text or something like this to make sure that the gate is, is closed at the front of our house.[00:23:32] That's[00:23:32] really cool. And I'll, I'll call out one thing that readers and listeners can, uh, read out on, on your history. One of your most popular initial, um, viral blog post was about, um, autonomous vehicle data sets and how, uh, the one that Udacity was using was missing like one third of humans. And, uh, it's not, it's pretty problematic for cars to miss humans.[00:23:53] Yeah, yeah, actually, so yeah, the Udacity self-driving car data set, which look to their credit, it was just meant to be used for, for academic use. Um, and like as a part of courses on, on Udacity, right? Yeah. But the, the team that released it, kind of hastily labeled and let it go out there to just start to use and train some models.[00:24:11] I think that likely some, some, uh, maybe commercial use cases maybe may have come and, and used, uh, the dataset, who's to say? But Brad and I discovered this dataset. And when we were working on dataset improvement tools at Rob Flow, we ran through our tools and identified some like pretty, as you mentioned, key issues.[00:24:26] Like for example, a lot of strollers weren't labeled and I hope our self-driving cars do those, these sorts of things. And so we relabeled the whole dataset by hand. I have this very fond memory is February, 2020. Brad and I are in Taiwan. So like Covid is actually just, just getting going. And the reason we were there is we were like, Hey, we can work on this from anywhere for a little bit.[00:24:44] And so we spent like a, uh, let's go closer to Covid. Well, you know, I like to say we uh, we got early indicators of, uh, how bad it was gonna be. I bought a bunch of like N 90 fives before going o I remember going to the, the like buying a bunch of N 95 s and getting this craziest look like this like crazy tin hat guy.[00:25:04] Wow. What is he doing? And then here's how you knew. I, I also got got by how bad it was gonna be. I left all of them in Taiwan cuz it's like, oh, you all need these. We'll be fine over in the us. And then come to find out, of course that Taiwan was a lot better in terms of, um, I think, yeah. Safety. But anyway, we were in Taiwan because we had planned this trip and you know, at the time we weren't super sure about the, uh, covid, these sorts of things.[00:25:22] We always canceled it. We didn't, but I have this, this very specific time. Brad and I were riding on the train from Clay back to Taipei. It's like a four hour ride. And you mentioned Pioneer earlier, we were competing in Pioneer, which is almost like a gamified to-do list. Mm-hmm. Every week you say what you're gonna do and then other people evaluate.[00:25:37] Did you actually do the things you said you were going to do? One of the things we said we were gonna do was like this, I think re-release of this data set. And so it's like late, we'd had a whole week, like, you know, weekend behind us and, uh, we're on this train and it was very unpleasant situation, but we relabeled this, this data set, and one sitting got it submitted before like the Sunday, Sunday countdown clock starts voting for, for.[00:25:57] And, um, once that data got out back out there, just as you mentioned, it kind of picked up and Venture beat, um, noticed and wrote some stories about it. And we really rereleased of course, the data set that we did our best job of labeling. And now if anyone's listening, they can probably go out and like find some errors that we surely still have and maybe call us out and, you know, put us, put us on blast.[00:26:15] The Economics of Annotation (Segmentation)[00:26:15] But,[00:26:16] um, well, well the reason I like this story is because it, it draws attention to the idea that annotation is difficult and basically anyone looking to use computer vision in their business who may not have an off-the-shelf data set is going to have to get involved in annotation. And I don't know what it costs.[00:26:34] And that's probably one of the biggest hurdles for me to estimate how big a task this is. Right? So my question at a higher level is tell the customers, how do you tell customers to estimate the economics of annotation? Like how many images do, do we need? How much, how long is it gonna take? That, that kinda stuff.[00:26:50] How much money and then what are the nuances to doing it well, right? Like, cuz obviously Udacity had a poor quality job, you guys had proved it, and there's errors every everywhere. Like where do[00:26:59] these things go wrong? The really good news about annotation in general is that like annotation of course is a means to an end to have a model be able to recognize a thing.[00:27:08] Increasingly there's models that are coming out that can recognize things zero shot without any annotation, which we're gonna talk about. Yeah. Which, we'll, we'll talk more about that in a moment. But in general, the good news is that like the trend is that annotation is gonna become decreasingly a blocker to starting to use computer vision in meaningful ways.[00:27:24] Now that said, just as you mentioned, there's a lot of places where you still need to do. Annotation. I mean, even with these zero shot models, they might have of blind spots, or maybe you're a business, as you mentioned, that you know, it's proprietary data. Like only Rivian knows what a rivian is supposed to look like, right?[00:27:39] Uh, at the time of, at the time of it being produced, like underneath the hood and, and all these sorts of things. And so, yeah, that's gonna necessarily require annotation. So your question of how long is it gonna take, how do you estimate these sorts of things, it really comes down to the complexity of the problem that you're solving and the amount of variance in the scene.[00:27:57] So let's give some contextual examples. If you're trying to recognize, we'll say a scratch on one specific part and you have very strong lighting. You might need fewer images because you control the lighting, you know the exact part and maybe you're lucky in the scratch. Happens more often than not in similar parts or similar, uh, portions of the given part.[00:28:17] So in that context, you, you, the function of variance, the variance is, is, is lower. So the number of images you need is also lower to start getting up to work. Now the orders of magnitude we're talking about is that like you can have an initial like working model from like 30 to 50 images. Yeah. In this context, which is shockingly low.[00:28:32] Like I feel like there's kind of an open secret in computer vision now, the general heuristic that often. Users, is that like, you know, maybe 200 images per class is when you start to have a model that you can rely[00:28:45] on? Rely meaning like 90, 99, 90, 90%, um,[00:28:50] uh, like what's 85 plus 85? Okay. Um, that's good. Again, these are very, very finger in the wind estimates cuz the variance we're talking about.[00:28:59] But the real question is like, at what point, like the framing is not like at what point do it get to 99, right? The framing is at what point can I use this thing to be better than the alternative, which is humans, which maybe humans or maybe like this problem wasn't possible at all. And so usually the question isn't like, how do I get to 99?[00:29:15] A hundred percent? It's how do I ensure that like the value I am able to get from putting this thing in production is greater than the alternative? In fact, even if you have a model that's less accurate than humans, there might be some circumstances where you can tolerate, uh, a greater amount of inaccuracy.[00:29:32] And if you look at the accuracy relative to the cost, Using a model is extremely cheap. Using a human for the same sort of task can be very expensive. Now, in terms of the actual accuracy of of what you get, there's probably some point at which the cost, but relative accuracy exceeds of a model, exceeds the high cost and hopefully high accuracy of, of a human comparable, like for example, there's like cameras that will track soccer balls or track events happening during sporting matches.[00:30:02] And you can go through and you know, we actually have users that work in sports analytics. You can go through and have a human. Hours and hours of footage. Cuz not just watching their team, they're watching every other team, they're watching scouting teams, they're watching junior teams, they're watching competitors.[00:30:15] And you could have them like, you know, track and follow every single time the ball goes within blank region of the field or every time blank player goes into, uh, this portion of the field. And you could have, you know, exact, like a hundred percent accuracy if that person, maybe, maybe not a hundred, a human may be like 95, 90 7% accuracy of every single time the ball is in this region or this player is on the field.[00:30:36] Truthfully, maybe if you're scouting analytics, you actually don't need 97% accuracy of knowing that that player is on the field. And in fact, if you can just have a model run at a 1000th, a 10000th of the cost and goes through and finds all the times that Messi was present on the field mm-hmm. That the ball was in this region of the.[00:30:54] Then even if that model is slightly less accurate, the cost is just so orders of magnitude different. And the stakes like the stakes of this problem, of knowing like the total number of minutes that Messi played will say are such that we have a higher air tolerance, that it's a no-brainer to start to use Yeah, a computer vision model in this context.[00:31:12] So not every problem requires equivalent or greater human performance. Even when it does, you'd be surprised at how fast models get there. And in the times when you, uh, really look at a problem, the question is, how much accuracy do I need to start to get value from this? This thing, like the package example is a great one, right?[00:31:27] Like I could in theory set up a camera that's constantly watching in front of my porch and I could watch the camera whenever I have a package and then go down. But of course, I'm not gonna do that. I value my time to do other sorts of things instead. And so like there, there's this net new capability of, oh, great, I can have an always on thing that tells me when a package shows up, even if you know the, the thing that's gonna text me.[00:31:46] When a package shows up, let's say a flat pack shows up instead of a box and it doesn't know what a flat pack likes, looks like initially. Doesn't matter. It doesn't matter because I didn't have this capability at all before. And I think that's the true case where a lot of computer vision problems exist is like it.[00:32:00] It's like you didn't even have this capability, this superpower before at all, let alone assigning a given human to do the task. And that's where we see like this explosion of, of value.[00:32:10] Awesome. Awesome. That was a really good overview. I want to leave time for the others, but I, I really want to dive into a couple more things with regards to Robo Flow.[00:32:17] Computer Vision Annotation Formats[00:32:17] So one is, apparently your original pitch for Robo Flow was with regards to conversion tools for computer vision data sets. And I'm sure as, as a result of your job, you have a lot of rants. I've been digging for rants basically on like the best or the worst annotation formats. What do we know? Cause most of us, oh my gosh, we only know, like, you know, I like,[00:32:38] okay, so when we talk about computer vision annotation formats, what we're talking about is if you have an image and you, you picture a boing box around my face on that image.[00:32:46] Yeah. How do you describe where that Monty box is? X, Y, Z X Y coordinates. Okay. X, y coordinates. How, what do you mean from the top lefts.[00:32:52] Okay. You, you, you, you take X and Y and then, and then the. The length and, and the width of the, the[00:32:58] box. Okay. So you got like a top left coordinate and like the bottom right coordinate or like the, the center of the bottom.[00:33:02] Yeah. Yeah. Top, left, bottom right. Yeah. That's one type of format. Okay. But then, um, I come along and I'm like, you know what? I want to do a different format where I wanna just put the center of the box, right. And give the length and width. Right. And by the way, we didn't even talk about what X and Y we're talking about.[00:33:14] Is X a pixel count? Is a relative pixel count? Is it an absolute pixel count? So the point is, the number of ways to describe where a box lives in a freaking image is endless, uh, seemingly and. Everyone decided to kind of create their own different ways of describing the coordinates and positions of where in this context of bounding Box is present.[00:33:39] Uh, so there's some formats, for example, that like use re, so for the x and y, like Y is, uh, like the left, most part of the image is zero. And the right most part of the image is one. So the, the coordinate is like anywhere from zero to one. So 0.6 is, you know, 60% of your way right up the image to describe the coordinate.[00:33:53] I guess that was, that was X instead of Y. But the point is there, of the zero to one is the way that we determined where that was in the position, or we're gonna do an absolute pixel position anyway. We got sick, we got sick of all these different annotation formats. So why do you even have to convert between formats?[00:34:07] Is is another part of this, this story. So different training frameworks, like if you're using TensorFlow, you need like TF Records. If you're using PyTorch, it's probably gonna be, well it depends on like what model you're using, but someone might use Coco JSON with PyTorch. Someone else might use like a, just a YAML file and a text file.[00:34:21] And to describe the cor it's point is everyone that creates a model. Or creates a dataset rather, has created different ways of describing where and how a bounding box is present in the image. And we got sick of all these different formats and doing these in writing all these different converter scripts.[00:34:39] And so we made a tool that just converts from one script, one type of format to another. And the, the key thing is that like if you get that converter script wrong, your model doesn't not work. It just fails silently. Yeah. Because the bounding boxes are now all in the wrong places. And so you need a way to visualize and be sure that your converter script, blah, blah blah.[00:34:54] So that was the very first tool we released of robo. It was just a converter script, you know, like these, like these PDF to word converters that you find. It was basically that for computer vision, like dead simple, really annoying thing. And we put it out there and people found some, some value in, in that.[00:35:08] And you know, to this day that's still like a surprisingly painful[00:35:11] problem. Um, yeah, so you and I met at the Dall-E Hackathon at OpenAI, and we were, I was trying to implement this like face masking thing, and I immediately ran into that problem because, um, you know, the, the parameters that Dall-E expected were different from the one that I got from my face, uh, facial detection thing.[00:35:28] One day it'll go away, but that day is not today. Uh, the worst format that we work with is, is. The mart form, it just makes no sense. And it's like, I think, I think it's a one off annotation format that this university in China started to use to describe where annotations exist in a book mart. I, I don't know, I dunno why that So best[00:35:45] would be TF record or some something similar.[00:35:48] Yeah, I think like, here's your chance to like tell everybody to use one one standard and like, let's, let's, can[00:35:53] I just tell them to use, we have a package that does this for you. I'm just gonna tell you to use the row full package that converts them all, uh, for you. So you don't have to think about this. I mean, Coco JSON is pretty good.[00:36:04] It's like one of the larger industry norms and you know, it's in JS O compared to like V xml, which is an XML format and Coco json is pretty descriptive, but you know, it has, has its own sort of drawbacks and flaws and has random like, attribute, I dunno. Um, yeah, I think the best way to handle this problem is to not have to think about it, which is what we did.[00:36:21] We just created a, uh, library that, that converts and uses things. Uh, for us. We've double checked the heck out of it. There's been hundreds of thousands of people that have used the library and battle tested all these different formats to find those silent errors. So I feel pretty good about no longer having to have a favorite format and instead just rely on.[00:36:38] Dot load in the format that I need. Great[00:36:41] Intro to Computer Vision Segmentation[00:36:41] service to the community. Yeah. Let's go into segmentation because is at the top of everyone's minds, but before we get into segment, anything, I feel like we need a little bit of context on the state-of-the-art prior to Sam, which seems to be YOLO and uh, you are the leading expert as far as I know.[00:36:56] Yeah.[00:36:57] Computer vision, there's various task types. There's classification problems where we just like assign tags to images, like, you know, maybe safe work, not safe work, sort of tagging sort of stuff. Or we have object detection, which are the boing boxes that you see and all the formats I was mentioning in ranting about there's instant segmentation, which is the polygon shapes and produces really, really good looking demos.[00:37:19] So a lot of people like instant segmentation.[00:37:21] This would be like counting pills when you point 'em out on the, on the table. Yeah. So, or[00:37:25] soccer players on the field. So interestingly, um, counting you could do with bounding boxes. Okay. Cause you could just say, you know, a box around a person. Well, I could count, you know, 12 players on the field.[00:37:35] Masks are most useful. Polygons are most useful if you need very precise area measurements. So you have an aerial photo of a home and you want to know, and the home's not a perfect box, and you want to know the rough square footage of that home. Well, if you know the distance between like the drone and, and the ground.[00:37:53] And you have the precise polygon shape of the home, then you can calculate how big that home is from aerial photos. And then insurers can, you know, provide say accurate estimates and that's maybe why this is useful. So polygons and, and instant segmentation are, are those types of tasks? There's a key point detection task and key point is, you know, if you've seen those demos of like all the joints on like a hand kind of, kind of outlined, there's visual question answering tasks, visual q and a.[00:38:21] And that's like, you know, some of the stuff that multi-modality is absolutely crushing for, you know, here's an image, tell me what food is in this image. And then you can pass that and you can make a recipe out of it. But like, um, yeah, the visual question in answering task type is where multi-modality is gonna have and is already having an enormous impact.[00:38:40] So that's not a comprehensive survey, very problem type, but it's enough to, to go into why SAM is significant. So these various task types, you know, which model to use for which given circumstance. Most things is highly dependent on what you're ultimately aiming to do. Like if you need to run a model on the edge, you're gonna need a smaller model, cuz it is gonna run on edge, compute and process in, in, in real time.[00:39:01] If you're gonna run a model on the cloud, then of course you, uh, generally have more compute at your disposal Considerations like this now, uh,[00:39:08] YOLO[00:39:08] just to pause. Yeah. Do you have to explain YOLO first before you go to Sam, or[00:39:11] Yeah, yeah, sure. So, yeah. Yeah, we should. So object detection world. So for a while I talked about various different task types and you can kinda think about a slide scale of like classification, then obvious detection.[00:39:20] And on the right, at most point you have like segmentation tasks. Object detection. The bounding boxes is especially useful for a wide, like it's, it's surprisingly versatile. Whereas like classification is kind of brittle. Like you only have a tag for the whole image. Well, that doesn't, you can't count things with tags.[00:39:35] And on the other hand, like the mask side of things, like drawing masks is painstaking. And so like labeling is just a bit more difficult. Plus like the processing to produce masks requires more compute. And so usually a lot of folks kind of landed for a long time on obvious detection being a really happy medium of affording you with rich capabilities because you can do things like count, track, measure.[00:39:56] In some CAGR context with bounding boxes, you can see how many things are present. You can actually get a sense of how fast something's moving by tracking the object or bounding box across multiple frames and comparing the timestamp of where it was across those frames. So obviously detection is a very common task type that solves lots of things that you want do with a given model.[00:40:15] In obviously detection. There's been various model frameworks over time. So kind of really early on there's like R-CNN uh, then there's faster rc n n and these sorts of family models, which are based on like resnet kind of architectures. And then a big thing happens, and that is single shot detectors. So faster, rc n n despite its name is, is very slow cuz it takes two passes on the image.[00:40:37] Uh, the first pass is, it finds par pixels in the image that are most interesting to, uh, create a bounding box candidate out of. And then it passes that to a, a classifier that then does classification of the bounding box of interest. Right. Yeah. You can see, you can see why that would be slow. Yeah. Cause you have to do two passes.[00:40:53] You know, kind of actually led by, uh, like mobile net was I think the first large, uh, single shot detector. And as its name implies, it was meant to be run on edge devices and mobile devices and Google released mobile net. So it's a popular implementation that you find in TensorFlow. And what single shot detectors did is they said, Hey, instead of looking at the image twice, what if we just kind of have a, a backbone that finds candidate bounding boxes?[00:41:19] And then we, we set loss functions for objectness. We set loss function. That's a real thing. We set loss functions for objectness, like how much obj, how object do this part of the images. We send a loss function for classification, and then we run the image through the model on a single pass. And that saves lots of compute time and you know, it's not necessarily as accurate, but if you have lesser compute, it can be extremely useful.[00:41:42] And then the advances in both modeling techniques in compute and data quality, single shot detectors, SSDs has become, uh, really, really popular. One of the biggest SSDs that has become really popular is the YOLO family models, as you described. And so YOLO stands for you only look once. Yeah, right, of course.[00:42:02] Uh, Drake's, uh, other album, um, so Joseph Redman introduces YOLO at the University of Washington. And Joseph Redman is, uh, kind of a, a fun guy. So for listeners, for an Easter egg, I'm gonna tell you to Google Joseph Redman resume, and you'll find, you'll find My Little Pony. That's all I'll say. And so he introduces the very first YOLO architecture, which is a single shot detector, and he also does it in a framework called Darknet, which is like this, this own framework that compiles the Cs, frankly, kind of tough to work with, but allows you to benefit from the speedups that advance when you operate in a low level language like.[00:42:36] And then he releases, well, what colloquially is known as YOLO V two, but a paper's called YOLO 9,000 cuz Joseph Redmond thought it'd be funny to have something over 9,000. So get a sense for, yeah, some fun. And then he releases, uh, YOLO V three and YOLO V three is kind of like where things really start to click because it goes from being an SSD that's very limited to competitive and, and, and superior to actually mobile That and some of these other single shot detectors, which is awesome because you have this sort of solo, I mean, him and and his advisor, Ali, at University of Washington have these, uh, models that are becoming really, really powerful and capable and competitive with these large research organizations.[00:43:09] Joseph Edmond leaves Computer Vision Research, but there had been Alexia ab, one of the maintainers of Darknet released Yola VI four. And another, uh, researcher, Glenn Yer, uh, jocker had been working on YOLO V three, but in a PyTorch implementation, cuz remember YOLO is in a dark implementation. And so then, you know, YOLO V three and then Glenn continues to make additional improvements to YOLO V three and pretty soon his improvements on Yolov theory, he's like, oh, this is kind of its own things.[00:43:36] Then he releases YOLO V five[00:43:38] with some naming[00:43:39] controversy that we don't have Big naming controversy. The, the too long didn't read on the naming controversy is because Glen was not originally involved with Darknet. How is he allowed to use the YOLO moniker? Roe got in a lot of trouble cuz we wrote a bunch of content about YOLO V five and people were like, ah, why are you naming it that we're not?[00:43:55] Um, but you know,[00:43:56] cool. But anyway, so state-of-the-art goes to v8. Is what I gather.[00:44:00] Yeah, yeah. So yeah. Yeah. You're, you're just like, okay, I got V five. I'll skip to the end. Uh, unless, unless there's something, I mean, I don't want, well, so I mean, there's some interesting things. Um, in the yolo, there's like, there's like a bunch of YOLO variants.[00:44:10] So YOLOs become this, like this, this catchall for various single shot, yeah. For various single shot, basically like runs on the edge, it's quick detection framework. And so there's, um, like YOLO R, there's YOLO S, which is a transformer based, uh, yolo, yet look like you only look at one sequence is what s stands were.[00:44:27] Um, the pp yo, which, uh, is PAT Paddle implementation, which is by, which Chinese Google is, is their implementation of, of TensorFlow, if you will. So basically YOLO has like all these variants. And now, um, yo vii, which is Glen has been working on, is now I think kind of like, uh, one of the choice models to use for single shot detection.[00:44:44] World Knowledge of Foundation Models[00:44:44] Well, I think a lot of those models, you know, Asking the first principal's question, like let's say you wanna find like a bus detector. Do you need to like go find a bunch of photos of buses or maybe like a chair detector? Do you need to go find a bunch of photos of chairs? It's like, oh no. You know, actually those images are present not only in the cocoa data set, but those are objects that exist like kind of broadly on the internet.[00:45:02] And so computer visions kind of been like us included, have been like really pushing for and encouraging models that already possess a lot of context about the world. And so, you know, if GB T's idea and i's idea OpenAI was okay, models can only understand things that are in their corpus. What if we just make their corpus the size of everything on the internet?[00:45:20] The same thing that happened in imagery, what's happening now? And that's kinda what Sam represents, which is kind of a new evolution of, earlier on we were talking about the cost of annotation and I said, well, good news. Annotations then become decreasingly necessary to start to get to value. Now you gotta think about it more, kind of like, you'll probably need to do some annotation because you might want to find a custom object, or Sam might not be perfect, but what's about to happen is a big opportunity where you want the benefits of a yolo, right?[00:45:47] Where it can run really fast, it can run on the edge, it's very cheap. But you want the knowledge of a large foundation model that already knows everything about buses and knows everything about shoes, knows everything about real, if the name is true, anything segment, anything model. And so there's gonna be this novel opportunity to take what these large models know, and I guess it's kind of like a form of distilling, like distill them down into smaller architectures that you can use in versatile ways to run in real time to run on the edge.[00:46:13] And that's now happening. And what we're seeing in actually kind of like pulling that, that future forward with, with, with Robo Flow.[00:46:21] Segment Anything Model[00:46:21] So we could talk a bit about, um, about SAM and what it represents maybe into, in relation to like these, these YOLO models. So Sam is Facebook segment Everything Model. It came out last week, um, the first week of April.[00:46:34] It has 24,000 GitHub stars at the time of, of this recording within its first week. And why, what does it do? Segment? Everything is a zero shot segmentation model. And as we're describing, creating masks is a very arduous task. Creating masks of objects that are not already represented means you have to go label a bunch of masks and then train a model and then hope that it finds those masks in new images.[00:47:00] And the promise of Segment anything is that in fact you just pass at any image and it finds all of the masks of relevant things that you might be curious about finding in a given image. And it works remarkably. Segment anything in credit to Facebook and the fair Facebook research team, they not only released the model permissive license to move things forward, they released the full data set, all 11 million images and 1.1 billion segmentation masks and three model sizes.[00:47:29] The largest ones like 2.5 gigabytes, which is not enormous. Medium ones like 1.2 and the smallest one is like 400, 3 75 megabytes. And for context,[00:47:38] for, for people listening, that's six times more than the previous alternative, which, which is apparently open images, uh, in terms of number images, and then 400 times more masks than open[00:47:47] images as well.[00:47:48] Exactly, yeah. So huge, huge order magnitude gain in terms of dataset accessibility plus like the model and how it works. And so the question becomes, okay, so like segment. What, what do I do with this? Like, what does it allow me to do? And it didn't Rob float well. Yeah, you should. Yeah. Um, it's already there.[00:48:04] You um, that part's done. Uh, but the thing that you can do with segment anything is you can almost, like, I almost think about like this, kinda like this model arbitrage where you can basically like distill down a giant model. So let's say like, like let's return to the package example. Okay. The package problem of, I wanna get a text when a package appears on my front porch before segment anything.[00:48:25] The way that I would go solve this problem is I would go collect some images of packages on my porch and I would label them, uh, with bounding boxes or maybe masks in that part. As you mentioned, it can be a long process and I would train a model. And that model it actually probably worked pretty well cause it's purpose-built.[00:48:44] The camera position, my porch, the packages I'm receiving. But that's gonna take some time, like everything that I just mentioned the

covid-19 united states god ceo university california game canada thanksgiving ai english google china internet washington sports san francisco phd miami data fun elon musk dc microsoft nashville safety iphone model bachelor fortune congress robots scotland chatgpt generation boss journal iowa tesla economics idea medical walmart wall street journal boom discord medium curious minneapolis computers roe v wade origin pinterest taiwan gotta snapchat ip sort masks lionel messi cto ability delaware similar coco sciences pioneer slack considerations manual openai venture ux users george washington cs gpt mm cvs object robo github llama slightly disrupt rely monty des moines crispr qa represent yolo ripple browsers gt us congress techcrunch general assembly papa johns taipei gpu unveil altman ssd spac labeling mesh turing deploy hackathons truthfully cuz rivian my little pony ocr spacs tf feng yc darknet zendesk stack overflow computer vision iowans sudoku xml meta ai getaround multimodal pony express des moines register tensorflow future predictions cagr tenderloin readme hacker news annotations cardinal health udacity devrel inaturalist hard problems pytorch replit yaml arkit roble boggle evan spiegel techcrunch disrupt greg brockman gbt foundation models doudna polygons bounding jeff dean robel ben silverman hatchlings apple csam latent space mnist dan gross rofo beyang liu

AI Fundamentals: Benchmarks 101

Latent Space: The AI Engineer Podcast â€” CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 7, 2023 50:38

We're trying a new format, inspired by Acquired.fm! No guests, no news, just highly prepared, in-depth conversation on one topic that will level up your understanding. We aren't experts, we are learning in public. Please let us know what we got wrong and what you think of this new format!When you ask someone to break down the basic ingredients of a Large Language Model, you'll often hear a few things: You need lots of data. You need lots of compute. You need models with billions of parameters. Trust the Bitter Lesson, more more more, scale is all you need. Right?Nobody ever mentions the subtle influence of great benchmarking.LLM Benchmarks mark our progress in building artificial intelligences, progressing from * knowing what words go with others (1985 WordNet)* recognizing names and entities (2004 Enron Emails) * and image of numbers, letters, and clothes (1998-2017 MNIST)* language translation (2002 BLEU → 2020 XTREME)* more and more images (2009 ImageNet, CIFAR)* reasoning in sentences (2016 LAMBADA) and paragraphs (2019 AI2RC, DROP)* stringing together whole sentences (2018 GLUE and SuperGLUE)* question answering (2019 CoQA)* having common sense (2018 Swag and HellaSwag, 2019 WinoGrande)* knowledge of all human tasks and professional exams (2021 MMLU)* knowing everything (2022 BIG-Bench)People who make benchmarks are the unsung heroes of LLM research, because they dream up ever harder tests that last ever shorter periods of time.In our first AI Fundamentals episode, we take a trek through history to try to explain what we have learned about LLM Benchmarking, and what issues we have discovered with them. There are way, way too many links and references to include in this email. You can follow along the work we did for our show prep in this podcast's accompanying repo, with all papers and selected tests pulled out.Enjoy and please let us know what other fundamentals topics you'd like us to cover!Timestamps* [00:00:21] Benchmarking Questions* [00:03:08] Why AI Benchmarks matter* [00:06:02] Introducing Benchmark Metrics* [00:08:14] Benchmarking Methodology* [00:09:45] 1985-1989: WordNet and Entailment* [00:12:44] 1998-2004 Enron Emails and MNIST* [00:14:35] 2009-14: ImageNet, CIFAR and the AlexNet Moment for Deep Learning* [00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference* [00:23:21] 2018-19: Swag and HellaSwag - Common Sense Inference* [00:26:07] Aside: How to Design Benchmarks* [00:26:51] 2021: MMLU - Human level Professional Knowledge* [00:29:39] 2021: HumanEval - Code Generation* [00:31:51] 2020: XTREME - Multilingual Benchmarks* [00:35:14] 2022: BIG-Bench - The Biggest of the Benches* [00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results* [00:38:25] Issue: GPT4 vs the mystery of the AMC10/12* [00:40:28] Issue: Data Contamination* [00:42:13] Other Issues: Benchmark Data Quality and the Iris data set* [00:45:44] Tradeoffs of Latency, Inference Cost, Throughput* [00:49:45] ConclusionTranscript[00:00:00] Hey everyone. Welcome to the Latent Space Podcast. This is Alessio, partner and CTO and residence at Decibel Partners, and I'm joined by my co-host, swyx writer and editor of Latent Space.[00:00:21] Benchmarking Questions[00:00:21] Up until today, we never verified that we're actually humans to you guys. So we'd have one good thing to do today would be run ourselves through some AI benchmarks and see if we are humans.[00:00:31] Indeed. So, since I got you here, Sean, I'll start with one of the classic benchmark questions, which is what movie does this emoji describe? The emoji set is little Kid Bluefish yellow, bluefish orange Puffer fish. One movie does that. I think if you added an octopus, it would be slightly easier. But I prepped this question so I know it's finding Nemo.[00:00:57] You are so far a human. Second one of these emoji questions instead, depicts a superhero man, a superwoman, three little kids, one of them, which is a toddler. So you got this one too? Yeah. It's one of my favorite movies ever. It's the Incredibles. Uh, second one was kind of a letdown, but the first is a.[00:01:17] Awesome. Okay, I'm gonna ramp it up a little bit. So let's ask something that involves a little bit of world knowledge. So when you drop a ball from rest, it accelerates downward at 9.8 meters per second if you throw it downward instead, assuming no air resistance, so you're throwing it down instead of dropping it, it's acceleration immediately after leaving your hand is a 9.8 meters per second.[00:01:38] B, more than 9.8 meters per second. C less than 9.8 meters per second. D cannot say unless the speed of the throw is. I would say B, you know, I started as a physics major and then I changed, but I think I, I got enough from my first year. That is B Yeah. Even proven that you're human cuz you got it wrong.[00:01:56] Whereas the AI got it right is 9.8 meters per second. The gravitational constant, uh, because you are no longer accelerating after you leave the hand. The question says if you throw it downward after leaving your hand, what is the. It is, it goes back to the gravitational constant, which is 9.8 meters per, I thought you said you were a physics major.[00:02:17] That's why I changed. So I'm a human. I'm a human. You're human. You're human. But you, you got them all right. So I can't ramp it up. I can't ramp it up. So, Assuming, uh, the AI got all of that right, you would think that AI will get this one wrong. Mm-hmm. Because it's just predicting the next token, right?[00:02:31] Right. In the complex Z plane, the set of points satisfying the equation. Z squared equals modulars. Z squared is A, a pair points B circle, C, a half line D, online D square. The processing is, this is going on in your head. You got minus three. A line. This is hard. Yes, that is. That is a line. Okay. What's funny is that I think if, if an AI was doing this, it would take the same exact amount of time to answer this as it would every single other word.[00:03:05] Cuz it's computationally the same to them. Right.[00:03:08] Why AI Benchmarks matter[00:03:08] Um, so anyway, if you haven't caught on today, we're doing our first, uh, AI fundamentals episode, which just the two of us, no guess because we wanted to go deep on one topic and the topic. AI benchmarks. So why are we focusing on AI benchmarks? So, GPT4 just came out last week and every time a new model comes out, All we hear about is it's so much better than the previous model on benchmark X, on benchmark Y.[00:03:33] It performs better on this, better on that. But most people don't actually know what actually goes on under these benchmarks. So we thought it would be helpful for people to put these things in context. And also benchmarks evolved. Like the more the models improve, the harder the benchmarks get. Like I couldn't even get one of the questions right.[00:03:52] So obviously they're working and you'll see that. From the 1990s where some of the first ones came out to day, the, the difficulty of them is truly skyrocketed. So we wanna give a, a brief history of that and leave you with a mental model on, okay, what does it really mean to do well at X benchmark versus Y benchmark?[00:04:13] Um, so excited to add that in. I would also say when you ask people what are the ingredients going into a large language model, they'll talk to you about the data. They'll talk to you about the neural nets, they'll talk to you about the amount of compute, you know, how many GPUs are getting burned based on this.[00:04:30] They never talk to you about the benchmarks. And it's actually a shame because they're so influential. Like that is the entirety of how we judge whether a language model is better than the other. Cuz a language model can do anything out of. Potentially infinite capabilities. How do you judge one model versus another?[00:04:48] How do you know you're getting better? And so I think it's an area of intense specialization. Also, I think when. Individuals like us, you know, we sort of play with the language models. We are basically doing benchmarks. We're saying, look, it's, it's doing this awesome thing that I found. Guess what? There have been academics studying this for 20 years who have, uh, developed a science to this, and we can actually benefit from studying what they have done.[00:05:10] Yep. And obviously the benchmarks also drive research, you know, in a way whenever you're working on, in a new model. Yeah. The benchmark kind of constraints what you're optimizing for in a way. Because if you've read a paper and it performs worse than all the other models, like you're not gonna publish it.[00:05:27] Yeah. So in a way, there's bias in the benchmark itself. Yeah. Yeah. We'll talk a little bit about that. Right. Are we optimizing for the right things when we over-optimize for a single benchmark over over some others? And also curiously, when GPT4 was released, they emitted some very. Commonplace industry benchmarks.[00:05:44] So the way that you present yourself, it is a form of marketing. It is a form of trying to say you're better than something else. And, and trying to explain where you think you, you do better. But it's very hard to verify as well because there are certain problems with reproducing benchmarks, uh, especially when you come to large language models.[00:06:02] Introducing Benchmark Metrics[00:06:02] So where do we go from here? Should we go over the, the major concept? Yeah. When it comes to benchmark metrics, we get three main measures. Accuracy, precision, recall accuracy is just looking at how many successful prediction the model does. Precision is the ratio of true positives, meaning how many of them are good compared to the overall amount of predictions made Versus recall is what proportion of the positives were identified.[00:06:31] So if you think. Spotify playlist to maybe make it a little more approachable, precision is looking. How many songs in a Spotify playlist did you like versus recall is looking at of all the Spotify songs that you like in the word, how many of them were put in the in the playlist? So it's more looking at how many of the true positives can you actually bring into the model versus like more focusing on just being right.[00:06:57] And the two things are precision and recall are usually in tension.. If you're looking for a higher position, you wanna have a higher percentage of correct results. You're usually bringing recall down because you lead to kind of like lower response sets, you know, so there's always trade offs. And this is a big part of the benchmarking too.[00:07:20] You know, what do you wanna optimize for? And most benchmarks use this, um, F1 score, which is the harmonic mean of precision and recall. Which is, you know, we'll put it in the show notes, but just like two times, like the, you know, precision Times Recall divided by the sum. So that's one. And then you get the Stanford Helm metrics.[00:07:38] Um, yeah, so ultimately I think we have advanced a lot in the, in the past few decades on how we measure language models. And the most interesting one came out January of this year from Percy Lang's research lab at Stanford, and he's got. A few metrics, accuracy, calibration, robustness, fairness, efficiency, general information bias and toxicity, and caring that your language models are not toxic and not biased.[00:08:03] So is is, mm-hmm. Kind of a new thing because we have solved the other stuff, therefore we get to care about the toxic of, uh, the language models yelling at us.[00:08:14] Benchmarking Methodology[00:08:14] But yeah, I mean, maybe we can also talk about the other forms of how their be. Yeah, there's three main modes. You can need a benchmark model in a zero shot fashion, few shot or fine tune models, zero shots.[00:08:27] You do not provide any example and you're just testing how good the model is at generalizing few shots, you have a couple examples that you provide and then. You see from there how good the model is. These are the number of examples usually represented with a K, so you might see few shots, K equal five, it means five examples were passed, and then fine tune is you actually take a bunch of data and fine tune the model for that specific task, and then you test it.[00:08:55] These all go from the least amount of work required to the most amount of work required. If you're doing zero shots benchmarking, you do not need to have any data, so you can just take 'em out and do. If you're fine tuning it, you actually need a lot of data and a lot of compute time. You're expecting to see much better results from there.[00:09:14] Yeah. And sometimes the number of shots can go up to like a hundred, which is pretty surprising for me to see that people are willing to test these language models that far. But why not? You just run the computer a little bit longer. Yeah. Uh, what's next? Should we go into history and then benchmarks? Yeah.[00:09:29] History of Benchmarking since 1985[00:09:29] Okay, so I was up all night yesterday. I was like, this is a fascinating topic. And I was like, all right, I'll just do whatever's in the G PT three paper. And then I read those papers and they all cited previous papers, and I went back and back and back all the way to 1985. The very first benchmark that I can find.[00:09:45] 1985-1989: WordNet and Entailment[00:09:45] Which is WordNet, which is uh, an English benchmark created in at Princeton University by George Miller and Christian Fellbaum. Uh, so fun fact, Chris George Miller also authored the paper, the Magical Number seven plus Minus two, which is the observation that people have a short term memory of about seven for things.[00:10:04] If you have plus or minus two of seven, that's about all you can sort of remember in the short term, and I just wanted. Say like, this was before computers, right? 1985. This was before any of these personal computers were around. I just wanna give people a sense of how much work manual work was being done by these people.[00:10:22] The database, uh, WordNet. Sorry. The WordNet database contains 155,000 words organized in 175,000 sys. These sys are basically just pairings of nouns and verbs and adjectives and adverbs that go together. So in other words, for example, if you have nouns that are hyper names, if every X is a, is a kind of Y.[00:10:44] So a canine is a hyper name of a dog. It's a holo. If X is a part of Y, so a building is a hollow name of a window. The most interesting one for in terms of formal, uh, linguistic logic is entailment, which captures the relationship between two words, where the verb Y is entailed by X. So if by doing X, you must be doing Y.[00:11:02] So in other words, two, sleep is entailed by two snore because you cannot snore without also sleeping and manually mapping 155,000 words like that, the relationships between all of them in a, in a nested tree, which is. Incredible to me. Mm-hmm. And people just did that on faith. They were like, this will be useful somehow.[00:11:21] Right. Uh, and they were interested in cycle linguistics, like understanding how humans thought, but then it turned out that this was a very good dataset for understanding semantic similarity, right? Mm-hmm. Like if you measure the distance between two words by traversing up and down the graph, you can find how similar to two words are, and therefore, Try to figure out like how close they are and trade a model to, to predict that sentiment analysis.[00:11:42] You can, you can see how far something is from something that is considered a good sentiment or a bad sentiment or machine translation from one language to the other. Uh, they're not 200 word languages, which is just amazing. Like people had to do this without computers. Penn Tree Bank, I was in 1989, I went to Penn, so I always give a shout out to my university.[00:12:01] This one expanded to 4.5 million words of text, which is every uh, wall Street Journal. For three years, hand collected, hand labeled by grad students your tuition dollars at work. So I'm gonna skip forward from the eighties to the nineties. Uh, NYS was the most famous data set that came out of this. So this is the, uh, data set of 60,000.[00:12:25] Training images of, uh, of numbers. And this was the first visual dataset where, uh, people were tr tracking like, you know, handwritten numbers and, and mapping them to digital numbers and seeing what the error rate for them was. Uh, these days I think this can be trained in like e every Hello world for machine learning is just train missed in like four lanes of code.[00:12:44] 1998-2004 Enron Emails and MNIST[00:12:44] Then we have the Enron email data set. Enron failed in 2001. Uh, the emails were released in 2004 and they've been upgraded every, uh, every few years since then. That is 600,000 emails by 150 senior employees of Enron, which is really interesting because these are email people emailing each other back and forth in a very natural.[00:13:01] Context not knowing they're being, they're about to be observed, so you can do things like email classification, email summarization, entity recognition and language modeling, which is super cool. Any thoughts about that be before we go into the two thousands? I think like in a way that kind of puts you back to the bias, you know, in some of these benchmarks, in some of these data sets.[00:13:21] You know, like if your main corpus of benchmarking for entity recognition is a public energy company. Mm-hmm. You know, like if you're building something completely different and you're building a model for that, maybe it'll be worse. You know, you start to see how we started. With kind of like, WordNet is just like human linguistics, you know?[00:13:43] Yes. It's not domain related. And then, um, same with, you know, but now we're starting to get into more and more domain-specific benchmarks and you'll see this increase over time. Yeah. NY itself was very biased towards, um, training on handwritten letter. Uh, and handwritten numbers. So, um, in 2017 they actually extended it to Eist, which is an extended to extension to handwritten letters that seems very natural.[00:14:08] And then 2017, they also had fashion ness, which is a very popular data set, which is images of clothing items pulled from Zando. So you can see the capabilities of computer vision growing from single digit, 0, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, to all the letters of the alphabet. To now we can recognize images, uh, of fashion, clothing items.[00:14:28] So it's pretty. So the big one for deep learning, cuz all of that was just, just the appetizers, just getting started.[00:14:35] 2009-2014 : ImageNet, CIFAR and the AlexNet Moment for Deep Learning[00:14:35] The big one for deep learning was ImageNet, which is where Fafa Lee came into the picture and that's why she's super well known. She started working in 2006 and released it in 2009. Fun fact, she actually met with, uh, Christian Feldbaum, who was, uh, one of the co-authors of, uh, war.[00:14:51] To create ImageNet. So there's a direct lineage from Words to Images. Yeah. And uh, they use Amazon Mechanical Turk to help with classification images. No longer grad students. But again, like I think, uh, this goes, kind of goes back to your observation about bias, like when I am a mechanical Turk worker. And I'm being paid by the image to classify an image.[00:15:10] Do you think I'll be very careful at my job? Right? Yeah. Whereas when I'm a, you know, Enron employee, emailing my, my fellow coworker, trying to just communicate something of, of natural language that is a different type of, uh, environment. Mm-hmm. So it's a pretty interesting benchmark. So it was released in 2009 ish and, you know, people were sort of competing to recognize and classify that properly.[00:15:33] The magic moment for ImageNet came in 2012, uh, which is called the AlexNet moment cuz I think that grad student that, um, created this recognition model was, uh, named Alex, I forget his last name, achieved a error rate of 15%, which is, More than 10% lower than the runner up. So it was used just so much better than the second place that everyone else was like, what are you doing?[00:15:54] Uh, and it turned out that he was, he was the first to use, uh, deep learning, uh, c n n 10 percentage points. So like 15 and the other one was 25. Yeah, exactly. So it was just so much, so much better than the others. It was just unbelievable that no one else was, no other approach was even coming close.[00:16:09] Therefore, everyone from there on out for the next, until today we're just learning the lessons of deep learning because, um, it is so much superior to the other approaches. And this was like a big. Images and visual moment because then you had like a sci-fi 10, which is a, another, like a data set that is mostly images.[00:16:27] Mm-hmm. Focused. Mm-hmm. So it took a little bit before we got back to to text. And nowadays it feels like text, you know, text models are kind of eating the word, you know, we're making the text one multi-model. Yeah. So like we're bringing the images to GBT four instead of the opposite. But yeah, in 2009 we had a, another 60,000 images that set.[00:16:46] 32 by 32. Color images with airplanes, automobiles, like, uh, animals, like all kind of stuff. Like I, I think before we had the numbers, then we had the handwritten letters. Then we had clothing, and then we finally made clothing items came after, oh, clothing items. 2009. Yeah, this is 2009. I skipped, I skipped time a little bit.[00:17:08] Yeah, yeah. But yeah, CFR 10 and CFR 100. CFR 10 was for 10 classes. And that that was chosen. And then obviously they optimized that and they were like, all right, we need a new problem now. So in 20 14, 5 years later, they introduced CFAR 100, which was a hundred classes of other items. And I think this is a very general pattern, which is used.[00:17:25] You create a data set for a specific be. You think it's too hard for machines? Mm-hmm. It lasts for five years before it's no longer too hard for machines, and you have to find a new data set and you have to extend it again. So it's Similarly, we are gonna find that in glue, which is another, which is one of more modern data sets.[00:17:42] 2018-19: GLUE and SuperGLUE - Single Sentence, Similarity and Paraphrase, Inference[00:17:42] This one came out in 2018. Glue stands for general Language Understanding Evaluation. This is one of the most influential, I think, early. Earlier, um, language model benchmarks, and it has nine tasks. Um, so it has single sentence tasks, similarity and paraphrase tasks and inference tasks. So a single sentence task, uh, would be something like, uh, the Stanford Sentiment Tree Bank, which is a.[00:18:05] Uh, sentences from movie reviews and human annotations of the sentiment, whether it's positive or negative, in a sort of like a four point scale. And your job is to predict the task of a single sentence. This similarity task would involve corpuses, like the Microsoft research paraphrase corpus. So it's a corpus of sentence pairs automatically extracted from online news sources with human annotations for whether or not the sentence is in the para semantically equivalent.[00:18:28] So you just predict true or false and again, Just to call back to the math that we did earlier in this episode, the classes here are imbalance. This data set, for example, is 68% positive. So we report both accuracy and F1 scores. F1 is a more balanced approach because it, it adjusts for, uh, imbalanced, um, data sets.[00:18:48] Mm-hmm. Yeah. And then finally, inference. Inference is the one where we really start to have some kind of logic. So for example, the M N L I. Um, actually I'm, I'm gonna focus on squad, the Stanford questioning question answering dataset. It's another data set of pairs, uh, questions, uh, uh, p question paragraphs, pairs.[00:19:04] So where one of the sentences of the paragraph drawn from Wikipedia contains the answer to the corresponding question, we convert the task into a sentence, para classification by forming a pair between each question in each sentence into corresponding context and filtering out pairs of low overlap. So basically annotating whether or not.[00:19:20] Is the answer to the question inside of this paragraph that I pulled. Can you identify that? And again, like Entailment is kind of included inside of each of these inference tasks because it starts to force the language model to understand whether or not one thing implies the other thing. Mm-hmm. Yeah.[00:19:37] And the, the models evolving. This came out in 2018, lasted one year exactly. One year later, people were like, that's too easy. That's too easy. So in 2019, they actually came out with super. I love how you'll see later with like swag and hella swag. It's like they come up with very good names for these things.[00:19:55] Basically what's super glue dead is stick glue and try and move outside of the single sentence evaluation. So most of the tasks that. Sean was talking about focus on one sentence. Yeah, one sentence, one question. It's pretty straightforward in that way. Superglue kind of at the, so one, it went from single sentence to having some multi sentence and kind of like a context driven thing.[00:20:21] So you might have questions where, The answer is not in the last paragraph that you've read. So it starts to test the, the context window on this model. Some of them are more, in order to know the answer, you need to know what's not in the question kind of thing. So like you may say, Hey, this drink is owned by the Coca-Cola company.[00:20:43] Is this a Pepsi product? You know, so you need to make the connection false. Exactly, yeah. Then you have also like, um, embedded clauses. So you have things that are not exactly said, have to be inferred, and like a lot of this stack is very conversational. So some of the example contain a lot of the, um, um, you know, or this question's very hard to read out.[00:21:07] Yeah, I know. It's like, it sounds like you are saying, um, but no, you're actually, you're actually. And yet I hope to see employer base, you know, helping out child, um, care centers at the place of employment, things like that, that will help out. It's kind of hard to even read it. And then the hypothesis is like they're setting a trend.[00:21:27] It's going from something very simple like a big p d extract to something that is more similar to how humans communicate. Transcripts, like audio transcripts. Exactly. Of how people talk. Yeah. And some of them are also, Plausibility. You know, like most of these models have started to get good at understanding like a clear cause, kind of like a.[00:21:48] You know, cause effect things. But some of the plausible ones are like, for example, this one is a copa. They're called choice of plausible alternatives. The premises, my body cast a shadow over the grass. What's the cost for this alternative? One, the sun was rising. Alternative to the grass was cut.[00:22:07] Obviously it's the sun was rising, but nowhere. In the question we're actually mentioning the sun, uh, we are mentioning the grass. So some models, some of the older models might see the grass and make the connection that the grass is part of the reason, but the models start to get better and better and go from simply looking at the single sentence context to a more of a, a word new, uh, word knowledge.[00:22:27] It's just really impressive, like the fact that. We can expect that out of a model. It still blows my mind. I think we should not take it for granted that when we're evaluating models, we're asking questions like this that is not obvious from just the given text itself. Mm-hmm. So it, it is just coming with a memorized view of the world, uh, or, or world knowledge. And it understands the premise on, on some form. It is not just random noise. Yeah, I know. It's really impressive. This one, I actually wanted multi rc I actually wanted to spring on you as a, as a test, but it's just too long to read. It's just like a very long logic question.[00:23:03] And then it'll ask you to do, uh, comprehension. But uh, yeah, we'll just, we'll just kinda skip that. We'll put it, we'll put it in the show notes, and then you have to prove us that you're a human. Send us the answer exactly. Exactly and subscribe to the podcast. So superglue was a lot harder, and I think also was superseded eventually, pretty soon.[00:23:21] 2018-2019: Swag and HellaSwag - Common Sense Inference[00:23:21] And, uh, yeah, then we started coming onto the more recent cohort of tests. I don't know how to introduce the rest. Uh, there, there are just so many tests here that I, I struggle a little bit picking from these. Uh, but perhaps we can talk about swag and heli swyx since you mentioned it. Yeah. So SWAG stands for situations with Adversarial Generations.[00:23:39] Uh, also came out in 2018, but this guy, zes Etal, likes to name his data sets and his benchmarks in a very memorable way. And if you look at the PDF of the paper, he also has a little icon, uh, image icon for swag. And he doesn't just go by, uh, regular language. So he definitely has a little bit of branding to this and it's.[00:24:00] Part. So I'll give you an example of the kind of problems that swyx poses. Uh, it it is focused on common sense inference. So what's common sense inference? So, for example, given a partial description, like she opened the hood of the car, humans can reason about the situation and anticipate what might come next.[00:24:16] Then she examined the engine. So you're supposed to pick based on what happened in the first part. What is most likely to happen in the second part based on the, uh, multiple choice question, right? Another example would be on stage, a woman takes a seat at the piano. She a, sits on a bench as her sister plays at the doll.[00:24:33] B. Smiles with someone as the music play. C is in the crowd watching the dancers. D nervously set her fingers on the keys, so A, B, C, or D. It's not all of them are plausible. When you look at the rules of English, we're we've, we're not even checking for whether or not produces or predicts grammatical English.[00:24:54] We're checking for whether the language model can correctly pick what is most likely given the context. The only information that you're given is on stage. A woman takes a seat at the piano, what is she most likely to do next? And D makes sense. It's arguable obviously. Sometimes it could be a. In common sense, it's D.[00:25:11] Mm-hmm. So we're training these models to have common. Yeah, which most humans don't have. So it's a, it's already a step up. Obviously that only lasted a year. Uh, and hello, SWAG was no longer, was no longer challenging in 2019, and they started extending it quite a lot more, a lot more questions. I, I forget what, how many questions?[00:25:33] Um, so Swag was a, swag was a data set. A hundred thousand multiple choice questions. Um, and, and part of the innovation of swag was really that you're generating these questions rather than manually coming up with them. Mm-hmm. And we're starting to get into not just big data, but big questions and big benchmarks of the, of the questions.[00:25:51] That's where the adversarial generations come in, but how that swag. Starts pulling in from real world questions and, and data sets like, uh, wikiHow and activity net. And it's just really, you know, an extension of that. I couldn't even add examples just cuz there's so many. But just to give you an idea of, uh, the progress over time.[00:26:07] Aside: How to Design Benchmarks[00:26:07] Most of these benchmarks are, when they're released, they set. Benchmark at a level where if you just randomly guessed all of the questions, you'll get a 25%. That's sort of the, the baseline. And then you can run each of the language models on them, and then you can run, uh, human evaluations on them. You can have median evaluations, and then you have, um, expert evaluations of humans.[00:26:28] So the randoms level was, uh, for halla. swyx was 20. GT one, uh, which is the, uh, 2019 version that got a 41 on the, on the Hello Sue X score. Bert from Google, got 47. Grover, also from Google, got 57 to 75. Roberta from Facebook, got 85 G P T, 3.5, got 85, and then GPT4 got 95 essentially solving hello swag. So this is useless too.[00:26:51] 2021 - MMLU - Human level Professional Knowledge[00:26:51] We need, we need super Hell now's use this. Super hell swyx. I think the most challenging one came from 2021. 2021 was a very, very good year in benchmarking. So it's, we had two major benchmarks that came out. Human eval and M M L U, uh, we'll talk about mm. M L U first, cuz that, that's probably the more, more relevant one.[00:27:08] So M M L U. Stands for measuring mul massive multitask language understanding, just by far the biggest and most comprehensive and most human-like, uh, benchmark that we've had for until 2021. We had a better one in 2022, but we'll talk about that. So it is a test that covers 57 tasks, including elementary, math, US history, computer science law, and more.[00:27:29] So to attain high accuracy on this task, models must possess extensive world knowledge and prop problem solving. Its. Includes practice questions for the GRE test and the U United States, um, m l e, the medical exam as. It also includes questions from the undergrad courses from Oxford, from all the way from elementary high school to college and professional.[00:27:49] So actually the opening question that I gave you for this podcast came from the math test from M M L U, which is when you drop a ball from rest, uh, what happens? And then also the question about the Complex Z plane, uh, but it equally is also asking professional medicine question. So asking a question about thyroid cancer and, uh, asking you to diagnose.[00:28:10] Which of these four options is most likely? And asking a question about microeconomics, again, giving you a, a situation about regulation and monopolies and asking you to choose from a list of four questions. Mm-hmm. Again, random baseline is 25 out of 100 G P T two scores, 32, which is actually pretty impressive.[00:28:26] GT three scores between 43 to 60, depending on the the size. Go. Scores 60, chinchilla scores 67.5, GT 3.5 scores, 70 GPT4 jumps, one in 16 points to 86.4. The author of M M L U, Dan Hendrix, uh, was commenting on GPT4 saying this is essentially solved. He's basically says like, GT 4.5, the, the next incremental improvement on GPT4 should be able to reach expert level human perform.[00:28:53] At which point it is passing simultaneously, passing all the law exams, all the medical exams, all the graduate student exams, every single test from AP history to computer science to. Math to physics, to economics. It's very impressive. Yeah. And now you're seeing, I mean, it's probably unrelated, but Ivy League universities starting to drop the a t as a requirement for getting in.[00:29:16] So yeah. That might be unrelated as well, because, uh, there's a little bit of a culture war there with regards to, uh, the, the inherent bias of the SATs. Yeah. Yeah. But I mean, that's kinda, I mean exactly. That's kinda like what we were talking about before, right? It's. If a model can solve all of these, then like how good is it really?[00:29:33] How good is it as a Exactly. Telling us if a person should get in. It captures it. Captures with just the beginning. Yeah. Right.[00:29:39] 2021: HumanEval - Code Generation[00:29:39] Well, so I think another significant. Benchmark in 2021 was human eval, which is, uh, the first like very notable benchmark for code code generation. Obviously there's a, there's a bunch of research preceding this, but this was the one that really caught my eye because it was simultaneously introduced with Open Eyes Codex, which is the code generation model, the version of G P T that was fine tuned for generating code.[00:30:02] Uh, and that is, Premise of, well, there is the origin or the the language model powering GitHub co-pilot and yeah, now we can write code with language models, just with that, with that benchmark. And it's good too. That's the other thing, I think like this is one where the jump from GT 3.5 to GPT4 was probably the biggest, like GT 3.4 is like 48% on. On this benchmark, GPT4 is 67%. So it's pretty big. Yeah. I think coders should rest a little bit. You know, it's not 90 something, it's, it's still at 67, but just wait two years. You know, if you're a lawyer, if you're a lawyer, you're done. If you're a software engineer, you got, you got a couple more years, so save your money.[00:30:41] Yeah. But the way they test it is also super creative, right? Like, I think maybe people don't understand that actually all of the tests that are given here are very intuitive. Like you. 90% of a function, and then you ask the language model to complete it. And if it completes it like any software engineer would, then you give it a win.[00:31:00] If not, you give it a loss, run that model 164 times, and that is human eval. Yeah. Yeah. And since a lot of our listeners are engineers too, I think the big thing here is, and there was a, a link that we had that I missed, but some of, for example, some of. Coding test questions like it can answer older ones very, very well.[00:31:21] Like it doesn't not answer recent ones at all. So like you see some of like the data leakage from the training, like since it's been trained on the issues, massive data, some of it leaks. So if you're a software engineer, You don't have to worry too much. And hopefully, especially if you're not like in the JavaScript board, like a lot of these frameworks are brand new every year.[00:31:41] You get a lot of new technologies. So there's Oh, there's, oh yeah. Job security. Yes, exactly. Of course. Yeah. You got a new, you have new framework every year so that you have job security. Yeah, exactly. I'll sample, uh, data sets.[00:31:51] 2020 - XTREME - Multilingual Benchmarks[00:31:51] So before we get to big bench, I'll mention a couple more things, which is basically multilingual benchmarks.[00:31:57] Uh, those are basically simple extensions of monolingual benchmarks. I feel like basical. If you can. Accurately predicts the conversion of one word or one part of the word to another part of the word. Uh, you get a score. And, and I think it's, it's fairly intuitive over there. Uh, but I think the, the main benchmarks to know are, um, extreme, which is the, uh, x the x lingual transfer evaluation, the multilingual encoders, and much prefer extreme.[00:32:26] I know, right? Uh, that's why, that's why they have all these, uh, honestly, I think they just wanted the acronym and then they just kinda worked backwards. And then the other one, I can't find it in my notes for, uh, what the other multilingual ones are, but I, I just think it's interesting to always keep in mind like what the other.[00:32:43] Language capabilities are like, one language is basically completely equivalent to another. And I think a lot of AI ethicists or armchair AI ethicists are very angry that, you know, most of the time we optimize for English because obviously that has, there's the most, uh, training corpuses. I really like extreme the work that's being done here, because they took a, a huge amount of effort to make sure they cover, uh, sparse languages like the, the less popular ones.[00:33:06] So they had a lot of, uh, the, the, obviously the, the popular. Uh, the world's top languages. But then they also selected to maximize language diversity in terms of the complete diversity in, uh, human languages like Tamil Telugu, maam, and Sohi and Yoruba from Africa. Mm-hmm. So I just thought like that kind of effort is really commendable cuz uh, that means that the rest of the world can keep up in, in this air race.[00:33:28] Right. And especially on a lot of the more human based things. So I think we talked about this before, where. A lot of Israel movies are more[00:33:36] focused on culture and history and like are said in the past versus a lot of like the Western, did we talk about this on the podcast? No, not on the podcast. We talked and some of the Western one are more focused on the future and kind of like what's to come.[00:33:48] So I feel like when you're, some of the benchmarks that we mentioned before, you know, they have movie reviews as like, uh, one of the. One of the testing things. Yeah. But there's obviously a big cultural difference that it's not always captured when you're just looking at English data. Yeah. So if you ask the a motto, it's like, you know, are people gonna like this movie that I'm writing about the future?[00:34:10] Maybe it's gonna say, yeah, that's a really good idea. Or if I wanna do a movie about the past, it's gonna be like maybe people want to hear about robots. But that wouldn't be the case in, in every country. Well, since you and I speak different languages, I speak Chinese, you speak Italian, I'm sure you've tested the Italian capabilities.[00:34:29] What do you think? I think like as. Italy, it's so much more, um, dialect driven. So it can be, it can be really hard. So what kind of Italian does g PT three speak? Actually Italian, but the reality is most people have like their own, their own like dialect. So it would be really hard for a model to fool. An Italian that it's like somebody from where they are, you know?[00:34:49] Yeah. Like you can actually tell if you're speaking to AI bot in Chinese because they would not use any of the things that human with humans would use because, uh, Chinese humans would use all sorts of replacements for regular Chinese words. Also, I tried one of those like language tutor things mm-hmm.[00:35:06] That people are making and they're just not good Chinese. Not colloquial Chinese, not anything that anyone would say. They would understand you, but they were from, right, right.[00:35:14] 2022: BIG-Bench - The Biggest of the Benches[00:35:14] So, 2022, big bench. This was the biggest of the biggest, of the biggest benchmarks. I think the, the main pattern is really just, Bigger benchmarks rising in opposition to bigger and bigger models.[00:35:27] In order to evaluate these things, we just need to combine more and more and way more tasks, right? Like swag had nine tasks, hello swag had nine more tasks, and then you're, you're just adding and adding and adding and, and just running a battery of tasks all over. Every single model and, uh, trying to evaluate how good they are at each of them.[00:35:43] Big bench was 204 tasks contributed by 442 authors across 132 institutions. The task topics are diverse, drawing from linguistics, childhood development, math, common sense reasoning, biology, physics, social bias, software development, and beyond. I also like the fact that these authors also selected tasks that are not solved by current language models, but also not solvable by memorizing the internet, which is mm-hmm.[00:36:07] Tracking back to a little bit of the issues that we're, we're gonna cover later. Right. Yeah. I think that's, that's super interesting. Like one of, some of the examples would include in the following chess position, find a checkmate, which is, some humans cannot do that. What is the name of the element within a topic number of six?[00:36:22] Uh, that one you can look up, right? By consulting a periodic table. We just expect language models to memorize that. I really like this one cuz it's, uh, it's inherent. It's, uh, something that you can solve.[00:36:32] Identify whether this sentence has an anachronism. So, option one. During the Allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his radio.[00:36:41] And in option two, during the allied bombardment of the beaches of Iwojima, Ralph spoke loudly into his iPhone. And you have to use context of like when iPhone, when Ally bombarding. Mm-hmm. And then sort of do math to like compare one versus the other and realize that okay, this one is the one that's out of place.[00:36:57] And that's asking more and more and more of the language model to do in implicitly, which is actually modeling what we do when we listen to language, which is such a big. Gap. It's such a big advancement from 1985 when we were comparing synonyms. Mm-hmm. Yeah, I know. And it's not that long in the grand scheme of like humanity, you know, like it's 40 years.[00:37:17] It's crazy. It's crazy. So this is a big missing gap in terms of research. Big benches seems like the most comprehensive, uh, set of benchmarks that we have. But it is curiously missing from Gypsy four. Mm-hmm. I don't know. On paper, for code, I only see Gopher two 80. Yeah. On it. Yeah. Yeah. It could be a curious emission because it maybe looks.[00:37:39] Like it didn't do so well.[00:37:40] EDIT: Why BIG-Bench is missing from GPT4 Results[00:37:40] Hello, this is Swyx from the editing room sometime in the future. I just wanted to interject that. Uh, we now know why the GPT for benchmark results did not include the big bench. Benchmark, even though that was the state-of-the-art benchmark at the time. And that's because the. Uh, GPC four new the Canary G U I D of the big bench.[00:38:02] Benchmark. Uh, so Canary UID is a random string, two, six[00:38:08] eight six B eight, uh, blah, blah, blah. It's a UID. UID, and it should not be knowable by the language model. And in this case it was therefore they had to exclude big bench and that's. And the issue of data contamination, which we're about to go into right now.[00:38:25] Issue: GPT4 vs the mystery of the AMC10/12[00:38:25] And there's some interesting, if you dive into details of GPT4, there's some interesting results in GPT4, which starts to get into the results with benchmarking, right? Like so for example, there was a test that GPT4 published that is very, very bizarre to everyone who is even somewhat knowledgeable.[00:38:41] And this concerns the Ammc 10 and AMC 12. So the mc. Is a measure of the American math 10th grade student and the AMC12 is a, uh, is a measure of the American 12th grade student. So 12 is supposed to be harder than 10. Because the students are supposed to be older, it's, it's covering topics in algebra, geometry number, theory and combinatorics.[00:39:04] GPT4 scored a 30 on AMC10 and scored a 60 on AMC12. So the harder test, it got twice as good, and 30 was really, really bad. So the scoring format of AMC10. It is 25 questions. Each correct answer is worth six points. Each incorrect answer is worth 1.5 points and unanswered questions receive zero points.[00:39:25] So if you answer every single question wrong, you will get more than GPT4 got on AMC10. You just got everything wrong. Yeah, it's definitely better in art medics, you know, but it's clearly still a, a long way from, uh, from being even a high school student. Yeah. There's a little bit of volatility in these results and it, it shows that we, it's not quite like machine intelligence is not the same, or not linearly scaling and not intuitive as human intelligence.[00:39:54] And it's something that I think we should be. Aware of. And when it freaks out in certain ways, we should not be that surprised because Yeah, we're seeing that. Yeah. I feel like part of it is also human learning is so structured, you know, like you learn the new test, you learn the new test, you learn the new test.[00:40:10] But these models, we kind of throw everything at them all at once, you know, when we train them. So when, when the model is strained, are you excusing the model? No, no, no. I'm just saying like, you know, and you see it in everything. It's like some stuff. I wonder what the percentage of. AMC 10 versus AMC 12.[00:40:28] Issue: Data Contamination[00:40:28] Content online is, yes. This comes in a topic of contamination and memorization. Right. Which we can get into if we, if we, if we want. Yeah. Yeah, yeah. So, uh, we're getting into benchmarking issues, right? Like there's all this advancements in benchmarks, uh, language models. Very good. Awesome. Awesome, awesome. Uh, what are the problems?[00:40:44] Uh, the problem is that in order to train these language models, we are scraping the vast majority of the internet. And as time passes, the. Of previous runs of our tests will be pasted on the internet, and they will go into the corpus and the leg model will be memorizing them rather than reasoning them from first principles.[00:41:02] So in, in the machine, classic machine learning parlance, this would be overfitting mm-hmm. Uh, to the test rather than to the generalizing to the, uh, the results that we really want. And so there's an example of, uh, code forces as well also discovered on GPT4. So Code Forces has annual vintages and there was this guy, uh, C H H Halle on Twitter who ran GPT4 on pre 2021 problems, solved all of them and then ran it on 2022 plus problems and solved zero of them.[00:41:31] And we know that the cutoff for GPT4 was 2021. Mm-hmm. So it just memorized the code forces problems as far as we can tell. And it's just really bad at math cuz it also failed the mc 10 stuff. Mm-hmm. It's actually. For some subset of its capabilities. I bet if you tested it with GPT3, it might do better, right?[00:41:50] Yeah. I mean, this is the, you know, when you think about models and benchmarks, you can never take the benchmarks for what the number says, you know, because say, you know, you're focusing on code, like the benchmark might only include the pre 2021 problems and it scores great, but it's actually bad at generalizing and coming up with new solutions.[00:42:10] So, yeah, that, that's a. Big problem.[00:42:13] Other Issues: Benchmark Data Quality and the Iris data set[00:42:13] Yeah. Yeah. So bias, data quality, task specificity, reproducibility, resource requirements, and then calibrating confidence. So bias is, is, is what you might think it is. Basically, there's inherent bias in the data. So for example, when you think about doctor, do you think about a male doctor, a female doctor, in specifically an image net?[00:42:31] Businessmen, white people will be labeled businessmen, whereas Asian businessmen will be labeled Asian businessmen and that can reinforce harmful serotypes. That's the bias issue. Data quality issue. I really love this one. Okay, so there's a famous image data set we haven't talked about called the pedals or iris.[00:42:47] Iris dataset mm-hmm. Contains measurements of, uh, of, uh, length with petal length and petal with, uh, three different species of iris, iris flowers, and they have labeling issues in. So there's a mini, there's a lowest level possible error rate because the error rate exists in the data itself. And if you have a machine learning model that comes out with better error rate than the data, you have a problem cuz your machine learning model is lying to you.[00:43:12] Mm-hmm. Specifically, there's, we know this for a fact because especially for Iris flowers, the length should be longer than the, than the width. Um, but there. Number of instances in the data set where the length was shorter than the, than the width, and that's obviously impossible. So there was, so somebody made an error in the recording process.[00:43:27] Therefore if your machine learning model fits that, then it's doing something wrong cuz it's biologically impossible. Mm-hmm. Task specificity basically if you're overfitting to, to one type of task, for example, answering questions based on a single sentence or you're not, you know, facing something real world reproducibility.[00:43:43] This one is actually, I guess, the fine details of machine learning, which people don't really like to talk about. There's a lot. Pre-processing and post-processing done in I Python notebooks. That is completely un versions untested, ad hoc, sticky, yucky, and everyone does it differently. Therefore, your test results might not be the same as my test results.[00:44:04] Therefore, we don't agree that your scores are. The right scores for your benchmark, whereas you're self reporting it every single time you publish it on a, on a paper. The last two resource requirements, these are, these are more to do with GPTs. The larger and larger these models get, the harder, the more, more expensive it is to run some.[00:44:22] And some of them are not open models. In other words, they're not, uh, readily available, so you cannot tell unless they run it themselves on, on your benchmark. So for example, you can't run your GPT3, you have to kind of run it through the api. If you don't have access to the API like GPT4, then you can't run it at all.[00:44:39] The last one is a new one from GPT4's Paper itself. So you can actually ask the language models to expose their log probabilities and show you how confident they think they are in their answer, which is very important for calibrating whether the language model has the right amount of confidence in itself and in the GPT4 people. It. They were actually very responsible in disclosing that They used to have about linear correspondence between the amount of confidence and the amount of times it was right, but then adding R L H F onto GPT4 actually skewed this prediction such that it was more confident than it should be. It was confidently incorrect as as people say.[00:45:18] In other words, hallucinating. And that is a problem. So yeah, those are the main issues with benchmarking that we have to deal with. Mm-hmm. Yeah, and a lot of our friends, our founders, we work with a lot of founders. If you look at all these benchmarks, all of them just focus on how good of a score they can get.[00:45:38] They don't focus on what's actually feasible to use for my product, you know? So I think.[00:45:44] Tradeoffs of Latency, Inference Cost, Throughput[00:45:44] Production benchmarking is something that doesn't really exist today, but I think we'll see the, the rise off. And I think the main three drivers are one latency. You know, how quickly can I infer the answer cost? You know, if I'm using this model, how much does each call cost me?[00:46:01] Like is that in line with my business model I, and then throughput? I just need to scale these models to a lot of questions on the ones. Again, I just do a benchmark run and you kind of come up. For quadrants. So if on the left side you have model size going from smallest to biggest, and on the X axis you have latency tolerance, which is from, I do not want any delay to, I'll wait as long as I can to get the right answer.[00:46:27] You start to see different type of use cases, for example, I might wanna use a small model that can get me an answer very quickly in a short amount of time, even though the answer is narrow. Because me as a human, maybe I'm in a very iterative flow. And we have Varun before on the podcast, and we were talking about a kind of like a acceleration versus iteration use cases.[00:46:50] Like this is more for acceleration. If I'm using co-pilot, you know, the code doesn't have to be a hundred percent correct, but it needs to happen kind of in my flow of writing. So that's where a model like that would be. But instead, other times I might be willing, like if I'm asking it to create a whole application, I'm willing to wait one hour, you know, for the model to get me a response.[00:47:11] But you don't have, you don't have a way to choose that today with most models. They kind of do just one type of work. So I think we're gonna see more and more of these benchmark. Focus on not only on the research side of it, which is what they really are today when you're developing a new model, like does it meet the usual standard research benchmarks to having more of a performance benchmark for production use cases?[00:47:36] And I wonder who's gonna be the first company that comes up with, with something like this, but I think we're seeing more and more of these models go from a research thing to like a production thing. And especially going from companies like. Google and Facebook that have kinda unlimited budget for a lot of these things to startups, starting to integrate them in the products.[00:48:00] And when you're on a tight budget paying, you know, 1 cent per thousand tokens or 0.10 cent for a thousand tokens, like it's really important. So I think that's, um, that's what's missing to get a lot of these things to productions. But hopefully we, we see them.[00:48:16] Yeah, the software development lifecycle I'm thinking about really is that most people will start with large models and then they will prototype with that because that is the most capable ones.[00:48:25] But then as they put more and more of those things in production, people always want them to run faster and faster and faster and cheaper. So you will distill towards a more domain specific model, and every single company that puts this into production, we'll, we'll want something like that, but I, I think it's, it's a reasonable bet because.[00:48:41] There's another branch of the AI builders that I see out there who are build, who are just banking on large models only. Mm-hmm. And seeing how far they can stretch them. Right. With building on AI agents that can take arbitrarily long amounts of time because they're saving you lots of, lots of time with, uh, searching the web for you and doing research for you.[00:48:59] And I think. I'm happy to wait for Bing for like 10 seconds if it does a bunch of searches for median. Mm-hmm. Just ends with, ends with the right, right result. You know, I was, I was tweeting the other day that I wanted an AI enabled browser because I was seeing this table, uh, there was an image and I just needed to screenshot an image and say, plot this on a chart for me.[00:49:17] And I just wanted to do that, but it would have to take so many steps and I would be willing to wait for a large model to do that for me. Mm-hmm. Yeah. I mean, web development so far has been, Reduce, reduce, reduce the loading times. You know, it's like first we had the, I don't know about that. There, there are people who disagree.[00:49:34] Oh. But I, I think, like if you think about, you know, the CDN and you think about deploying things at the edge, like the focus recently has been on lowering the latency time versus increasing it.[00:49:45] Conclusion[00:49:45] Yeah. So, well that's the, that's Benchmark 1 0 1. Um. Let us know how we, how you think we did. This is something we're trying for the first time.[00:49:52] We're very inspired by other podcasts that we like where we do a bunch of upfront prep, but then it becomes a single topical episode that is hopefully a little bit more timeless. We don't have to keep keeping up with the news. I think there's a lot of history that we can go back on and. Deepen our understanding of the context of all these evolutions in, uh, language models.[00:50:12] Yeah. And if you have ideas for the next, you know, 1 0 1 fundamentals episode, yeah, let us know in the, in the comments and we'll see you all soon. Bye. Get full access to Latent Space at www.latent.space/subscribe

american spotify history trust ai english google israel hell training africa chinese italy data fun ny focus microsoft western italian drop iphone language asian starts human color production oxford stanford math identify incredible conclusion wikipedia tracking focused coca cola context reduce tasks cto images individuals fundamentals f1 stands gap includes pepsi aware amc scores api penn princeton university ivy league coding bing gpt mm minus precision github assuming swag smiles incredibles accuracy glue javascript gt allied acquired llm versus nemo bleu deepen benchmark deep learning grover george miller turk cuz enron benchmarks gpus large language models sats benchmarking yoruba captures cdn premise tradeoffs alessio xtreme gopher varun gre cfr nys benches gpts inference latency puffer businessmen similarity super glue wikihow throughput commonplace gpc lambada paraphrase gbt imagenet plausibility eist uid amazon mechanical turk cifar b yeah latent space cfar mnist entailment

Competitive plasticity to reduce the energetic costs of learning

Play Episode Listen Later Apr 4, 2023

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.04.535544v1?rss=1 Authors: van Rossum, M. C. Abstract: The brain is not only constrained by energy needed to fuel computation, but it is also constrained by energy needed to form memories. Experiments have shown that learning simple conditioning tasks already carries a significant metabolic cost. Yet, learning a task like MNIST to 95% accuracy appears to require at least 10^{8} synaptic updates. Therefore the brain has likely evolved to be able to learn using as little energy as possible. We explored the energy required for learning in feedforward neural networks. Based on a parsimonious energy model, we propose two plasticity restricting algorithms that save energy: 1) only modify synapses with large updates, and 2) restrict plasticity to subsets of synapses that form a path through the network. Combining these two methods leads to substantial energy savings while only incurring a small increase in learning time. In biology networks are often much larger than the task requires. In particular in that case, large savings can be achieved. Thus competitively restricting plasticity helps to save metabolic energy associated to synaptic plasticity. The results might lead to a better understanding of biological plasticity and a better match between artificial and biological learning. Moreover, the algorithms might also benefit hardware because in electronics memory storage is energetically costly as well. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

learning llc experiments reduce costs competitive copy energetic rossum plasticity biorxiv mnist

Predictive coding with spiking neurons and feedforward gist signalling

Play Episode Listen Later Apr 3, 2023

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.04.03.535317v1?rss=1 Authors: Lee, K., Dora, S., Mejias, J., Bohte, S., Pennartz, C. Abstract: Predictive coding (PC) is an influential theory in neuroscience, which suggests the existence of a cortical architecture that is constantly generating and updating predictive representations of sensory inputs. Owing to its hierarchical and generative nature, PC has inspired many computational models of perception in the literature. However, the biological plausibility of existing models has not been sufficiently explored due to their use of artificial neural network features such as a non-linear, continuous, and clock-driven function approximator as basic unit of computation. Therefore, we have developed a spiking neural network for predictive coding (SNN-PC), in which neurons communicate using event-driven and asynchronous spikes. While adopting the hierarchical structure and Hebbian learning algorithms from previous PC neural network models, SNN-PC introduces two novel features: 1) a fast feedforward sweep from the input to higher areas, which generates a spatially reduced and abstract representation of input (i.e., a neural code for the gist of a scene) and provides a neurobiological alternative to an arbitrary choice of priors; and 2) a separation of positive and negative error-computing neurons, which counters the biological implausibility of a bi-directional error neuron with a very high basal firing rate. After training with the MNIST handwritten digit dataset, SNN-PC developed hierarchical internal representations and was able to reconstruct samples it had not seen during training. SNN-PC suggests biologically plausible mechanisms by which the brain may perform perceptual inference and learning in an unsupervised manner. In addition, it may be used in neuromorphic applications that can utilize its energy-efficient, event-driven, local learning, and parallel information processing nature. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

llc pc copy coding predictive gist neurons spiking owing signalling feedforward mejias biorxiv mnist

AF - EIS VII: A Challenge for Mechanists by Stephen Casper

Play Episode Listen Later Feb 18, 2023 4:14

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VII: A Challenge for Mechanists, published by Stephen Casper on February 18, 2023 on The AI Alignment Forum. Part 7 of 12 in the Engineer's Interpretability Sequence. Thanks to Neel Nanda. I used some very nicely-written code of his from here. And thanks to both Chris Olah and Neel Nanda for briefly discussing this challenge with me. MI = “mechanistic interpretability” Given a network, recover its labeling function. In the last post, I argued that existing works in MI focus on solving problems that are too easy. Here, I am posing a challenge for mechanists that is still a toy problem but one that is quite a bit less convenient than studying a simple model or circuit implementing a trivial, known task. The the best of my knowledge: Unlike prior work on MI from the AI safety interpretability community, beating this challenge would be the first example of mechanistically explaining a network's solution to a task that was not cherrypicked by the researcher(s) doing so. Gaining a mechanistic understanding of the models in this challenge may be difficult, but it will probably be much less difficult than mechanistically interpreting highly intelligent systems in high stakes settings in the real world. So if an approach can't solve the type of challenge posed here, it may not be very promising for doing much heavy lifting with AI safety work. This post comes with a GitHub repository. Check it out here. The challenge is actually two challenges in one, and the basic idea is similar to some ideas presented in Lindner et al. (2023). Challenge 1, MNIST CNN I made up a nonlinear labeling function that labels approximately half of all MNIST images as 0's and the other half as 1's. Then I trained a small CNN on these labels, and it got 96% testing accuracy. The challenge is to use MI tools on the network to recover that labeling function. Hint 1: The labels are binary. Hint 2: The network gets 95.58% accuracy on the test set. Hint 3: This image may be helpful. Challenge 2, Transformer I made up a labeling function that takes in two integers from 0 to 113 and outputs either a 0 or 1. Then, using a lot of code from Neel Nanda's grokking work, I trained a 1-layer transformer on half of the data. It then got 97% accuracy on the test half. As before, the challenge is to use MI tools to recover the labeling function. Hint 1: The labels are binary. Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half. Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary... Prizes If you are the first person to send me the labeling function and a mechanistic explanation for either challenge, I will sing your praises on my Twitter, and I would be happy to help you write a post about how you solved a problem I thought would be very difficult. Neel Nanda and I are also offering a cash prize. (Thanks to Neel for offering to contribute to the pool!) Neel will donate $250, and I will donate $500 to a high-impact charity of choice for the first person to solve each challenge. That makes the total donation prize pool $1,500. Good luck For this challenge, I intentionally designed the labeling functions to not be overly simple. But I will not be too surprised if someone reverse-engineers them with MI tools, and if so, I will be extremely interested in how. Neither of the models perfectly label the validation set. One may object that this will make the problem unfairly difficult because if there is no convergence on the same behavior as the actual labeling function, then how is one supposed to find that function inside the model? This is kind of the point though. Real models that real engineers have to work with models don't tend to conveniently grok onto a simple, elegant, programmat...

ai real cnn speech engineers hint gaining ea casper github lindner neel rationalist neel nanda mnist

AF - The effect of horizon length on scaling laws by Jacob Hilton

Play Episode Listen Later Feb 1, 2023 0:57

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The effect of horizon length on scaling laws, published by Jacob Hilton on February 1, 2023 on The AI Alignment Forum. The scaling of optimal model size with compute is a key input into the biological anchors framework for forecasting transformative AI. In particular, the "effective horizon length" introduces a multiplier into this scaling law that can have a big effect on forecasts. This paper studies this scaling law for several RL environments: Procgen, Dota 2 and a toy MNIST-based environment. The last of these is used to study the effect of the task horizon length in a toy setting. There are a number of takeaways for the biological anchors framework, which are summarized in Section 5.4. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

ai laws speech scaling horizon ea hilton dota rl rationalist mnist

AF - Paper: Superposition, Memorization, and Double Descent (Anthropic) by Lawrence Chan

speech ea chan descent anthropic memorization superposition rationalist mnist

Play Episode Listen Later Jan 5, 2023 2:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: Superposition, Memorization, and Double Descent (Anthropic), published by Lawrence Chan on January 5, 2023 on The AI Alignment Forum. (This is a follow up to Anthropic's prior work on Toy Models of Superposition.) The authors study how neural networks interpolate between memorization and generalization in the "ReLU Output" toy model from the first toy model paper: They train models to perform a synthetic regression task with T training points, for models with m hidden dimensions. First, they find that for small training sets, while the features are messy, the training set hidden vectors hi=WXi (the projection of the input datapoints into the hidden space) often show clean structures: They then extend their old definition of feature dimensionality to measure the dimensionalities allocated to each of the training examples: DXi = hi2∑j(^hihj)2and plot this against the data set size (and also test loss): This shows that as you increase the amount of data, you go from a regime with high dimensionality allocated to training vectors and low dimensionality allocated to features, to one where the opposite is true. In between the two, both feature and hidden vector dimensionalities receive low dimensionality, which coincides with an increase in test loss, which they compare to the phenomena of "data double descent" (where as you increase data on overparameterized models with small amounts of regularization, test loss can go up before it goes down). Finally, they visualize how varying T and m affects test loss, and find double descent along both dimensions: They also included some (imo very interesting) experiments from Adam Jermyn, 1) replicating the results, 2) exploring how weight decay interacts with this double descent-like phenomenon, and 3) studying what happens if you repeat particular datapoints. Some limitations of the work, based on my first read through: The authors note that the results seem quite sensitive to hyperparameters, especially for low hidden dimension m∈{2,3}. For example, Adam Jermyn's results differ from the Anthropic Interp team results (though the figures still look qualitatively similar). I'm still not super convinced how much results from the superposition work apply in practice. I'd be interested in seeing more work along the lines of the MNIST preliminary experiment done by Chris Olah at the bottom. (I'll probably have more thoughts as I think for longer.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Episodio 183 - Proyecto - Clasificador de imagenes

Ingenios@s de Sistemas

Play Episode Listen Later Dec 28, 2022 12:06

Hoy vamos a describir un clasificador de imagenes Abra un nuevo cuaderno Colab en su cuenta de Google y asegúrese de tener acceso a Internet y a los paquetes de Python necesarios, como TensorFlow y Keras. Obtenga un conjunto de datos de imágenes etiquetadas que desee utilizar para entrenar y evaluar su modelo. Puede utilizar conjuntos de datos públicos o crear su propio conjunto de datos recopilando imágenes y etiquetándolas manualmente. Un conjunto de datos de imágenes etiquetadas es un conjunto de imágenes que han sido previamente clasificadas y etiquetadas con ciertas categorías o etiquetas. Estos conjuntos de datos se utilizan a menudo para entrenar y evaluar modelos de clasificación de imágenes, ya que proporcionan una forma de evaluar la precisión del modelo al clasificar imágenes nuevas. Por ejemplo, un conjunto de datos de imágenes etiquetadas podría contener imágenes de perros y gatos, etiquetadas como "perro" o "gato", respectivamente. Al entrenar un modelo de clasificación de imágenes en este conjunto de datos, podría evaluar la precisión del modelo al clasificar imágenes de perros y gatos. Existen muchos conjuntos de datos de imágenes etiquetadas disponibles en línea, tanto gratuitos como de pago, que pueden utilizarse para entrenar y evaluar modelos de clasificación de imágenes. Algunos ejemplos populares incluyen el conjunto de datos de imágenes de perros y gatos de Kaggle, el conjunto de datos de imágenes de flores de Oxford y el conjunto de datos de imágenes de dígitos escritos a mano de MNIST. Puede acceder al conjunto de datos de imágenes de perros y gatos de Kaggle de la siguiente manera: Vaya a la página de inicio de Kaggle (https://www.kaggle.com/) y regístrese o inicie sesión en su cuenta. En la barra de búsqueda en la parte superior de la página, escriba "dogs vs. cats" (perros vs. gatos) y presione Enter. Se le llevará a la página del conjunto de datos "Dogs vs. Cats" (Perros vs. Gatos). Haga clic en el botón "Download" (Descargar) para descargar el conjunto de datos a su computadora. Descomprima el archivo descargado para acceder a las imágenes y las etiquetas correspondientes. Las etiquetas se encuentran en el archivo "labels.csv" y las imágenes se encuentran en la carpeta "train" (entrenamiento). Tenga en cuenta que el conjunto de datos de imágenes de perros y gatos de Kaggle consta de 25.000 imágenes de perros y 25.000 imágenes de gatos, divididas en conjuntos de entrenamiento y prueba. Puede utilizar el conjunto de entrenamiento para entrenar un modelo de clasificación de imágenes y el conjunto de prueba para evaluar la precisión del modelo. Calcificación de imagenes Canal telegram Déjame un mensaje de voz

google internet dogs oxford puede algunos proyecto python existen vaya haga gatos tenga colab tensorflow keras kaggle imagenes obtenga mnist

LW - Re-Examining LayerNorm by Eric Winsor

speech examining ea contemporary stretching guevara 1d conjecture winsor carlos ram rationalist lesswrong graphically concretely mnist

Play Episode Listen Later Dec 2, 2022 7:58

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Re-Examining LayerNorm, published by Eric Winsor on December 1, 2022 on LessWrong. Please check out the colab notebook for interactive figures and more detailed technical explanations. This post is part of the work done at Conjecture. Special thanks to Sid Black, Dan Braun, Carlos Ramón Guevara, Beren Millidge, Chris Scammell, Lee Sharkey, and Lucas Teixeira for feedback on early drafts. There's a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and polytopes are unhelpful, or even harmful, in understanding normalization's impact on a network's representations. After all, normalization doesn't alter the direction of vectors, but it still bends lines and planes (the boundaries of polytopes) out of shape. As it turns out, LayerNorm can be used as a general purpose activation function (you can solve MNIST with a LayerNorm MLP, for example). Concretely, it can do things like this: and this: We will explain what's going on in these animations later, but the point is that to develop a strong, principled theory of mechanistic interpretability, we need to grapple with this non-linearity. In this interactive notebook, we study LayerNorm systematically using math and geometric intuition to characterize the ways in which it can manipulate data. We show that the core non-linearity of LayerNorm can be understood via simple geometric primitives. We explain how these basic primitives may perform semantic operations. For example, folding can be viewed as extracting extremal features (e.g. separating extreme temperatures from normal ones). We leverage these primitives to understand more complex low-dimensional classification tasks with multiple layers of non-linearities. The methods and intuition developed here extend to non-linearities beyond LayerNorm, and we plan to extend them in future work. Below is an interactive summary of the content of the notebook. (If you find the summary interesting, you really should check out the notebook: manipulating the figures helps build geometric intuition.) Summary The formula for LayerNorm is something messy like But it turns out the core non-linear operation is (almost) normalizing a vector: Graphically, this function has the iconic sigmoid shape in one dimension (note that in 1D the norm is simply the absolute value). Interesting things start happening when we precompose this normalization function with affine transformations (such as scaling and shifting). Below, we start with a collection of points, (x,y), distributed uniformly on the sphere. Then we compute uϵ(tx,y) (stretch/shrink by a factor of t in the x direction and then normalize). Since normalization guarantees points end up on the circle, this operation stretches the distribution along the circle. The right hand panel illustrates how this stretching is captured in the shape of a 1D activation function (the x coordinate of the input against the x coordinate of the output). For example, when t is close to 5, we see that any points with x coordinates not near 0 get compressed towards x=−1 or x=1. This matches the picture in the circle where we see the points bunch up at the left and right sides of the circle. If we make the same plots with a shifting operation, uϵ(x+t,y), the operation folds the input data. We can envision possible "semantic" uses for these geometric operations. Stretching can be used to perform an approximate "sign" operation (as in + or - sign). For example, it ...

LW - Re-Examining LayerNorm by Eric Winsor

speech examining ea contemporary stretching guevara 1d conjecture winsor carlos ram rationalist lesswrong graphically concretely mnist

Play Episode Listen Later Dec 2, 2022 7:58

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Re-Examining LayerNorm, published by Eric Winsor on December 1, 2022 on LessWrong. Please check out the colab notebook for interactive figures and more detailed technical explanations. This post is part of the work done at Conjecture. Special thanks to Sid Black, Dan Braun, Carlos Ramón Guevara, Beren Millidge, Chris Scammell, Lee Sharkey, and Lucas Teixeira for feedback on early drafts. There's a lot of non-linearities floating around in neural networks these days, but one that often gets overlooked is LayerNorm. This is understandable because it's not "supposed" to be doing anything; it was originally introduced to stabilize training. Contemporary attitudes about LayerNorm's computational power range from "it's just normalizing a vector" to "it can do division apparently". And theories of mechanistic interpretability such as features as directions and polytopes are unhelpful, or even harmful, in understanding normalization's impact on a network's representations. After all, normalization doesn't alter the direction of vectors, but it still bends lines and planes (the boundaries of polytopes) out of shape. As it turns out, LayerNorm can be used as a general purpose activation function (you can solve MNIST with a LayerNorm MLP, for example). Concretely, it can do things like this: and this: We will explain what's going on in these animations later, but the point is that to develop a strong, principled theory of mechanistic interpretability, we need to grapple with this non-linearity. In this interactive notebook, we study LayerNorm systematically using math and geometric intuition to characterize the ways in which it can manipulate data. We show that the core non-linearity of LayerNorm can be understood via simple geometric primitives. We explain how these basic primitives may perform semantic operations. For example, folding can be viewed as extracting extremal features (e.g. separating extreme temperatures from normal ones). We leverage these primitives to understand more complex low-dimensional classification tasks with multiple layers of non-linearities. The methods and intuition developed here extend to non-linearities beyond LayerNorm, and we plan to extend them in future work. Below is an interactive summary of the content of the notebook. (If you find the summary interesting, you really should check out the notebook: manipulating the figures helps build geometric intuition.) Summary The formula for LayerNorm is something messy like But it turns out the core non-linear operation is (almost) normalizing a vector: Graphically, this function has the iconic sigmoid shape in one dimension (note that in 1D the norm is simply the absolute value). Interesting things start happening when we precompose this normalization function with affine transformations (such as scaling and shifting). Below, we start with a collection of points, (x,y), distributed uniformly on the sphere. Then we compute uϵ(tx,y) (stretch/shrink by a factor of t in the x direction and then normalize). Since normalization guarantees points end up on the circle, this operation stretches the distribution along the circle. The right hand panel illustrates how this stretching is captured in the shape of a 1D activation function (the x coordinate of the input against the x coordinate of the output). For example, when t is close to 5, we see that any points with x coordinates not near 0 get compressed towards x=−1 or x=1. This matches the picture in the circle where we see the points bunch up at the left and right sides of the circle. If we make the same plots with a shifting operation, uϵ(x+t,y), the operation folds the input data. We can envision possible "semantic" uses for these geometric operations. Stretching can be used to perform an approximate "sign" operation (as in + or - sign). For example, it ...

Biologically plausible unsupervised learning in neural networks with sparse and asymmetric connectivity

turning llc copy connectivity neural networks asymmetric plausible biologically sparse akerman boltzmann biorxiv unsupervised learning rbm mnist

Play Episode Listen Later Dec 1, 2022

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.11.30.518534v1?rss=1 Authors: Brodersen, P. J. N., Akerman, C. J. Abstract: In the search for biologically plausible but mathematically precise theories of learning in the brain, recent studies have begun to investigate how key assumptions underlying algorithms for supervised learning in artificial neural networks can be relaxed in biologically plausible ways. Turning to unsupervised learning, we develop biologically more plausible variants of the restricted Boltzmann machine (RBM), and benchmark their performance on MNIST. We show that RBMs with asymmetric connectivity can still be successfully trained with contrastive divergence, even if no two units are reciprocally connected. Furthermore, RBMs are able to learn if the forward, visible-to-hidden layer weights are kept constant and only the backward, hidden-to-visible layer weights are updated. These findings indicate that neural networks with biologically plausible connectivity support contrastive learning. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

Training stochastic stabilized supralinear networks by dynamics-neutral growth

Play Episode Listen Later Oct 21, 2022

Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2022.10.19.512820v1?rss=1 Authors: Soo, W. W. M., Lengyel, M. Abstract: There continues to be a trade-off between the biological realism and performance of neural networks. Contemporary deep learning techniques allow neural networks to be trained to perform challenging computations at (near) human-level, but these networks typically violate key biological constraints. More detailed models of biological neural networks can incorporate many of these constraints but typically suffer from subpar performance and trainability. Here, we narrow this gap by developing an effective method for training a canonical model of cortical neural circuits, the stabilized supralinear network (SSN), that in previous work had to be constructed manually or trained with undue constraints. SSNs are particularly challenging to train for the same reasons that make them biologically realistic: they are characterized by strongly-connected excitatory cells and expansive firing rate non-linearities that together make them prone to dynamical instabilities unless stabilized by appropriately tuned recurrent inhibition. Our method avoids such instabilities by initializing a small network and gradually increasing network size via the dynamics-neutral addition of neurons during training. We first show how SSNs can be trained to perform typical machine learning tasks by training an SSN on MNIST classification. We then demonstrate the effectiveness of our method by training an SSN on the challenging task of performing amortized Markov chain Monte Carlo-based inference under a Gaussian scale mixture generative model of natural image patches with a rich and diverse set of basis functions - something that was not possible with previous methods. These results open the way to training realistic cortical-like neural networks on challenging tasks at scale. Copy rights belong to original authors. Visit the link for more info Podcast created by Paper Player, LLC

growth training llc copy dynamics networks contemporary neutral monte carlo markov ssn stochastic stabilized gaussian lengyel biorxiv ssns mnist abstract there

LW - AI Timelines via Cumulative Optimization Power: Less Long, More Short by jacob cannell

Play Episode Listen Later Oct 6, 2022 30:44

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Timelines via Cumulative Optimization Power: Less Long, More Short, published by jacob cannell on October 6, 2022 on LessWrong. TLDR: We can best predict the future by using simple models which best postdict the past (ala Bayes/Solomonoff). A simple model based on net training compute postdicts the relative performance of successful biological and artificial neural networks. Extrapolation of this model into the future leads to short AI timelines: ~75% chance of AGI by 2032. Cumulative Optimization Power[1]: a Simple Model of Intelligence A simple generalized scaling model predicts the emergence of capabilities in trained ANNs(Artificial Neural Nets) and BNNs(Biological Neural Nets): perf ~= P = CT For sufficiently flexible and efficient NN architectures and learning algorithms, the relative intelligence and capabilities of the best systems are simply proportional to net training compute or intra-lifetime cumulative optimization power P, where P = CT (compute ops/cycle training cycles), assuming efficient allocation of (equivalent uncompressed) model capacity bits N roughly proportional to data size bits D. Intelligence Rankings Imagine ordering some large list of successful BNNs(brains or brain modules) by intelligence (using some committee of experts), and from that deriving a relative intelligence score for each BNN. Obviously such a scoring will be noisy in its least significant bits: is a bottlenose dolphin more intelligent than an american crow? But the most significant bits are fairly clear: C. Elegans is less intelligent than Homo Sapiens. Now imagine performing the same tedious ranking process for various successful ANNs. Here the task is more challenging because ANNs tend to be far more specialized, but the general ordering is still clear: char-RNN is less intelligent than GPT-3. We could then naturally combine the two lists, and make more fine-grained comparisons by including specialized sub-modules of BNNs (vision, linguistic processing, etc). The initial theory is that P - intra-lifetime cumulative optimization power (net training compute) - is a very simple model which explains a large amount of the entropy/variance in a rank order intelligence measure: much more so than any other simple proposed candidates (at least that I'm aware of). Since P follow a predictable temporal trajectory due to Moore's Law style technological progress, we can then extrapolate the trends to predict the arrival of AGI. This simple initial theory has a few potential flaws/objections, which we will then address. Initial Exemplars I've semi-randomly chosen 15 exemplars for more detailed analysis: 8 BNNs, and 9 ANNs. Here are the 8 BNNs (6 whole brains and 2 sub-systems) in randomized order: Honey Bee Human Raven Human Linguistic Cortex Cat C. Elegans Lizard Owl Monkey Visual Cortex The ranking of the 6 full brains in intelligence is rather obvious and likely uncontroversial. Ranking all 8 BNNs in terms of P (net training compute) is still fairly obvious. Here are the 9 ANNs, also initially in randomized order: AlphaGo: First ANN to achieve human pro-level play in Go Deepspeech 2: ANN speech transcription system VPT: Diamond-level minecraft play Alexnet: Early CNN imagenet milestone, subhuman performance 6-L MNIST MLP: Early CNN milestone on MNIST, human level Chinchilla: A 'Foundation' Large Language Model GPT-3: A 'Foundation' Large Language Model DQN Atari: First strong ANN for Atari, human level on some games VIT L/14@336px: OpenAI CLIP 'Foundation' Large Vision Model Most of these systems are specialists in non-overlapping domains, such that direct performance comparison is mostly meaningless, but the ranking of the 3 vision systems should be rather obvious based on the descriptions. The DQN Atari and VPT agents are somewhat comparable to animal brains. How would you ran...

ai law speech ranking ea gpt optimization atari timelines agi homo sapiens nn cumulative bnn ann s rationalist cannell extrapolation elegans rnn lesswrong vpt mnist tldr we

LW - AI Timelines via Cumulative Optimization Power: Less Long, More Short by jacob cannell

Play Episode Listen Later Oct 6, 2022 30:44

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Timelines via Cumulative Optimization Power: Less Long, More Short, published by jacob cannell on October 6, 2022 on LessWrong. TLDR: We can best predict the future by using simple models which best postdict the past (ala Bayes/Solomonoff). A simple model based on net training compute postdicts the relative performance of successful biological and artificial neural networks. Extrapolation of this model into the future leads to short AI timelines: ~75% chance of AGI by 2032. Cumulative Optimization Power[1]: a Simple Model of Intelligence A simple generalized scaling model predicts the emergence of capabilities in trained ANNs(Artificial Neural Nets) and BNNs(Biological Neural Nets): perf ~= P = CT For sufficiently flexible and efficient NN architectures and learning algorithms, the relative intelligence and capabilities of the best systems are simply proportional to net training compute or intra-lifetime cumulative optimization power P, where P = CT (compute ops/cycle training cycles), assuming efficient allocation of (equivalent uncompressed) model capacity bits N roughly proportional to data size bits D. Intelligence Rankings Imagine ordering some large list of successful BNNs(brains or brain modules) by intelligence (using some committee of experts), and from that deriving a relative intelligence score for each BNN. Obviously such a scoring will be noisy in its least significant bits: is a bottlenose dolphin more intelligent than an american crow? But the most significant bits are fairly clear: C. Elegans is less intelligent than Homo Sapiens. Now imagine performing the same tedious ranking process for various successful ANNs. Here the task is more challenging because ANNs tend to be far more specialized, but the general ordering is still clear: char-RNN is less intelligent than GPT-3. We could then naturally combine the two lists, and make more fine-grained comparisons by including specialized sub-modules of BNNs (vision, linguistic processing, etc). The initial theory is that P - intra-lifetime cumulative optimization power (net training compute) - is a very simple model which explains a large amount of the entropy/variance in a rank order intelligence measure: much more so than any other simple proposed candidates (at least that I'm aware of). Since P follow a predictable temporal trajectory due to Moore's Law style technological progress, we can then extrapolate the trends to predict the arrival of AGI. This simple initial theory has a few potential flaws/objections, which we will then address. Initial Exemplars I've semi-randomly chosen 15 exemplars for more detailed analysis: 8 BNNs, and 9 ANNs. Here are the 8 BNNs (6 whole brains and 2 sub-systems) in randomized order: Honey Bee Human Raven Human Linguistic Cortex Cat C. Elegans Lizard Owl Monkey Visual Cortex The ranking of the 6 full brains in intelligence is rather obvious and likely uncontroversial. Ranking all 8 BNNs in terms of P (net training compute) is still fairly obvious. Here are the 9 ANNs, also initially in randomized order: AlphaGo: First ANN to achieve human pro-level play in Go Deepspeech 2: ANN speech transcription system VPT: Diamond-level minecraft play Alexnet: Early CNN imagenet milestone, subhuman performance 6-L MNIST MLP: Early CNN milestone on MNIST, human level Chinchilla: A 'Foundation' Large Language Model GPT-3: A 'Foundation' Large Language Model DQN Atari: First strong ANN for Atari, human level on some games VIT L/14@336px: OpenAI CLIP 'Foundation' Large Vision Model Most of these systems are specialists in non-overlapping domains, such that direct performance comparison is mostly meaningless, but the ranking of the 3 vision systems should be rather obvious based on the descriptions. The DQN Atari and VPT agents are somewhat comparable to animal brains. How would you ran...

ai law speech ranking ea gpt optimization atari timelines agi homo sapiens nn cumulative bnn ann s rationalist cannell extrapolation elegans rnn lesswrong vpt mnist tldr we

AF - Four usages of "loss" in AI by Alex Turner

Play Episode Listen Later Oct 2, 2022 9:00

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Four usages of "loss" in AI, published by Alex Turner on October 2, 2022 on The AI Alignment Forum. Summary: What does it mean for a loss function to be "aligned with" human goals? I perceive four different concepts which involve "loss function" in importantly different ways: Physical-loss: The physical implementation of a loss function and the loss computations, Mathematical-loss: The mathematical idealization of a loss function, A loss function "encoding/representing/aligning with" an intended goal, and Agents which "care about achieving low loss." I advocate retaining physical- and mathematical-loss. I advocate dropping 3 in favor of talking directly about desired AI cognition and how the loss function entrains that cognition. I advocate disambiguating 4, because it can refer to a range of physically grounded preferences about loss (e.g. low value at the loss register versus making perfect future predictions). Related: Towards deconfusing wireheading and reward maximization. I'm going to talk about "loss" instead of "reward", but the lessons apply to both. I think it's important to maintain a sharp distinction between the following four concepts. 1: Physically implemented loss The loss function updated my network. This is a statement about computations embedded in physical reality. This statement involves the physically implemented sequence of loss computations which stream in throughout training. For example, the computations engendered by loss_fn = torch.nn.CrossEntropyLoss(). 2: Mathematical loss The loss function is a smooth function of the prediction distribution. This is a statement about the idealized mathematical loss function. These are the mathematical objects you can prove learning theory results about. The Platonic idealization of the learning problem and the mathematical output-grading rule casts a shadow into your computer via its real-world implementation (concept 1). For example, (D,ℓ) where D:={(x,label(x))∣x∈MNIST} is the mathematical idealization of the MNIST dataset, where the x∈R28×28 are the idealized grayscale MNIST images. And ℓ is the mathematical function of cross-entropy (CE) loss between a label prediction distribution and the ground-truth labels. 3: Loss functions "representing" goals I want a loss function which is aligned with the goal of "write good novels." This is an aspirational statement about achieving some kind of correspondence between the loss function and the goal of writing good novels. But what does this statement mean? Suppose you tell me "I have written down a loss function ℓnovel which is perfectly aligned with the goal of 'write good novels'." What experiences should this claim lead me to anticipate? That an agent can only achieve low physical-loss (concept 1) if it has, in physical fact, written a good novel? That in some mathematical idealization of the learning problem (concept 2), loss-minimization only occurs when the agent outputs text which would be found in what we rightly consider to be "good novels"? (But in which mathematical idealization?) That, as a matter of physical fact, if you train an agent on ℓnovel using learning setup X, then you produce a language model which can be easily prompted to output high-quality novels? The imprecision comes from loss functions not directly encoding goals. Loss signals are physically implemented (concept 1) parts of the AI's training process which (physically) update the AI's cognition in certain ways. While a loss function can be involved in the AI's decision-making structure (see items i and ii above), additional information is needed to understand what motivational cognition is being discussed. I think that talking about loss functions being "aligned" encourages bad habits of thought at best, and is nonsensical at worst. I think it makes way more sense to say how you ...

ai loss speech ea physically platonic mathematical usages alex turner rationalist mnist

LW - Deep Q-Networks Explained by Jay Bailey

Play Episode Listen Later Sep 14, 2022 36:48

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Q-Networks Explained, published by Jay Bailey on September 13, 2022 on LessWrong. Note: This is a long post. The post is structured in such a way that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on its own is sufficient to gain a high-level understanding of DQN if you already know how reinforcement learning works.This post was made as my final project for the AGI Safety Fundamentals course.This post aims to provide a distillation of Deep Q-Networks, neural networks trained via Deep Q-Learning. The algorithm, usually just referred to as DQN, is the algorithm that first put deep reinforcement learning on the map. The 2013 paper is also the first paper in the Key Papers section of the excellent Spinning Up In Deep RL resource. As a result, anyone who wants to understand reinforcement learning is likely to start here. Since reading and replicating papers can be difficult, this resource aims to make the learning curve a little easier for those who want to improve their ML knowledge past an initial deep learning course. This post will be focused mainly on the 2015 paper, which adds some extra improvements to the 2013 paper, as well as giving more information that's helpful for replication. This is a long post, but it's broken up into sections so that you can only read the sections you're interested in. It starts with background information, then goes over DQN at three levels of detail, each of which builds off the previous one. You should probably have a basic understanding of supervised learning in order to get the most out of this post. You should understand what a loss function is, what gradient descent is conceptually, and be familiar with the concept of a neural network. Supervised Learning With Neural Networks (Section 1) - Background information on deep learning. This can be safely skipped if you already understand how supervised learning problems like image classifiers work. Reinforcement Learning (Section 2) - Background information on RL. This can be safely skipped if you're already familiar with the basics of reinforcement learning, like states, actions, and why reinforcement learning is harder than supervised learning. DQN Overview (Section 3) - This section presents DQN at a high level, without the math. I recommend everyone read this section. You can stop there, or continue on to Section 4 if you want to understand the math behind it. DQN Technical Explanation (Section 4) - This section builds off of Section 3, and includes the algorithms and equations that make DQN's work. You can skip this section if you just want a high-level overview of what DQN is - the what, but not the how. Replicating DQN (Section 5) - This section builds off of Section 4, and includes tips and tricks to make DQN work in code. I recommend only reading this section if you're interested in replicating this or similar papers yourself. Further Improvements (Section 6) - This section talks about improvements that have been made in the DQN line of research since the original DQN papers were released. If this topic interests you, you can skip Sections 4 and 5 and still understand this section. Regardless of how far you get, feedback is certainly welcome - part of the purpose of this post is to test my fit for doing more distilling work, so if you found this helpful (or unhelpful) I'd love to hear about it! Supervised Learning With Neural Networks (Section 1) To understand supervised learning, let's start with a toy problem that's often used to compare algorithms and make sure they work before applying them to more challenging domains. MNIST is a dataset composed of tens of thousands of hand-written digits from 0 to 9. Each piece of data co...

deep speech ea networks ml sections rl rationalist lesswrong mnist

LW - Deep Q-Networks Explained by Jay Bailey

Play Episode Listen Later Sep 14, 2022 36:48

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep Q-Networks Explained, published by Jay Bailey on September 13, 2022 on LessWrong. Note: This is a long post. The post is structured in such a way that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on its own is sufficient to gain a high-level understanding of DQN if you already know how reinforcement learning works.This post was made as my final project for the AGI Safety Fundamentals course.This post aims to provide a distillation of Deep Q-Networks, neural networks trained via Deep Q-Learning. The algorithm, usually just referred to as DQN, is the algorithm that first put deep reinforcement learning on the map. The 2013 paper is also the first paper in the Key Papers section of the excellent Spinning Up In Deep RL resource. As a result, anyone who wants to understand reinforcement learning is likely to start here. Since reading and replicating papers can be difficult, this resource aims to make the learning curve a little easier for those who want to improve their ML knowledge past an initial deep learning course. This post will be focused mainly on the 2015 paper, which adds some extra improvements to the 2013 paper, as well as giving more information that's helpful for replication. This is a long post, but it's broken up into sections so that you can only read the sections you're interested in. It starts with background information, then goes over DQN at three levels of detail, each of which builds off the previous one. You should probably have a basic understanding of supervised learning in order to get the most out of this post. You should understand what a loss function is, what gradient descent is conceptually, and be familiar with the concept of a neural network. Supervised Learning With Neural Networks (Section 1) - Background information on deep learning. This can be safely skipped if you already understand how supervised learning problems like image classifiers work. Reinforcement Learning (Section 2) - Background information on RL. This can be safely skipped if you're already familiar with the basics of reinforcement learning, like states, actions, and why reinforcement learning is harder than supervised learning. DQN Overview (Section 3) - This section presents DQN at a high level, without the math. I recommend everyone read this section. You can stop there, or continue on to Section 4 if you want to understand the math behind it. DQN Technical Explanation (Section 4) - This section builds off of Section 3, and includes the algorithms and equations that make DQN's work. You can skip this section if you just want a high-level overview of what DQN is - the what, but not the how. Replicating DQN (Section 5) - This section builds off of Section 4, and includes tips and tricks to make DQN work in code. I recommend only reading this section if you're interested in replicating this or similar papers yourself. Further Improvements (Section 6) - This section talks about improvements that have been made in the DQN line of research since the original DQN papers were released. If this topic interests you, you can skip Sections 4 and 5 and still understand this section. Regardless of how far you get, feedback is certainly welcome - part of the purpose of this post is to test my fit for doing more distilling work, so if you found this helpful (or unhelpful) I'd love to hear about it! Supervised Learning With Neural Networks (Section 1) To understand supervised learning, let's start with a toy problem that's often used to compare algorithms and make sure they work before applying them to more challenging domains. MNIST is a dataset composed of tens of thousands of hand-written digits from 0 to 9. Each piece of data co...

deep speech ea networks ml sections rl rationalist lesswrong mnist

LW - Taking the parameters which seem to matter and rotating them until they don't by Garrett Baker

speech large ea parameters rotating rationalist lesswrong mnist

Play Episode Listen Later Aug 26, 2022 2:31

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taking the parameters which seem to matter and rotating them until they don't, published by Garrett Baker on August 26, 2022 on LessWrong. A big bottleneck in interpretability is neural networks are non-local. That is, given the layer setup if we change a small bit of the original activations, then a large bit of the new activations are affected. This is an impediment to finding the circuit-structure of networks. It is difficult to figure out how something works when changing one thing affects everything. The project I'm currently working on aims to fix this issue, without affecting the training dynamics of networks or the function which the network is implementing. The idea is to find a rotation matrix R and insert it with its inverse like below, then group together the rotation with the original activations, and the inverse with the weights and nonlinear function. We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations. This locality is measured by the average sparsity of the jacobian across all the training inputs. We do this because the jacobian is a representation of how each of the inputs affects each of the outputs. Large entries represent large effects. Small entries represent small effects. So if many entries are zero, this means that fewer inputs have an effect on fewer outputs. I.e. local changes to the input cause local changes to the output. This should find us a representation of the activations and interpretations of matrix multiplies that "make sense" in the context of the rest of the network. Another way of thinking about this is that our goal is to find the basis our network is thinking in. Currently I'm getting this method to work on a simple, 3-layer, fully connected MNIST number classifying network. If this seems to give insight into the mechanics of the network after application, the plan is to adapt it to a more complicated network such as a transformer or resnet. I only have preliminary results right now, but they are looking promising: This is the normalized jacobian the middle layer before a rough version of my method: And here is the normalized jacobian after a rough version of my method (the jacobian's output has been set to a basis which maximizes it's sparsity): Thanks David Udell for feedback on the post. I did not listen to everything you said, and if I did the post would have been better This seems important if we'd like to use interpretability work to produce useful conjectures about agency and selection more generally. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

LW - Taking the parameters which seem to matter and rotating them until they don't by Garrett Baker

speech large ea parameters rotating rationalist lesswrong mnist

Play Episode Listen Later Aug 26, 2022 2:31

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Taking the parameters which seem to matter and rotating them until they don't, published by Garrett Baker on August 26, 2022 on LessWrong. A big bottleneck in interpretability is neural networks are non-local. That is, given the layer setup if we change a small bit of the original activations, then a large bit of the new activations are affected. This is an impediment to finding the circuit-structure of networks. It is difficult to figure out how something works when changing one thing affects everything. The project I'm currently working on aims to fix this issue, without affecting the training dynamics of networks or the function which the network is implementing. The idea is to find a rotation matrix R and insert it with its inverse like below, then group together the rotation with the original activations, and the inverse with the weights and nonlinear function. We then can optimize the rotation matrix and its inverse so that local changes in the rotated activation matrix have local effects on the outputted activations. This locality is measured by the average sparsity of the jacobian across all the training inputs. We do this because the jacobian is a representation of how each of the inputs affects each of the outputs. Large entries represent large effects. Small entries represent small effects. So if many entries are zero, this means that fewer inputs have an effect on fewer outputs. I.e. local changes to the input cause local changes to the output. This should find us a representation of the activations and interpretations of matrix multiplies that "make sense" in the context of the rest of the network. Another way of thinking about this is that our goal is to find the basis our network is thinking in. Currently I'm getting this method to work on a simple, 3-layer, fully connected MNIST number classifying network. If this seems to give insight into the mechanics of the network after application, the plan is to adapt it to a more complicated network such as a transformer or resnet. I only have preliminary results right now, but they are looking promising: This is the normalized jacobian the middle layer before a rough version of my method: And here is the normalized jacobian after a rough version of my method (the jacobian's output has been set to a basis which maximizes it's sparsity): Thanks David Udell for feedback on the post. I did not listen to everything you said, and if I did the post would have been better This seems important if we'd like to use interpretability work to produce useful conjectures about agency and selection more generally. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

AF - A Data limited future by Donald Hobson

Play Episode Listen Later Aug 6, 2022 3:17

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Data limited future, published by Donald Hobson on August 6, 2022 on The AI Alignment Forum. This is a prediction of what I think the near future might be like. Suppose the trends in scaling laws roughly continue. Deep learning can do anything if it has enough data. But getting that data is hard. So large language models get better, but not that much better. They are using most of the available high quality text, and it is hard to get more text. Image generation, can do. Self driving cars. Took lots of simulations, lots of gathering data and even building a filmset town in a desert and letting 1000 self driving cars crash all over it for a few months. But we got there. Short videos, can do, 30 second clickbait videos are now generated by something like DALLE-2. Social media profiles. Done. Deep learning can learn to do anything, given enough data. It can't learn to make AI breakthroughs. There just aren't enough examples of humans coming up with breakthroughs. You can train a model to set network parameters, but the best parameters are fairly well understood. And you need to run it on small examples, so you get slightly better parameters for Mnist classifiers. The more choice you give the network, the more likely you are to get something that doesn't function at all. If you just create an RL agent producing code, it won't produce anything that compiles. (or at least anything nontrivial that compiles). If you pretrain on code, your model will output code similar to existing code. So it will output existing algorithms, with the minor adjustments any competent programmer could easily do. Often neural networks. Sometimes k-means or linear regression. If you prompt a large language model for a superintelligent output, you usually get a result like this. This isn't a coincidence, the pattern [prompt asking for superintelligence][Answer degenerating into nonsense] appears many times in the training data. (For example it appears here. And if no one give me a reason not to, it can appear in many other places as well). So in this world, we have AI that is limited to things there is large amounts of data for. If many humans do something every day, and that data is reasonably collectable, an AI can be trained to do it. If the AI can play blindly for a million rounds before it figures something out, then it can do it. Any computer games. Short term picking things up with robot hands etc. If you have lots of robots, and a wholesale supply of eggs, and a team of people to clean up the first attempts, you can train an AI to make an omelette. (Especially if you have many human examples to start with). Some startups are doing this. Perhaps formal theorem provers with deep learning selection of the next line of proof become a thing. The world has reached a new equilibrium. The thing humans still hold over the machines is data efficiency. And there it will stay until someone either manages to invent a much more data efficient algorithm, an abstract theory step that will hopefully take a while, or someone stumbles onto a much more data efficient algorithm trying to make some routine AI, or someone manages to cajole code for a superintelligence out of a language model. This is an equilibrium that could plausibly remain for more than a decade. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

ai social deep data speech limited ea dalle hobson rl rationalist mnist

AF - Ten experiments in modularity, which we'd like you to run! by TheMcDouglas

Play Episode Listen Later Jun 16, 2022 15:27

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ten experiments in modularity, which we'd like you to run!, published by TheMcDouglas on June 16, 2022 on The AI Alignment Forum. This is the third post describing our team's work on selection theorems for modularity, as part of a project mentored by John Wentworth (see here for the earlier posts). Although the theoretical and empirical parts of the project have both been going very well, we're currently bottlenecked on the empirical side: we have several theories and ideas for how to test them, but few experimental results. Right now, we only have one empiricist coding up experiments, so this overhang seems likely to persist. The purpose of this post is to outline some of our ideas for experiments. We hope that this will provide concrete steps for people who are interested in engaging with empirical research in AI safety, or on selection theorems in particular, to contribute to this area. Experiments 1. Investigate (randomly) modulary varying goals in modern deep learning architectures. In our previous post, we discussed the idea of modularly varying goals in more detail. We conjectured that rapidly switching between a small set of different goals might fine-tune the network to deal with this specific switch, rather than providing selection pressure for modularity. One possible solution to this is RMVG (randomly-sampled modularly varying goals), where we switch between a large set of random goals, all still with the same subtasks, but have a large enough set that we never show the network the same task twice. For example, one could build a CNN to recognise two MNIST digits M1,M2, then perform some algebraic operation on the recognised digits (e.g. addition), and measure the modularity of the zero-training-loss solutions found by ADAM (e.g. using the graph theory measure of modularity with the matrix norm of the CNN kernels as weights - see these CHAI papers for more ideas and discussion). To introduce RMVG, you might switch the task during training from calculating M1+M2 to a×M1+b×M2, where a,b are real numbers, varied periodically after some fixed number of epochs. The network wouldn't get a and b as direct inputs; it could only access them indirectly via their effect on the loss function. This task is modular in the sense that it can be factored into the separate tasks of “recognising two separate digits” and “performing joint arithmetical operations on them”, and we might hope that varying the goal in this way causes the network to learn a corresponding modular solution. Questions you might ask: Does this way of varying goals lead to more modular solutions than the fixed goal of a×M1+b×M2? Can you think of any other modular goals which you could run a similar test for? Are there some types of modular goals which produce modularity more reliably? Does switching training methods (e.g. ADAM to SGD or GD) change anything about the average modularity score of solutions? 2) Measure the broadness of modular and non-modular optima in deep learning architectures. We have some theories that predict modular solutions for tasks to be on average broader in the loss function landscape than non-modular solutions. One could test this by making a CNN and getting it to produce some zero training loss solutions to a task, some of which were more modular than others (see experiment (1) for more detail on this, and (4) for another proposed method to produce modularity), then vary the parameters of these optima slightly to see how quickly the loss increases. We've recently done experiments with simple networks and the retina problem, and this intuition seems to hold up. But more replications here would be very valuable, to examine the extent to which the hypothesis is borne out in different situations / task / network sizes. Questions you might ask: Is there a relationship between bro...

ai cnn speech experiments measure ea investigate chai m1 m2 gd modularity rationalist sgd m1 m2 mnist

LW - Humans are very reliable agents by alyssavance

Play Episode Listen Later Jun 16, 2022 5:33

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humans are very reliable agents, published by alyssavance on June 16, 2022 on LessWrong. Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results? Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. The best current image models have ~99.9% reliability on MNIST, an easy handwriting recognition task. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models. By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.999999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.) (The numbers here are only rough Fermi estimates. I'm sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I'm confident that whichever way you do the math, you'll still find that humans are many orders of magnitude more reliable.) Other types of accidents are similarly rare. Eg. pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.99999999%. Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~99.9999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth. The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, playing poker, proving theorems and so on. Some tasks have been solved or have a solution "in sight", but right now, we're nowhere close to an AI that can replace human surgeons; robot-assisted surgeries still have manual control by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like bl...

new york ai google humans speech deaths ea gpt ml reliable matching cosmetic fermi rationalist ai impacts lesswrong mnist redwood research

LW - Humans are very reliable agents by alyssavance

Open Tech Talks : Technology worth Talking| Blogging |Lifestyle

Play Episode Listen Later Jun 16, 2022 5:33

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Humans are very reliable agents, published by alyssavance on June 16, 2022 on LessWrong. Over the last few years, deep-learning-based AI has progressed extremely rapidly in fields like natural language processing and image generation. However, self-driving cars seem stuck in perpetual beta mode, and aggressive predictions there have repeatedly been disappointing. Google's self-driving project started four years before AlexNet kicked off the deep learning revolution, and it still isn't deployed at large scale, thirteen years later. Why are these fields getting such different results? Right now, I think the biggest answer is that ML benchmarks judge models by average-case performance, while self-driving cars (and many other applications) require matching human worst-case performance. The best current image models have ~99.9% reliability on MNIST, an easy handwriting recognition task. Redwood Research, which is exploring worst-case performance in the context of AI alignment, got reliability rates around 99.997% for their text generation models. By comparison, human drivers are ridiculously reliable. The US has around one traffic fatality per 100 million miles driven; if a human driver makes 100 decisions per mile, that gets you a worst-case reliability of ~1:10,000,000,000 or ~99.999999999%. That's around five orders of magnitude better than a very good deep learning model, and you get that even in an open environment, where data isn't pre-filtered and there are sometimes random mechanical failures. Matching that bar is hard! I'm sure future AI will get there, but each additional "nine" of reliability is typically another unit of engineering effort. (Note that current self-driving systems use a mix of different models embedded in a larger framework, not one model trained end-to-end like GPT-3.) (The numbers here are only rough Fermi estimates. I'm sure one could nitpick them by going into pre-pandemic vs. post-pandemic crash rates, laws in the US vs. other countries, what percentage of crashes are drunk drivers, do drunk drivers count, how often would a really bad decision be fatal, etc. But I'm confident that whichever way you do the math, you'll still find that humans are many orders of magnitude more reliable.) Other types of accidents are similarly rare. Eg. pre-pandemic, there were around 40 million commercial flights per year, but only a handful of fatal crashes. If each flight involves 100 chances for the pilot to crash the plane by screwing up, then that would get you a reliability rate around 1:1,000,000,000, or ~99.99999999%. Even obviously dangerous activities can have very low critical failure rates. For example, shooting is a popular hobby in the US; the US market buys around 10 billion rounds of ammunition per year. There are around 500 accidental gun deaths per year, so shooting a gun has a reliability rate against accidental death of ~1:20,000,000, or 99.999995%. In a military context, the accidental death rate was around ten per year against ~1 billion rounds fired, for a reliability rate of ~99.9999999%. Deaths by fire are very rare compared to how often humans use candles, stoves, and so on; New York subway deaths are rare compared to several billion annual rides; out of hundreds of millions of hikers, only a tiny percentage fall off of cliffs; and so forth. The 2016 AI Impacts survey asked hundreds of AI researchers when they thought AI would be capable of doing certain tasks, playing poker, proving theorems and so on. Some tasks have been solved or have a solution "in sight", but right now, we're nowhere close to an AI that can replace human surgeons; robot-assisted surgeries still have manual control by human operators. Cosmetic surgeries on healthy patients have a fatality rate around 1:300,000, even before excluding unpredictable problems like bl...

new york ai google humans speech deaths ea gpt ml reliable matching cosmetic fermi rationalist ai impacts lesswrong mnist redwood research

104-Open Talk on Open Source Dataset Format for AI

Play Episode Listen Later Jun 10, 2022 37:53

An open talk on the challenges of having a data pipeline for the images, audio and videos, The Hub enables to have several famous machine learning datasets with just a single command, like CIFAR-10, MNIST or Fashion-MNIST, Google Objection, ImageNet, COCO, and many others. As I came from a Relational Database management system (RDBS) background, this talk gives me a new perspective and helps to think outside of the known areas. Enjoy the talk with the CEO of Active Loop. This session was recorded in October 2021 and is now being published. Today's Guest Davit Buniatyan, CEO at ActiveLoop.ai A great insight talk with the guest speaker on a topic, a product owner focuses on a dataset format to offer API for creating, storing, and collaborating on any size of AI datasets. What were the challenges faced in the unstructured data storage and how the hub is offering a solution to solve the data problem? The question of where you will store the large data sets of images, and videos - you will get all the answers in this talk How the opensource Github repo is helping thousands of people to use datasets to PyTorch or TensorFlow with one line of code. Website: ActiveLoop.ai Twitter: activeloop Resources: ActiveLoop hub - dataset for AI Book: Good to Great

ceo ai coco api open source github hub tensorflow pytorch dataset open talk imagenet relational databases cifar mnist

AF - Who models the models that model models? An exploration of GPT-3's in-context model fitting ability by Lovre

Play Episode Listen Later Jun 7, 2022 16:34

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Who models the models that model models? An exploration of GPT-3's in-context model fitting ability, published by Lovre on June 7, 2022 on The AI Alignment Forum. Introduction Much has been written and much has been observed about the abilities of GPT-3 on many tasks. Most of these capabilities, though not all, pertain to writing convicing text, but, not to undermine GPT-3's impressiveness at performing these tasks, we might call this the predictable part of its oeuvre. It only makes sense that a better language modelling is, well, going to be better at writing text. Deeper and comparatively much less explored is the unpredictable ability of GPT-3 to learn new tasks by just seeing a few examples, without any training/backpropagation – the so called in-context learning (sometimes called metalearning). The original paper announcing GPT-3 contained a handful of examples (perhaps mostly notably examples of GPT-3 learning to perform arithmetic, e.g. accurate addition of up to 5-digit numbers), Gwern has also insightfully written about it, Frieda Rong has performed some interesting experiments, and there have been various other experiments one could chance upon. My curiosity being piqued but not sated by these experiments, and also having had the feeling that, as captivating the arithmetic examples were, they weren't the most natural question one could ask about a stochastic model's quantitative capabilities – I decided to investigate whether GPT-3 could fit numerical models in-context. The Setup What does it mean to fit a model in-context? Well, recall that GPT-3 has a context window (roughly speaking: text it considers before generating additional text) of length 2048 tokens (roughly speaking: (parts of the) words, punctuation, etc.) so the idea is to put feature vectors inside that context window. Of course, this means you cannot fit any larger or higher-dimensional dataset in there.[1] In practice this means prompting GPT-3 on input like: Input: 89, 51, 73, 31, output = 1 [...] Input: 96, 51, 80, 38, output = 2 Input: 90, 37, 76, 27, output = And then taking its output, i.e. the text it generates, as the prediction. A couple more technical details: In all the experiments I performed, all the numbers were integers. The GPT-3's outputs were always sampled with temperature 0, i.e. they were deterministic – only the most probable output was considered. I restricted myself that way for simplicity, but hopefully some future work looks at the full distribution over outputs. (In the rest of this post I share GPT-3's results without too much warning, so if you'd like to make your own predictions about how GPT-3 does at simple classification and regression tasks in-context, now would be the time to pause reading and make your general predictions.) Iris Dataset Just as there is the MNIST dataset for visual recognition, so too there is a small low-dimensional classification dataset which every would-be classifier has to solve or else sink – the Iris dataset, composed of 150 observations of septal/petal height/width of three species of iris (50 observations each). Note that classification of the Iris dataset is in some sense trivial, with simple algorithms achieving near-perfect accuracy, but a model still needs to do some meaningful learning, given that it needs to learn to differentiate between three classes based on just a few dozen four-dimensional vectors. In order to prevent leakage, as the Iris dataset is all over the internet, as well as to get the more-easily-palatable integers out, I transformed every feature with the transformation xnew=round(14xold+6). I also split the dataset into 50% train and 50% test folds – so that the model "trained" on, or rather looked at 75 examples. I hadn't quite been expecting what had followed – I had expected GPT-3 to catch on to some patt...

speech context deeper models ability exploration ea input gpt fitting rationalist mnist gwern

Recommandé n°54 - Une nouvelle dimension (au sens le plus mathématique du terme)

Recommandé

Play Episode Listen Later Apr 29, 2022 52:56

Episode 54 - Une nouvelle dimension (au sens le plus mathématique du terme) Du détournement, des espaces vectoriels et du google traduction de thaï, c'est l'épisode 54 de Recommandé. En voiture ! Musiques La playlist de l'émission : https://www.youtube.com/playlist?list=PLNjXbZkItxtY6KOqIbTtbpGn-Z1R391On Contact - envoyez-moi des musiques ! Twitter @recommande0 Discord https://discord.gg/4RnA9v7 Illustration C'est une image aplatie en 2D d'un espace vectoriel à 10 dimensions pour le jeu de données MNIST.

discord math dimension nouvelle sens 2d terme recommand mnist

Episode 10: Deep Neural Networks in Julia with Flux.jl

Talk Julia

Play Episode Listen Later Mar 5, 2022 36:57

David and Randy explore deep neural networks in Julia using Flux.jl by recreating Grant Sanderson's model for predicting handwritten digits in the MNIST data set. We also show how to visualize model results and training performance in TensorBoard using the TensorBoardLogging.jl package.

machine learning flux deep learning neural networks deep neural networks mnist grant sanderson david amos

LW - Future ML Systems Will Be Qualitatively Different by jsteinhardt

Play Episode Listen Later Jan 12, 2022 8:34

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Future ML Systems Will Be Qualitatively Different, published by jsteinhardt on January 11, 2022 on LessWrong. In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More is Different in other domains as well, including biology, economics, and computer science. Some examples of More is Different include: Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction. DNA. Given only small molecules such as calcium, you can't meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome. Water. Individual water molecules aren't wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material). Traffic. A few cars on the road are fine, but with too many you get a traffic jam. It could be that 10,000 cars could traverse a highway easily in 15 minutes, but 20,000 on the road at once could take over an hour. Specialization. Historically, in small populations, virtually everyone needed to farm or hunt to survive; in contrast, in larger and denser communities, enough food is produced for large fractions of the population to specialize in non-agricultural work. While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I'll use emergence to refer to qualitative changes that arise from quantitative increases in scale, and phase transitions for cases where the change is sharp. In this post, I'll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks. Emergent Shifts in the History of AI There have already been several examples of quantitative differences leading to important qualitative changes in machine learning. Storage and Learning. The emergence of machine learning as a viable approach to AI is itself an example of More Is Different. While learning had been discussed since the 1950s, it wasn't until the 80s-90s that it became a dominant paradigm: for instance, IBM's first statistical translation model was published in 1988, even though the idea was proposed in 1949. Not coincidentally, 1GB of storage cost over $100k in 1981 but only around $9k in 1990 (adjusted to 2021 dollars). The Hansard corpus used to train IBM's model comprised 2.87 million sentences and would have been difficult to use before the 80s. Even the simple MNIST dataset would have required $4000 in hardware just to store in 1981, but that had fallen to a few dollars by 1998 when it was published. Cheaper hardware thus allowed for a qualitatively new approach to AI: in other words, More storage enabled Different approaches. Compute, Data, and Neural Networks. As hardware improved, it became possible to train neural networks that were very deep for the first time. Better compute enabled bigger models trained for longer, and better storage enabled learning from more data; AlexNet-sized models and ImageNet-sized datasets wouldn't have been feasible for researchers to experiment with in 1990. Deep learning performs well with lots of data and compute, but struggles at smaller scales. Without many resources, simpler algorithms tend to outperform it, but with sufficient resources it pulls far ahead of the pack. This reversal of fortune led to qualitative changes in the field. As one example, the field of machine t...

LW - Future ML Systems Will Be Qualitatively Different by jsteinhardt

Play Episode Listen Later Jan 12, 2022 8:34

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Future ML Systems Will Be Qualitatively Different, published by jsteinhardt on January 11, 2022 on LessWrong. In 1972, the Nobel prize-winning physicist Philip Anderson wrote the essay "More Is Different". In it, he argues that quantitative changes can lead to qualitatively different and unexpected phenomena. While he focused on physics, one can find many examples of More is Different in other domains as well, including biology, economics, and computer science. Some examples of More is Different include: Uranium. With a bit of uranium, nothing special happens; with a large amount of uranium packed densely enough, you get a nuclear reaction. DNA. Given only small molecules such as calcium, you can't meaningfully encode useful information; given larger molecules such as DNA, you can encode a genome. Water. Individual water molecules aren't wet. Wetness only occurs due to the interaction forces between many water molecules interspersed throughout a fabric (or other material). Traffic. A few cars on the road are fine, but with too many you get a traffic jam. It could be that 10,000 cars could traverse a highway easily in 15 minutes, but 20,000 on the road at once could take over an hour. Specialization. Historically, in small populations, virtually everyone needed to farm or hunt to survive; in contrast, in larger and denser communities, enough food is produced for large fractions of the population to specialize in non-agricultural work. While some of the examples, like uranium, correspond to a sharp transition, others like specialization are more continuous. I'll use emergence to refer to qualitative changes that arise from quantitative increases in scale, and phase transitions for cases where the change is sharp. In this post, I'll argue that emergence often occurs in the field of AI, and that this should significantly affect our intuitions about the long-term development and deployment of AI systems. We should expect weird and surprising phenomena to emerge as we scale up systems. This presents opportunities, but also poses important risks. Emergent Shifts in the History of AI There have already been several examples of quantitative differences leading to important qualitative changes in machine learning. Storage and Learning. The emergence of machine learning as a viable approach to AI is itself an example of More Is Different. While learning had been discussed since the 1950s, it wasn't until the 80s-90s that it became a dominant paradigm: for instance, IBM's first statistical translation model was published in 1988, even though the idea was proposed in 1949. Not coincidentally, 1GB of storage cost over $100k in 1981 but only around $9k in 1990 (adjusted to 2021 dollars). The Hansard corpus used to train IBM's model comprised 2.87 million sentences and would have been difficult to use before the 80s. Even the simple MNIST dataset would have required $4000 in hardware just to store in 1981, but that had fallen to a few dollars by 1998 when it was published. Cheaper hardware thus allowed for a qualitatively new approach to AI: in other words, More storage enabled Different approaches. Compute, Data, and Neural Networks. As hardware improved, it became possible to train neural networks that were very deep for the first time. Better compute enabled bigger models trained for longer, and better storage enabled learning from more data; AlexNet-sized models and ImageNet-sized datasets wouldn't have been feasible for researchers to experiment with in 1990. Deep learning performs well with lots of data and compute, but struggles at smaller scales. Without many resources, simpler algorithms tend to outperform it, but with sufficient resources it pulls far ahead of the pack. This reversal of fortune led to qualitative changes in the field. As one example, the field of machine t...

LW - D

Play Episode Listen Later Jan 4, 2022 6:29

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D

network speech ea neurons spiking neuron minst command line ann s rationalist lesswrong mnist

LW - D

european union brasil adaptation semi fala domain lugares alemanha como fazer carros nomos explicando supervised learning mnist

Play Episode Listen Later Jan 4, 2022 6:29

Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: D

network speech ea neurons spiking neuron minst command line ann s rationalist lesswrong mnist

#31 - Como fazer carros autônomos dirigirem em todos os lugares? Explicando domain adaptation e semi-supervised learning.

Vida com IA

Play Episode Listen Later Dec 30, 2021 22:01

Fala galera, nesse episódio eu explico como um carro autônomo pode ser treinado para dirigir em um país, na Alemanha por exemplo, e mesmo assim ser capaz de dirigir no Brasil com tudo tão diferente. Para explicar isso, eu falo sobre o conceito de domain adaptation (adaptação de domínio) e também semi-supervised learning (aprendizado semi supervisionado). Eu explico a ideia com a primeira arquitetura proposta and também falo sobre o estado da arte que temos hoje para resolver o problema. Instagram: https://www.instagram.com/podcast.lifewithai/ Linkedin: https://www.linkedin.com/company/life-with-ai Código: https://github.com/filipelauar/projects/blob/main/domain_adaptation_semi_supervised_learning_MNIST_pytorch.ipynb

#31 - How to make autonomous vehicles drive in different places? Understanding domain adaptation and semi-supervised learning.

Life with AI

Play Episode Listen Later Dec 30, 2021 20:48

Hey guys, in this episode I explain how an autonomous vehicle can learn to drive in a specific city and be able to generalize the driving knowledge to any other city. To explain it, I show the concept of of semi-supervised learning and domain adaptation. I explain the the ideia with the first proposed architecture and what we have today as state of the art for this problem. Instagram: https://www.instagram.com/podcast.lifewithai/ Linkedin: https://www.linkedin.com/company/life-with-ai Code: https://github.com/filipelauar/projects/blob/main/domain_adaptation_semi_supervised_learning_MNIST_pytorch.ipynb

drive code places adaptation semi domain autonomous vehicles supervised learning mnist

Understanding “Deep Double Descent” by evhub

The Nonlinear Library: LessWrong Top Posts

Play Episode Listen Later Dec 11, 2021 9:09

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent” , published by evhub on the AI Alignment Forum. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal ...

deep speech ea openai descent ml occam belkin rationalist sgd mnist

Understanding “Deep Double Descent” by Evan Hubinger

The Nonlinear Library: Alignment Forum Top Posts

Play Episode Listen Later Dec 10, 2021 9:04

Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent”, published by Evan Hubinger on the AI Alignment Forum. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could...

deep speech ea openai descent ml occam belkin rationalist sgd mnist

Alignment Newsletter #168: Four technical topics for which Open Phil is soliciting grant proposals

Alignment Newsletter Podcast

Play Episode Listen Later Oct 28, 2021 16:21

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Request for proposals for projects in AI alignment that work with deep learning systems (Nick Beckstead and Asya Bergal) (summarized by Rohin): Open Philanthropy is seeking proposals for AI safety work in four major areas related to deep learning, each of which I summarize below. Proposals are due January 10, and can seek up to $1M covering up to 2 years. Grantees may later be invited to apply for larger and longer grants. Rohin's opinion: Overall, I like these four directions and am excited to see what comes out of them! I'll comment on specific directions below. RFP: Measuring and forecasting risks (Jacob Steinhardt) (summarized by Rohin): Measurement and forecasting is useful for two reasons. First, it gives us empirical data that can improve our understanding and spur progress. Second, it can allow us to quantitatively compare the safety performance of different systems, which could enable the creation of safety standards. So what makes for a good measurement? 1. Relevance to AI alignment: The measurement exhibits a failure mode that becomes worse as models become larger, or tracks a potential capability that may emerge with further scale (which in turn could enable deception, hacking, resource acquisition, etc). 2. Forward-looking: The measurement helps us understand future issues, not just those that exist today. Isolated examples of a phenomenon are good if we have nothing else, but we'd much prefer to have a systematic understanding of when a phenomenon occurs and how it tends to quantitatively increase or decrease with various factors. See for example scaling laws (AN #87). 3. Rich data source: Not all trends in MNIST generalize to CIFAR-10, and not all trends in CIFAR-10 generalize to ImageNet. Measurements on data sources with rich factors of variation are more likely to give general insights. 4. Soundness and quality: This is a general category for things like “do we know that the signal isn't overwhelmed by the noise” and “are there any reasons that the measurement might produce false positives or false negatives”. What sorts of things might you measure? 1. As you scale up task complexity, how much do you need to scale up human-labeled data to continue to maintain good performance and avoid reward hacking? If you fail at this and there are imperfections in the reward, how bad does this become? 2. What changes do we observe based on changes in the quality of the human feedback (e.g. getting feedback from amateurs vs experts)? This could give us information about the acceptable “difference in intelligence” between a model and its supervisor. 3. What happens when models are pushed out of distribution along a factor of variation that was not varied in the pretraining data? 4. To what extent do models provide wrong or undesired outputs in contexts where they are capable of providing the right answer? Rohin's opinion: Measurements generally seem great. One story for impact is that we have a measurement that we think is strongly correlated with x-risk, and we use that measurement to select an AI system that scores low on such a metric. This seems distinctly good and I think would in fact reduce x-risk! But I want to clarify that I don't think it would convince me that the system was safe with high confidence. The conceptual arguments against high confidence in safety seem quite strong and not easily overcome by such measurements. (I'm thinking of objective robustness failures (AN #66) of the form “the model is trying to pursue a simple proxy, but behaves well on the training distribution until it can execute a treacherous turn”.) You can also tell stories where the measurements reveal empirical facts that then help us have high confidence in safety, by allowing us to build better theories and arguments, which can rule out the conceptual arguments above. Separately, these measurements are also useful as a form of legible evidence about risk to others who are more skeptical of conceptual arguments. RFP: Techniques for enhancing human feedback (Ajeya Cotra) (summarized by Rohin): Consider a topic previously analyzed in aligning narrowly superhuman models (AN #141): how can we use human feedback to train models to do what we want, in cases where the models are more knowledgeable than the humans providing the feedback? A variety of techniques have been proposed to solve this problem, including iterated amplification (AN #40), debate (AN #5), recursive reward modeling (AN #34), market making (AN #108), and generalizing from short deliberations to long deliberations. This RFP solicits proposals that aim to test these or other mechanisms on existing systems. There are a variety of ways that to set up the experiments so that the models are more knowledgeable than the humans providing the feedback, for example: 1. Train a language model to accurately explain things about a field that the feedback providers are not familiar with. 2. Train an RL agent to act well in an environment where the RL agent can observe more information than the feedback providers can. 3. Train a multilingual model to translate between English and a foreign language that the feedback providers do not know. RFP: Interpretability (Chris Olah) (summarized by Rohin): The author provides this one sentence summary: We would like to see research building towards the ability to “reverse engineer" trained neural networks into human-understandable algorithms, enabling auditors to catch unanticipated safety problems in these models. This RFP is primarily focused on an aspirational “intermediate” goal: to fully reverse engineer some modern neural network, such as an ImageNet classifier. (Despite the ambition, it is only an “intermediate” goal because what we would eventually need is a general method for cheaply reverse engineering any neural network.) The proposed areas of research are primarily inspired by the Circuits line of work (AN #142): 1. Discovering Features and Circuits: This is the most obvious approach to the aspirational goal. We simply “turn the crank” using existing tools to study new features and circuits, and this fairly often results in an interesting result, that makes progress towards reverse engineering a neural network. 2. Scaling Circuits to Larger Models: So far the largest example of reverse engineering is curve circuits, with 50K parameters. Can we find examples of structure in the neural networks that allow us to drastically reduce the amount of effort required per parameter? (As examples, see equivariance and branch specialization.) 3. Resolving Polysemanticity: One of the core building blocks of the circuits approach is to identify a neuron with a concept, so that connections between neurons can be analyzed as connections between concepts. Unfortunately, some neurons are polysemantic, that is, they encode multiple different concepts. This greatly complicates analysis of the connections and circuits between these neurons. How can we deal with this potential obstacle? Rohin's opinion: The full RFP has many, many more points about these topics; it's 8 pages of remarkably information-dense yet readable prose. If you're at all interested in mechanistic interpretability, I recommend reading it in full. This RFP also has the benefit of having the most obvious pathway to impact: if we understand what algorithm neural networks are running, there's a much better chance that we can catch any problems that arise, especially ones in which the neural network is deliberately optimizing against us. It's one of the few areas where nearly everyone agrees that further progress is especially valuable. RFP: Truthful and honest AI (Owain Evans) (summarized by Rohin): This RFP outlines research projects on Truthful AI (summarized below). They fall under three main categories: 1. Increasing clarity about “truthfulness” and “honesty”. While there are some tentative definitions of these concepts, there is still more precision to be had: for example, how do we deal with statements with ambiguous meanings, or ones involving figurative language? What is the appropriate standard for robustly truthful AI? It seems too strong to require the AI system to never generate a false statement; for example it might misunderstand the meaning of a newly coined piece of jargon. 2. Creating benchmarks and tasks for Truthful AI, such as TruthfulQA (AN #165), which checks for imitative falsehoods. This is not just meant to create a metric to improve on; it may also simply perform as a measurement. For example, we could experimentally evaluate whether honesty generalizes (AN #158), or explore how much truthfulness is reduced when adding in a task-specific objective. 3. Improving the truthfulness of models, for example by finetuning models on curated datasets of truthful utterances, finetuning on human feedback, using debate (AN #5), etc. Besides the societal benefits from truthful AI, building truthful AI systems can also help with AI alignment: 1. A truthful AI system can be used to supervise its own actions, by asking it whether its selected action was good. 2. A robustly truthful AI system could continue to do this after deployment, allowing for ongoing monitoring of the AI system. 3. Similarly, we could have a robustly truthful AI system supervise its own actions in hypothetical scenarios, to make it more robustly aligned. Rohin's opinion: While I agree that making AI systems truthful would then enable many alignment strategies, I'm actually more interested in the methods by which we make AI systems truthful. Many of the ideas suggested in the RFP are ones that would apply for alignment more generally, and aren't particularly specific to truthful AI. So it seems like whatever techniques we used to build truthful AI could then be repurposed for alignment. In other words, I expect that the benefit to AI alignment of working on truthful AI is that it serves as a good test case for methods that aim to impose constraints upon an AI system. In this sense, it is a more challenging, larger version of the ”never describe someone getting injured” challenge (AN #166). Note that I am only talking about how this helps AI alignment; there are also beneficial effects on society from pursuing truthful AI that I haven't talked about here. AI GOVERNANCE Truthful AI: Developing and governing AI that does not lie (Owain Evans, Owen Cotton-Barratt et al) (summarized by Rohin): This paper argues that we should develop both the technical capabilities and the governance mechanisms necessary to ensure that AI systems are truthful. We will primarily think about conversational AI systems here (so not, say, AlphaFold). Some key terms: 1. An AI system is honest if it only makes statements that it actually believes. (This requires you to have some way of ascribing beliefs to the system.) In contrast, truthfulness only checks if statements correspond to reality, without making any claims about the AI system's beliefs. 2. An AI system is broadly truthful if it doesn't lie, volunteers all the relevant information it knows, is well-calibrated and knows the limits of its information, etc. 3. An AI system is narrowly truthful if it avoids making negligent suspected-falsehoods. These are statements that can feasibly be determined by the AI system to be unacceptably likely to be false. Importantly, a narrowly truthful AI is not required to make contentful statements, it can express uncertainty or refuse to answer. This paper argues for narrow truthfulness as the appropriate standard. Broad truthfulness is not very precisely defined, making it challenging to coordinate on. Honesty does not give us the guarantees we want: in settings in which it is advantageous to say false things, AI systems might end up being honest but deluded. They would honestly report their beliefs, but those beliefs might be false. Narrow truthfulness is still a much stronger standard that we impose upon standards. This is desirable, because (1) AI systems need not be constrained by social norms, the way humans are; consequently they need stronger standards, and (2) it may be less costly to enforce that AI systems are narrowly truthful than to enforce that humans are narrowly truthful, so a higher standard is more feasible. Evaluating the (narrow) truthfulness of a model is non-trivial. There are two parts: first, determining whether a given statement is unacceptably likely to be false, and second, determining whether the model was negligent in uttering such a statement. The former could be done by having human processes that study a wide range of information and determine whether a given statement is unacceptably likely to be false. In addition to all of the usual concerns about the challenges of evaluating a model that might know more than you, there is also the challenge that it is not clear exactly what counts as “unacceptably likely to be false”. For example, if a model utters a false statement, but expresses low confidence, how should that be rated? The second part, determining negligence, needs to account for the fact that the AI system might not have had all the necessary information, or that it might not have been capable enough to come to the correct conclusion. One way of handling this is to compare the AI system to other AI systems built in a similar fashion. How might narrow truthfulness be useful? One nice thing it enables is truthfulness amplification, in which we can amplify properties of a model by asking a web of related questions and combining the answers appropriately. For example, if we are concerned that the AI system is deceiving us on just this question, we could ask it whether it is deceiving us, or whether an investigation into its statement would conclude that it was deceptive. As another example, if we are worried that the AI system is making a mistake on some question where its statement isn't obviously false, we can ask it about its evidence for its position and how strong the evidence is (where false statements are more likely to be negligently false). Section 3 is devoted to the potential benefits and costs if we successfully ensure that AI systems are narrowly truthful, with the conclusion that the costs are small relative to the benefits, and can be partially mitigated. Section 6 discusses other potential benefits and costs if we attempt to create truthfulness standards to ensure the AI systems are narrowly truthful. (For example, we might try to create a truthfulness standard, but instead create an institution that makes sure that AI systems follow a particular agenda (by only rating as true the statements that are consistent with that agenda). Section 4 talks about the governance mechanisms we might use to implement a truthfulness standard. Section 5 describes potential approaches for building truthful AI systems. As I mentioned in the highlighted post, these techniques are general alignment techniques that have been specialized for truthful AI. NEWS Q&A Panel on Applying for Grad School (summarized by Rohin): In this event run by AI Safety Support on November 7, current PhD students will share their experiences navigating the application process and AI Safety research in academia. RSVP here. SafeAI Workshop 2022 (summarized by Rohin): The SafeAI workshop at AAAI is now accepting paper submissions, with a deadline of Nov 12. FLI's $25M Grants Program for Existential Risk Reduction (summarized by Rohin): This podcast talks about FLI's recent grants program for x-risk reduction. I've previously mentioned the fellowships (AN #165) they are running as part of this program. As a reminder, the application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship.

Alignment Newsletter #167: Concrete ML safety problems and their relevance to x-risk

Alignment Newsletter Podcast

Play Episode Listen Later Oct 20, 2021 17:10

Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Unsolved Problems in ML Safety (Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt) (summarized by Dan Hendrycks): To make the case for safety to the broader machine learning research community, this paper provides a revised and expanded collection of concrete technical safety research problems, namely: 1. Robustness: Create models that are resilient to adversaries, unusual situations, and Black Swan events. 2. Monitoring: Detect malicious use, monitor predictions, and discover unexpected model functionality. 3. Alignment: Build models that represent and safely optimize hard-to-specify human values. 4. External Safety: Use ML to address risks to how ML systems are handled, including cyberwarfare and global turbulence. Throughout, the paper attempts to clarify problem's motivation and provide concrete project ideas. Dan Hendrycks' opinion: My coauthors and I wrote this paper with the ML research community as our target audience. Here are some thoughts on this topic: 1. The document includes numerous problems that, if left unsolved, would imply that ML systems are unsafe. We need the effort of thousands of researchers to address all of them. This means that the main safety discussions cannot stay within the confines of the relatively small EA community. I think we should aim to have over one third of the ML research community work on safety problems. We need the broader community to treat AI at least as seriously as safety for nuclear power plants. 2. To grow the ML research community, we need to suggest problems that can progressively build the community and organically grow support for elevating safety standards within the existing research ecosystem. Research agendas that pertain to AGI exclusively will not scale sufficiently, and such research will simply not get enough market share in time. If we do not get the machine learning community on board with proactively mitigating risks that already exist, we will have a harder time getting them to mitigate less familiar and unprecedented risks. Rather than try to win over the community with alignment philosophy arguments, I'll try winning them over with interesting problems and try to make work towards safer systems rewarded with prestige. 3. The benefits of a larger ML Safety community are numerous. They can decrease the cost of safety methods and increase the propensity to adopt them. Moreover, to make ML systems have desirable properties, it is necessary to rapidly accumulate incremental improvements, but this requires substantial growth since such gains cannot be produced by just a few card-carrying x-risk researchers with the purest intentions. 4. The community will fail to grow if we ignore near-term concerns or actively exclude or sneer at people who work on problems that are useful for both near- and long-term safety (such as adversaries). The alignment community will need to stop engaging in textbook territorialism and welcome serious hypercompetent researchers who do not post on internet forums or who happen not to subscribe to effective altruism. (We include a community strategy in the Appendix.) 5. We focus on reinforcement learning but also deep learning. Most of the machine learning research community studies deep learning (e.g., text processing, vision) and does not use, say, Bellman equations or PPO. While existentially catastrophic failures will likely require competent sequential decision making agents, the relevant problems and solutions can often be better studied outside of gridworlds and MuJoCo. There is much useful safety research to be done that does not need to be cast as a reinforcement learning problem. 6. To prevent alienating readers, we did not use phrases such as "AGI." AGI-exclusive research will not scale; for most academics and many industry researchers, it's a nonstarter. Likewise, to prevent needless dismissiveness, we kept x-risks implicit, only hinted at them, or used the phrase "permanent catastrophe." I would have personally enjoyed discussing at length how anomaly detection is an indispensable tool for reducing x-risks from Black Balls, engineered microorganisms, and deceptive ML systems. Here are how the problems relate to x-risk: Adversarial Robustness: This is needed for proxy gaming. ML systems encoding proxies must become more robust to optimizers, which is to say they must become more adversarially robust. We make this connection explicit at the bottom of page 9. Black Swans and Tail Risks: It's hard to be safe without high reliability. It's not obvious we'll achieve high reliability even by the time we have systems that are superhuman in important respects. Even though MNIST is solved for typical inputs, we still do not even have an MNIST classifier for atypical inputs that is reliable! Moreover, if optimizing agents become unreliable in the face of novel or extreme events, they could start heavily optimizing the wrong thing. Models accidentally going off the rails poses an x-risk if they are sufficiently powerful (this is related to "competent errors" and "treacherous turns"). If this problem is not solved, optimizers can use these weaknesses; this is a simpler problem on the way to adversarial robustness. Anomaly and Malicious Use Detection: This is an indispensable tool for detecting proxy gaming, Black Balls, engineered microorganisms that present bio x-risks, malicious users who may misalign a model, deceptive ML systems, and rogue ML systems. Representative Outputs: Making models honest is a way to avoid many treacherous turns. Hidden Model Functionality: This also helps avoid treacherous turns. Backdoors is a potentially useful related problem, as it is about detecting latent but potential sharp changes in behavior. Value Learning: Understanding utilities is difficult even for humans. Powerful optimizers will need to achieve a certain, as-of-yet unclear level of superhuman performance at learning our values. Translating Values to Action: Successfully prodding models to optimize our values is necessary for safe outcomes. Proxy Gaming: Obvious. Value Clarification: This is the philosophy bot section. We will need to decide what values to pursue. If we decide poorly, we may lock in or destroy what is of value. It also possible that there is an ongoing moral catastrophe, which we would not want to replicate across the cosmos. Unintended Consequences: This should help models not accidentally work against our values. ML for Cybersecurity: If you believe that AI governance is valuable and that global turbulence risks can increase risks of terrible outcomes, this section is also relevant. Even if some of the components of ML systems are safe, they can become unsafe when traditional software vulnerabilities enable others to control their behavior. Moreover, traditional software vulnerabilities may lead to the proliferation of powerful advanced models, and this may be worse than proliferating nuclear weapons. Informed Decision Making: We want to avoid decision making based on unreliable gut reactions during a time of crisis. This reduces risks of poor governance of advanced systems. Here are some other notes: 1. We use systems theory to motivate inner optimization as we expect motivation will be more convincing to others. 2. Rather than have a broad call for "interpretability," we focus on specific transparency-related problems that are more tractable and neglected. (See the Appendix for a table assessing importance, tractability, and neglectedness.) For example, we include sections on making models honest and detecting emergent functionality. 3. The "External Safety" section can also be thought of as technical research for reducing "Governance" risks. For readers mostly concerned about AI risks from global turbulence, there still is technical research that can be done. Here are some observations while writing the document: 1. Some approaches that were previously very popular are currently neglected, such as inverse reinforcement learning. This may be due to currently low tractability. 2. Five years ago, I started explicitly brainstorming the content for this document. I think it took the whole time for this document to take shape. Moreover, if this were written last fall, the document would be far more confused, since it took around a year after GPT-3 to become reoriented; writing these types of documents shortly after a paradigm shift may be too hasty. 3. When collecting feedback, it was not uncommon for "in-the-know" researchers to make opposite suggestions. Some people thought some of the problems in the Alignment section were unimportant, while others thought they were the most critical. We attempted to include most research directions. [MLSN #1]: ICLR Safety Paper Roundup (Dan Hendrycks) (summarized by Rohin): This is the first issue of the ML Safety Newsletter, which is "a monthly safety newsletter which is designed to cover empirical safety research and be palatable to the broader machine learning research community". Rohin's opinion: I'm very excited to see this newsletter: this is a category of papers that I want to know about and that are relevant to safety, but I don't have the time to read all of these papers given all the other alignment work I read, especially since I don't personally work in these areas and so often find it hard to summarize them or place them in the appropriate context. Dan on the other hand has written many such papers himself and generally knows the area, and so will likely do a much better job than I would. I recommend you subscribe, especially since I'm not going to send a link to each MLSN in this newsletter. TECHNICAL AI ALIGNMENT TECHNICAL AGENDAS AND PRIORITIZATION Selection Theorems: A Program For Understanding Agents (John Wentworth) (summarized by Rohin): This post proposes a research area for understanding agents: selection theorems. A selection theorem is a theorem that tells us something about agents that will be selected for in a broad class of environments. Selection theorems are helpful because (1) they can provide additional assumptions that can help with learning human values, and (2) they can tell us likely properties of the agents we build by accident (think inner alignment concerns). As an example, coherence arguments demonstrate that when an environment presents an agent with “bets” or “lotteries”, where the agent cares only about the outcomes of the bets, then any “good” agent can be represented as maximizing expected utility. (What does it mean to be “good”? This can vary, but one example would be that the agent is not subject to Dutch books, i.e. situations in which it is guaranteed to lose resources.) This can then be turned into a selection argument by combining it with something that selects for “good” agents. For example, evolution will select for agents that don't lose resources for no gain, so humans are likely to be represented as maximizing expected utility. Unfortunately, many coherence arguments implicitly assume that the agent has no internal state, which is not true for humans, so this argument does not clearly work. As another example, our ML training procedures will likely also select for agents that don't waste resources, which could allow us to conclude that the resulting agents can be represented as maximizing expected utility, if the agents don't have internal state. Coherence arguments aren't the only kind of selection theorem. The good(er) regulator theorem (AN #138) provides a set of scenarios under which agents learn an internal “world model”. The Kelly criterion tells us about scenarios in which the best (most selected) agents will make bets as though they are maximizing expected log money. These and other examples are described in this followup post. The rest of this post elaborates on the various parts of a selection theorem, and provides advice on how to make original research contributions in the area of selection theorems. Another followup post describes some useful properties for which the author expects there are useful selections theorems to prove. Rohin's opinion: People sometimes expect me to be against this sort of work, because I wrote Coherence arguments do not imply goal-directed behavior (AN #35). This is not true. My point in that post is that coherence arguments alone are not enough, you need to combine them with some other assumption (for example, that there exists some “resource” over which the agent has no terminal preferences). I do think it is plausible that this research agenda gives us a better picture of agency that tells us something about how AI systems will behave, or something about how to better infer human values. While I am personally more excited about studying particular development paths to AGI rather than more abstract agent models, I do think this research would be more useful than other types of alignment research I have seen proposed. OTHER PROGRESS IN AI MISCELLANEOUS (AI) State of AI Report 2021 (Nathan Benaich and Ian Hogarth) (summarized by Rohin): As with past (AN #15) reports (AN #120), I'm not going to summarize the entire thing, and instead you get the high-level themes that the authors identified: 1. AI is stepping up in more concrete ways, including in mission critical infrastructure. 2. AI-first approaches have taken biology by storm (and we aren't just talking about AlphaFold). 3. Transformers have emerged as a general purpose architecture for machine learning in many domains, not just NLP. 4. Investors have taken notice, with record funding this year into AI startups, and two first ever IPOs for AI-first drug discovery companies, as well as blockbuster IPOs for data infrastructure and cybersecurity companies that help enterprises retool for the AI-first era. 5. The under-resourced AI-alignment efforts from key organisations who are advancing the overall field of AI, as well as concerns about datasets used to train AI models and bias in model evaluation benchmarks, raise important questions about how best to chart the progress of AI systems with rapidly advancing capabilities. 6. AI is now an actual arms race rather than a figurative one, with reports of recent use of autonomous weapons by various militaries. 7. Within the US-China rivalry, China's ascension in research quality and talent training is notable, with Chinese institutions now beating the most prominent Western ones. 8. There is an emergence and nationalisation of large language models. Rohin's opinion: In last year's report (AN #120), I said that their 8 predictions seemed to be going out on a limb, and that even 67% accuracy woud be pretty impressive. This year, they scored their predictions as 5 “Yes”, 1 “Sort of”, and 2 “No”. That being said, they graded “The first 10 trillion parameter dense model” as “Yes”, I believe on the basis that Microsoft had run a couple of steps of training on a 32 trillion parameter dense model. I definitely interpreted the prediction as saying that a 10 trillion parameter model would be trained to completion, which I do not think happened publicly, so I'm inclined to give it a “No”. Still, this does seem like a decent track record for what seemed to me to be non-trivial predictions. This year's predictions seem similarly "out on a limb" as last year's. This year's report included one slide summaries of many papers I've summarized before. I only found one major issue -- the slide on TruthfulQA (AN #165) implies that larger language models are less honest in general, rather than being more likely to imitate human falsehoods. This is actually a pretty good track record, given the number of things they summarized where I would have noticed if there were major issues. NEWS CHAI Internships 2022 (summarized by Rohin): CHAI internships are open once again! Typically, an intern will execute on an AI safety research project proposed by their mentor, resulting in a first-author publication at a workshop. The early deadline is November 23rd and the regular deadline is December 13th.

Topographic VAEs learn Equivariant Capsules (Machine Learning Research Paper Explained)

Yannic Kilcher Videos (Audio Only)

Play Episode Listen Later Sep 21, 2021 32:03

#tvae #topographic #equivariant Variational Autoencoders model the latent space as a set of independent Gaussian random variables, which the decoder maps to a data distribution. However, this independence is not always desired, for example when dealing with video sequences, we know that successive frames are heavily correlated. Thus, any latent space dealing with such data should reflect this in its structure. Topographic VAEs are a framework for defining correlation structures among the latent variables and induce equivariance within the resulting model. This paper shows how such correlation structures can be built by correctly arranging higher-level variables, which are themselves independent Gaussians. OUTLINE: 0:00 - Intro 1:40 - Architecture Overview 6:30 - Comparison to regular VAEs 8:35 - Generative Mechanism Formulation 11:45 - Non-Gaussian Latent Space 17:30 - Topographic Product of Student-t 21:15 - Introducing Temporal Coherence 24:50 - Topographic VAE 27:50 - Experimental Results 31:15 - Conclusion & Comments Paper: https://arxiv.org/abs/2109.01394 Code: https://github.com/akandykeller/topog... Abstract: In this work we seek to bridge the concepts of topographic organization and equivariance in neural networks. To accomplish this, we introduce the Topographic VAE: a novel method for efficiently training deep generative models with topographically organized latent variables. We show that such a model indeed learns to organize its activations according to salient characteristics such as digit class, width, and style on MNIST. Furthermore, through topographic organization over time (i.e. temporal coherence), we demonstrate how predefined latent space transformation operators can be encouraged for observed transformed input sequences -- a primitive form of unsupervised learned equivariance. We demonstrate that this model successfully learns sets of approximately equivariant features (i.e. "capsules") directly from sequences and achieves higher likelihood on correspondingly transforming test sequences. Equivariance is verified quantitatively by measuring the approximate commutativity of the inference network and the sequence transformations. Finally, we demonstrate approximate equivariance to complex transformations, expanding upon the capabilities of existing group equivariant neural networks. Authors: T. Anderson Keller, Max Welling Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/ykilcher BiliBili: https://space.bilibili.com/1824646584 If you want to support me, the best thing to do is to share out the content :) If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this): SubscribeStar: https://www.subscribestar.com/yannick... Patreon: https://www.patreon.com/yannickilcher Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2 Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n

student comparison conclusion minds machine learning outline capsules research papers gaussian learning research topographic mnist gaussians

#025 ML Про стажировки в Яндексе и распознавание рукописных цифр

Machine Learning Podcast

Play Episode Listen Later May 19, 2021 50:28

Этот выпуск состоит из двух частей. Сначала Алина - руководитель бригады подбора стажеров по направлению машинного обучения - немного расскажет о том, как с помощью оплачиваемых стажировок можно получить свой первый опыт работы в ML, а потом я расскажу о том как связать теоретические знания о моделях машинного обучения с практикой их написания и использования на примере учебного датасета изображений рукописных цифр MNIST. Надеюсь, выпуск будет полезен тем, кто уже имеет небольшую теоретическую базу по тому как работает машинное обучение, но еще не очень понимает как от теории перейти к практике написания моделей. Ссылки выпуска: Стажировки в Яндексе (https://yandex.ru/yaintern/) Статья о том, как из пиарщиков девушка стала тестировщиком Яндекс.Станции (https://academy.yandex.ru/posts/byvshiy-piarschik-rasskazyvaet-kak-stat-testirovschikom-yandeks-stantsii) Выпуск подкаста про линейную регрессию (https://anchor.fm/kmsrus/episodes/016-ML-eo11mr/a-a45vl81) Плей-лист selfedu "Нейронные сети на Python. Уроки" (https://www.youtube.com/watch?v=nV7cI5zgOpk&list=PLA0M1Bcd0w8yv0XGiF1wjerjSZVSrYbjh) Мануал "TensorFlow, Keras and deep learning, without a PhD" (https://codelabs.developers.google.com/codelabs/cloud-tensorflow-mnist#0) Буду благодарен за обратную связь! Оставляйте ваши комментарии там, где можно. Например, в Apple Podcasts. Они помогут сделать подкаст лучше! Напишите что вам было понятно, что не очень, какие темы раскрыть, каких гостей пригласить, ну, и вообще в какую сторону катить этот подкаст :) Подписывайтесь на телеграм-канал "Стать специалистом по машинному обучению" (https://t.me/toBeAnMLspecialist) Телеграм автора подкаста (https://t.me/kmsint) Со мной также можно связаться по электронной почте: kms101@yandex.ru Также теперь подкаст можно найти на YouTube (https://www.youtube.com/channel/UCzvfXLNpB2Bbf32dc7a8oDQ?) и Яндекс.Музыке https://music.yandex.ru/album/9781458 Music by Audionautix.com

phd python ml audionautix keras mnist

Guise of the Machines

AI with AI

Play Episode Listen Later Apr 9, 2021 36:28

Andy and Dave discuss the latest in AI news, including a report that systematically examined 62 studies on COVID-19 ML methods (from a pool o 2200+ studies), and found that none of the models were of potential clinical use due to methodological flaws or underlying biases. MIT and Amazon identify pervasive label errors in popular ML datasets (such as MNIST, CIFAR, Imagenet) and demonstrate that models may learn systematic patterns of label error in order to improve their accuracy. DARPA’s Air Combat Evolution program upgrades its virtual program to include new weapons systems and multiple aircraft, with live Phase 2 tests on schedule for later in 2021. Researchers at the University of Waterloo and Northeastern University publish research working toward self-walking robotic exoskeletons. British researchers add a buccinators (cheek) muscle to robotic busts to better synchronize speech and mouth movements. Russian Promobot is developing hyper-realistic skin for humanoid robots. And Anderson Cooper takes a tour of Boston Dynamics. In research, Leverhulme, Cambridge, Imperial College London, and DeepMind UK publish research on the direct human-AI comparison in the animal-AI environment, using human children ages 6-10 and animal-AI agents across 10 levels of task groupings. Josh Bongard and Michael Levin publish Living Things Are Not (20th Century) Machines, a thought piece on updating how we think of machines and what they *could* be. Professors Jason Jones and Steven Skiena are publishing a running AI Dashboard on Public Opinion of AI. The Australian Department of Defence publishes A Method for Ethical AI in Defence. Raghavendra Gadagkar publishes Experiments in Animal Behavior. And Peter Singer and August Cole publish An Eye for a Storm, envisioning a future of professional military education for the Australian Defence Force. Listeners Survey: https://bit.ly/3bqyiHk Click here to visit our website and explore the links mentioned in the episode.

18 Leaky abstractions, ML, and email mountain

Makers.dev

Play Episode Listen Later Mar 10, 2021 50:25

"Christian just do your emails" 00:00 Plumbing problems 01:31 When do you outsource vs. do it yourself? 13:18 Chris's AI research 17:38 Overview of Machine Learning 22:24 PyTorch vs. TensorFlow 26:02 Python, Javascript, and Ruby 29:41 Slogging through email 33:34 Email systems 40:43 Bookmark bankruptcy 41:44 Zettelkasten Google Search browser plugin HOT TAKE 44:34 Just do your emails Christian Timestamps created with https://clips.marketing by @cgenco 9️⃣MNIST number dataset AI practice: https://twitter.com/chrisachard/status/1369313924970143749

ai mountain silicon valley scratch python javascript hard way plumbing leaky abstraction bookmark tensorflow pytorch stumbleupon microconf slogging learn python mnist zed shaw micropreneur

丽莎老师讲机器人之隐私保护突破:高斯差分隐私框架与DL结合

丽莎老师讲机器人

Play Episode Listen Later Jun 1, 2020 10:39

丽莎老师讲机器人之隐私保护新突破:高斯差分隐私框架与深度学习结合欢迎收听丽莎老师讲机器人，想要孩子参加机器人竞赛、创意编程、创客竞赛的辅导，找丽莎老师！欢迎添加微信号：153 5359 2068，或搜索钉钉群：31532843。宾夕法尼亚大学的研究组开发了一个新的数据隐私分析框架，可以在多个类型的机器学习问题中有效保护个人隐私。这个框架现已成功和深度学习结合，并在多个需要保障隐私的深度学习任务中达到最高准确率。什么是差分隐私在这个大数据时代，如何妥善获取和使用与真人相关的数据，渐渐成为迫切需要解决的问题。没有人希望自己生个病，上个网，买件衣服都会被人随意知晓，更别提手机里没有修过的自拍了。一种简单的隐私保护方法就是「匿名」：将收集到的数据中涉及个人信息的特征剔除。可惜这种方法并不可靠，曾有研究将 Netflix 匿名处理过的观影记录通过交叉对比 IMDb 数据库解匿成功，这直接导致了第二届 Netflix 数据分析大奖赛的取消。2006 年，隐私算法的研究迎来了新的里程碑。几位科学家定义了「差分隐私」（以下缩写为 DP），来严谨地分析隐私这个概念。差分隐私很快被证明是个强有效的工具，并被谷歌、苹果、微软、阿里巴巴等各大机构使用。而四位发明者于 2017 年获得了被誉为理论计算机科学界诺贝尔奖的 Godel 奖。要理解差分隐私，我们可以看看下面这个简单的假设检验：假设有两个数据集 S, S'S={小明，小刚，小美}；S'={小红，小刚，小美}这两个数据集是邻近的，因为它们的差异仅体现在一个人上。目的是检验模型是否是基于 S 训练的，这等价于检验小明是否存在于我们的数据中。如果这个假设检验非常困难，那么想要获取小明信息的攻击者就难以得逞。严谨来说，一个随机算法 M 符合 (epsilon,delta)-DP 意味着对于任何的事件 E,从定义不难看出，epsilon 和 delta 越小，隐私性越好。那么，如何实现能保证算法的隐私性呢?具体做法是衡量算法的中间产物（比如梯度）的敏感性，并根据其大小施加一个成正比的噪音。由于噪音的存在，想要窃取小明信息的攻击者便无法确定小明是否在训练集中。在深度神经网络中，每一次迭代都会牺牲一部分隐私来换取性能的提高。我们可以对每个批（batch）的梯度加噪音，从而达到混淆攻击者的目的。当然，噪音加的越大，隐私就越安全，但是随之性能也自然越差。在有限的隐私预算下，很多时候隐私算法的性能表现会不如人意。深度学习经常需要敏感的个人信息来训练。现存的差分隐私定义以及隐私模型都试图在性能和隐私中找到一个平衡。可惜的是，这些尝试仍不能很好的处理两个重要环节：subsampling 和 composition。这导致了隐私算法的性能通常远逊于非隐私算法。高斯差分隐私Gaussian differential privacy (GDP) 高斯差分隐私是最近被提出的一种隐私表示方法。它可以精确的刻画 optimizer 在每个 epoch 所消耗的隐私。GDP高斯差分隐私的表达简洁且是广义的（在 SGD, Adam, Adagrad 等多个优化器上的刻画是完全一样的）。高斯差分隐私GDP 的分析被进一步推广到 Poisson subsampling 和新的优化器上。新的推广得到了理论上严谨的证明，尤其证明了它优于此前最先进的 Moments accountant 方法。在《Gaussian Differential Privacy高斯差分隐私》一文中，宾夕法尼亚大学的研究人员创新性地定义了「f-DP」来刻画隐私。如果用 alpha 来表示第一类错误，beta 来表示第二类错误，对于任何一种拒绝规则 (rejection rule) phi，都存在一个抵换函数 (trade-off function) T：降低第一类错误会导致第二类错误增加，反之亦然。我们将两类错误的和的最小值称为最小错误和。一个随机算法 M 在 S 和 S』上的抵换函数 T 如果始终大于函数 f，那么它就满足 f-DP。对比于传统的 eps,delta-DP，f-DP 使用的是一个函数 f，这也使得其刻画更为自由和准确。作为 f-DP 的一个重要案例，研究人员随后介绍了高斯差分隐私（GDP）来区分两个高斯分布。根据中心极限定理（CLT），任何基于假设检验的隐私定义在极限情况下都会收敛于 GDP。事实上，相对于谷歌在 2016 年提出的，适用于计算 epsilon,delta-DP 的 Moments Accountant (MA) 方法，本文提出的 CLT 方法可以更简易地计算 GDP，而且非常准确。GDP 的好处还不止于此。在最新工作《Deep Learning with Gaussian Differential Privacy》中，卜至祺、董金硕，龙琦和苏炜杰等作者指出 GDP 和 eps,delta-DP 可以通过他们设计的 Dual 函数互相转换。也就是说，研究者可以在 f-DP 的框架下分析算法再转成传统的 dp，或者从传统领域中拿来已有的理论和技巧，不必二次开发。这项技术现在已经在 TensorFlow 中实现。在实验中，将 GDP 和深度学习结合，并在多种类型的任务上取得了不俗的成绩。此前谷歌也曾将 epsDP 和深度学习结合，虽然在 MNIST 图像识别上取得了 97% 的正确率（无隐私算法可达到 99% 以上），在 CIFAR10 上却止步于 73% 的正确率（无隐私算法可达 86%）。而利用 GDP 的精确刻画，作者们在 MNIST 上取得了 98% 的准确率。不仅如此，MA 计算的结果表示 MNIST 的 96.6% 正确率对应的是 9.4% 的最小错误和，意味着攻击者有超过九成的概率猜对一张图片是否在数据集中。而 CLT 的计算表明 epsDP 太过于保守：同样的模型同样的表现，实际对应的最小错误和其实是 77.6%，也就是说隐私并没有损失很多。为了全面探讨 GDP 的优越性，作者在高斯差分隐私GDP 框架下分析了神经网络的表现。实现了 SGD 和 Adam 的隐私版本，并通过让神经网络不断迭代直到高斯差分隐私GDP 达到了 mu=2。在 IMDb（自然语言处理），MovieLens 1M（推荐系统）和 Adult Income（非图像型分类任务）上，GDP 模型都取得了非常接近无隐私模型的性能。例如在 Adult Income 数据上，隐私神经网络和无隐私神经网络表现几乎一样好，意味着隐私也许并不需要以很大的性能牺牲为代价。更进一步的，作者强调文中的神经网络都相对简单（不超过三层），如果使用更复杂更高级的神经网络可以在同样的隐私保证下更显著地提升性能。而另一方面，使用高效的优化算法（减少迭代次数，即隐私的损失次数）也能让性能变得更好。既然 CLT 可以在同样性能的条件下比 MA 更好地保护隐私，那么反过来想，在同样的隐私预算下，GDP 也能显示出更强的性能。作者构思了一个实验来说明这一点：训练一个加了 sigma 噪音的神经网络若干步，通过 MA 可以算出目前损失了多少隐私，通过 CLT 和 Dual 反解出真正必须的噪音 sigma hat。注意 sigma hat 必然小于 sigma，然后训练同一个神经网络但只加 sigma hat 噪音。由于噪音变小，新的神经网络学习效果会更好，而且在每一次迭代，新神经网络都会更好地保护隐私。将神经网络和 GDP 结合，可以更精准地呈现隐私损失，从而更好地保护隐私以及提升隐私算法的性能。另一方面，已有的 (epsilon,delta)-DP 研究也可以嫁接到 GDP 中，为两个领域带来了新的机遇。这一隐私算法领域的新进展给予了研究者们更大的信心去相信，随着机器学习的进一步发展，也许在不远的未来就能以可忽略不计的代价来保护我们的隐私。同时，它也鼓励人们更愿意分享涉及个人信息的数据，来推动机器学习的发展。

netflix moments gdp dp poisson clt gaussian sgd mnist

丽莎老师讲机器人之隐私保护突破:高斯差分隐私框架与DL结合

丽莎老师讲机器人

Play Episode Listen Later Jun 1, 2020 10:39

netflix moments gdp dp poisson clt gaussian sgd mnist

6 Steps to Transition to Data Science from non-CS background

The Data Life Podcast

Play Episode Listen Later Apr 20, 2019 15:58

In this episode we will talk all about the various steps to transition to data science from non computer science backgrounds. One of the main difficulties people face from non-CS backgrounds is how overwhelming it can be to transition to data science field, I talk about my own journey, and share the 6 steps which can help you in your own data science career! 00:00 to 02:10: Introduction 02:11 to 06:00: My Background of moving to data science from electrical engineering 06:01 to 10:56: Steps 1 to 3 covering things like using external APIs, already processed datasets and performing full stack data science work 10:57 to 11:55: Break sponsored by Anchor 11:56: End: Steps 4 to 6 covering things like math and statistics, machine learning pipelines and data structures & algorithms Some useful links: 1) Andrew Ng Deep Learning Specialization Coursera https://www.coursera.org/specializations/deep-learning 2) Intro to Statistics by Sebastien Thrun https://www.udacity.com/course/intro-to-statistics--st101 3) Aurelion Geron's book on machine learning https://www.amazon.com/dp/1491962291/?tag=omnilence-20 4) Pramp for mock algorithm sessions on video https://www.pramp.com/ 5) Leetcode for algorithm question datasets https://leetcode.com/ Some great datasets to get started in machine learning: 6) MNIST for hand written digits https://www.kaggle.com/c/digit-recognizer 7) Iris dataset for flower classification http://archive.ics.uci.edu/ml/datasets/iris 8) IMDB movie reviews https://ai.stanford.edu/~amaas/data/sentiment/ Thanks for listening! --- Send in a voice message: https://anchor.fm/the-data-life-podcast/message Support this podcast: https://anchor.fm/the-data-life-podcast/support

transition statistics imdb cs data science apis mnist

Deep Learning in Optics with Aydogan Ozcan - TWiML Talk #237

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

Play Episode Listen Later Mar 7, 2019 42:07

Today, we’re joined by Aydogan Ozcan, Professor of Electrical and Computer Engineering at UCLA, where his research group focuses on photonics and its applications to nano- and biotechnology. In our conversation, we explore his group's research into the intersection of deep learning and optics, holography and computational imaging. We specifically look at a really interesting project to create all-optical neural networks which work based on diffraction, where the printed pixels of the network are analogous to neurons. We also explore some of the practical applications for their research and other areas of interest for their group. The complete show notes for this episode can be found at twimlai.com/talk/237. Be sure to subscribe to our weekly newsletter at twimlai.com/newsletter!

#007 | Machine Learning болон бататган суралцах нь (reinforment learning)

Humans of MITPU

Play Episode Listen Later Mar 3, 2019 49:48

Манай энэ удаагийн дугаарт Facebook компанийн xиймэл оюун ухааны судалгааны төвд ажилладаг Сайнбаяр оролцон Machine Learning салбарт, тэр дундаа бататган суралцах (Reinforcement Learning) арга барилд гарч буй сонин сайхан болон тус салбарын давуу сул талуудыг тоймлон ярилцлаа. Робот болон дата центрүүдэд нэвтэрсэн энэ хүчээ авч буй салбарын талаар ярилцахдаа Machine Learning салбарт судлаачаар болон хэрэглэгчээр хэрхэн орж суралцах талаар зочин маань үнэтэй зөвлөгөөгөө өгсөн юм. Түүний сонирхуулсан зүйлсийг доор хүргэж байна. 1) https://www.udemy.com/practical-deep-learning-with-pytorch/ 2) Deep Learning Text book: https://www.deeplearningbook.org/ 3) Thesis: https://cs.nyu.edu/media/publications/sukhbaatar_sainbayar.pdf 4) MNIST: http://yann.lecun.com/exdb/mnist/ 5) PyTorch: https://pytorch.org/get-started/locally/

machine learning thesis reinforcement learning pytorch mnist

Distilled Data: 200 Proofs

AI with AI

Play Episode Listen Later Jan 4, 2019 53:12

In shorter news items, Andy and Dave discuss the announcement that the Allen Institute for Artificial Intelligence is partnering with Microsoft Research to connect AI2’s Semantic Scholar academic search engine with Microsoft’s Academic Graph. The University of Pavia in Italy demonstrates an artificial neuron (a perceptron) on an actual quantum processor. Another Tesla on Autopilot has an accident; and Waymo demonstrates that pure imitation learning (with 30 million examples) is not sufficient for teaching a model to drive a car. And Tumblr implements a porn-detecting AI. In research topics, researchers with Facebook AI, MIT, and UC Berkeley demonstrate “dataset distillation,” compressing 60,000 MNIST images into 10 synthetic images. Researchers at University of Maryland demonstrate the ability to hide adversarial attacks from network interpretation; so for networks that visually locate the item identified, that network would locate the “original” item instead of the adversarial item. Adobe and Auburn show that neural networks fail miserably for “out-of-distribution” inputs (or, “strange poses of familiar objects”), and they probe deeper into the parameters that cause the misbehavior. In other news, the AI Narratives Report explores how AI is portrayed and perceived. The AI Index releases its 2018 version. AI researchers have a spirited debate on Twitter about deep learning and symbol manipulation. Quantum Computing: Progress and Prospects provides a deeper look at this nascent technology. And Juergen Schmidhuber gives a TEDx talk on how “true AI” will change everything.

Windows Machine Learning: Hello World (MNIST Edition)

#ifdef WINDOWS - Channel 9

Play Episode Listen Later Nov 12, 2018 7:25

Let's jump to the code! Killian and Rosane work through the Windows ML MNIST tutorial and answer questions developers commonly ask along the way. https://aka.ms/OverviewOfWinML

windows machine learning hello world mnist

238: VLAN-Zezes-ki in Hardware

BSD Now

Play Episode Listen Later Mar 21, 2018 123:38

Looking at Lumina Desktop 2.0, 2 months of KPTI development in SmartOS, OpenBSD email service, an interview with Ryan Zezeski, NomadBSD released, and John Carmack's programming retreat with OpenBSD. This episode was brought to you by Headlines Looking at Lumina Desktop 2.0 (https://www.trueos.org/blog/looking-lumina-desktop-2-0/) A few weeks ago I sat down with Lead Developer Ken Moore of the TrueOS Project to get answers to some of the most frequently asked questions about Lumina Desktop from the open source community. Here is what he said on Lumina Desktop 2.0. Do you have a question for Ken and the rest of the team over at the TrueOS Project? Make sure to read the interview and comment below. We are glad to answer your questions! Ken: Lumina Desktop 2.0 is a significant overhaul compared to Lumina 1.x. Almost every single subsystem of the desktop has been streamlined, resulting in a nearly-total conversion in many important areas. With Lumina Desktop 2.0 we will finally achieve our long-term goal of turning Lumina into a complete, end-to-end management system for the graphical session and removing all the current runtime dependencies from Lumina 1.x (Fluxbox, xscreensaver, compton/xcompmgr). The functionality from those utilities is now provided by Lumina Desktop itself. Going along with the session management changes, we have compressed the entire desktop into a single, multi-threaded binary. This means that if any rogue script or tool starts trying to muck about with the memory used by the desktop (probably even more relevant now than when we started working on this), the entire desktop session will close/crash rather than allowing targeted application crashes to bypass the session security mechanisms. By the same token, this also prevents “man-in-the-middle” type of attacks because the desktop does not use any sort of external messaging system to communicate (looking at you dbus). This also gives a large performance boost to Lumina Desktop The entire system for how a user's settings get saved and loaded has been completely redone, making it a “layered” settings system which allows the default settings (Lumina) to get transparently replaced by system settings (OS/Distributor/SysAdmin) which can get replaced by individual user settings. This results in the actual changes in the user setting files to be kept to a minimum and allows for a smooth transition between updates to the OS or Desktop. This also provides the ability to “restrict” a user's desktop session (based on a system config file) to the default system settings and read-only user sessions for certain business applications. The entire graphical interface has been written in QML in order to fully-utilize hardware-based GPU acceleration with OpenGL while the backend logic and management systems are still written entirely in C++. This results in blazing fast performance on the backend systems (myriad multi-threaded C++ objects) as well as a smooth and responsive graphical interface with all the bells and whistles (drag and drop, compositing, shading, etc). Q: Are there future plans to implement something like Lumina in a MAC Jail? While I have never tried out Lumina in a MAC jail, I do not see anything on that page which should stop it from running in one right now. Lumina is already designed to be run as an unpriviledged user and is very smart about probing the system to find out what is/not available before showing anything to the user. The only thing that comes to mind is that you might need to open up some other system devices so that X11 itself can draw to the display (graphical environment setup is a bit different than CLI environment). Q: I look forward to these changes. I know the last time I used it when I would scroll I would get flashes like the refresh rate was not high enough. It will be nice to have a fast system as well as I know with the more changes Linux is becoming slower. Not once it has loaded but in the loading process. I will do another download when these changes come out and install again and maybe stay this time. If I recall correctly, one of the very first versions of Lumina (pre-1.0) would occasionally flicker. If that is still happening, you might want to verify that you are using the proper video driver for your hardware and/or enable the compositor within the Lumina settings. Q: Why was enlightenment project not considered for TrueOS? It is BSD licensed and is written in C. This was a common question about 4(?) years ago with the first release of the Lumina desktop and it basically boiled down to long-term support and reliability of the underlying toolkit. Some of the things we had to consider were: cross-platform/cross-architecture support, dependency reliability and support framework (Qt5 > EFL), and runtime requirements and dependency tracking (Qt5 is lighter than the EFL). That plus the fact that the EFL specifically states that it is linux-focused and the BSD's are just an afterthought (especially at the time we were doing the evaluation). Q: I have two questions. 1) The default layout of Unity(menu bar with actual menu entries on top and icon dock on the side) is one of the few things I liked about my first voyage into non-Windows systems, and have been missing since moving on to other distros(and now also other non-Linux systems). However in 1.4.0 screenshots on Lumina's site, the OSX-like layout has the menu attached to the window. Will 2.0 be able to have the menus on the bar? 2) Is there any timeline for a public release, or are you taking a “when it's ready” approach? In Lumina you can already put panels on the left/right side of the screen and give you something like the layout of the Unity desktop. The embedded menu system is not available in Lumina because that is not a specification supported by X11 and the window manager standards at the present time. The way that functionality is currently run on Linux is a hacky-bypass of the display system which only really works with the GTK3 and Qt5 toolkits, resulting in very odd overall desktop behavior in mixed environments where some apps use other graphical toolkits. We are targetting the 18.06 STABLE release of TrueOS for Lumina 2, but that is just a guideline and if necessary we will push back the release date to allow for additional testing/fixing as needed. A long two months (https://blog.cooperi.net/a-long-two-months) IllumOS/SmartOS developer Alex Wilson describes the journey of developing KPTI for IllumOS > On Monday (January 1st) I had the day off work for New Year's day, as is usual in most of the western world, so I slept in late. Lou and her friend decided to go to the wax museum and see several tourist attractions around SF, and I decided to pass the day at home reading. That afternoon, work chat started talking about a Tumblr post by pythonsweetness about an Intel hardware security bug. At the time I definitely did not suspect that this was going to occupy most of my working life for the next (almost) two months. Like many people who work on system security, I had read Anders Fogh's post about a "Negative Result" in speculative execution research in July of 2017. At the time I thought it was an interesting writeup and I remember being glad that researchers were looking into this area. I sent the post to Bryan and asked him about his thoughts on it at the time, to which he replied saying that "it would be shocking if they left a way to directly leak out memory in the speculative execution". None of us seriously thought that there would be low-hanging fruit down that research path, but we also felt it was important that there was someone doing work in the area who was committed to public disclosure. At first, after reading the blog post on Monday, we thought (or hoped) that the bug might "just" be a KASLR bypass and wouldn't require a lot of urgency. We tried to reach out to Intel at work to get more information but were met with silence. (We wouldn't hear back from them until after the disclosure was already made public.) The speculation on Tuesday intensified, until finally on Wednesday morning I arrived at the office to find links to late Tuesday night tweets revealing exploits that allowed arbitrary kernel memory reads. Wednesday was not a happy day. Intel finally responded to our emails -- after they had already initiated public disclosure. We all spent a lot of time reading. An arbitrary kernel memory read (an info leak) is not that uncommon as far as bugs go, but for the most part they tend to be fairly easy to fix. The thing that makes the Meltdown and Spectre bugs particularly notable is that in order to mitigate them, a large amount of change is required in very deep low-level parts of the kernel. The kind of deep parts of the kernel where there are 20-year old errata workarounds that were single-line changes that you have to be very careful to not accidentally undo; the kind of parts where, as they say, mortals fear to tread. On Friday we saw the patches Matthew Dillon put together for DragonFlyBSD for the first time. These were the first patches for KPTI that were very straightforward to read and understand, and applied to a BSD-derived kernel that was similar to those I'm accustomed to working on. To mitigate Meltdown (and partially one of the Spectre variants), you have to make sure that speculative execution cannot reach any sensitive data from a user context. This basically means that the pages the kernel uses for anything potentially sensitive have to be unmapped when we are running user code. Traditionally, CPUs that were built to run a multi-user, UNIX-like OS did this by default (SPARC is an example of such a CPU which has completely separate address spaces for the kernel and userland). However, x86 descends from a single-address-space microcontroller that has grown up avoiding backwards-incompatible changes, and has never really introduced a clean notion of multiple address spaces (segmentation is the closest feature really, and it was thrown out for 64-bit AMD64). Instead, operating systems for x86 have generally wound up (at least in the post-AMD64 era) with flat address space models where the kernel text and data is always present in the page table no matter whether you're in user or kernel mode. The kernel mappings simply have the "supervisor" bit set on them so that user code can't directly access them. The mitigation is basically to stop doing this: to stop mapping the kernel text, data and other memory into the page table while we're running in userland. Unfortunately, the x86 design does not make this easy. In order to be able to take interrupts or traps, the CPU has to have a number of structures mapped in the current page table at all times. There is also no ability to tell an x86 CPU that you want it to switch page tables when an interrupt occurs. So, the code that we jump to when we take an interrupt, as well as space for a stack to push context onto have to be available in both page tables. And finally, of course, we need to be able to figure out somehow what the other page table we should switch to is when we enter the kernel. When we looked at the patches for Linux (and also the DragonFlyBSD patches at the time) on Friday and started asking questions, it became pretty evident that the initial work done by both was done under time constraints. Both had left the full kernel text mapped in both page tables, and the Linux trampoline design seemed over-complex. I started talking over some ideas with Robert Mustacchi about ways to fix these and who we should talk to, and reached out to some of my old workmates from the University of Queensland who were involved with OpenBSD. It seemed to me that the OpenBSD developers would care about these issues even more than we did, and would want to work out how to do the mitigation right. I ended up sending an email to Philip Guenther on Friday afternoon, and on Saturday morning I drove an hour or so to meet up with him for coffee to talk page tables and interrupt trampolines. We wound up spending a good 6 hours at the coffee shop, and I came back with several pages of notes and a half-decent idea of the shape of the work to come. One detail we missed that day was the interaction of per-CPU structures with per-process page tables. Much of the interrupt trampoline work is most easily done by using per-CPU structures in memory (and you definitely want a per-CPU stack!). If you combine that with per-process page tables, however, you have a problem: if you leave all the per-CPU areas mapped in all the processes, you will leak information (via Meltdown) about the state of one process to a different one when taking interrupts. In particular, you will leak things like %rip, which ruins all the work being done with PIE and ASLR pretty quickly. So, there are two options: you can either allocate the per-CPU structures per-process (so you end up with $NCPUS * $NPROCS of them); or you can make the page tables per-CPU. OpenBSD, like Linux and the other implementations so far, decided to go down the road of per-CPU per-process pages to solve this issue. For illumos, we took the other route. In illumos, it turned out that we already had per-CPU page tables. Robert and I re-discovered this on the Sunday of that week. We use them for 32-bit processes due to having full P>V PAE support in our kernel (which is, as it turns out, relatively uncommon amongst open-source OS). The logic to deal with creating and managing them and updating them was all already written, and after reading the code we concluded we could basically make a few small changes and re-use all of it. So we did. By the end of that second week, we had a prototype that could get to userland. But, when working on this kind of kernel change we have a rule of thumb we use: after the first 70% of the patch is done and we can boot again, now it's time for the second 70%. In fact it turned out to be more like the second 200% for us -- a tedious long tail of bugs to solve that ended up necessitating some changes in the design as well. At first we borrowed the method that Matt Dillon used for DragonFlyBSD, by putting the temporary "stack" space and state data for the interrupt trampolines into an extra page tacked onto the end of *%gs (in illumos the structure that lives there is the cpu_t). If you read the existing logic in interrupt handlers for dealing with %gs though, you will quickly notice that the corner cases start to build up. There are a bunch of situations where the kernel temporarily alters %gs, and some of the ways to mess it up have security consequences that end up being worse than the bug we're trying to fix. As it turns out, there are no less than 3 different ways that ISRs use to try to get to having the right cpu_t in %gs on illumos, as it turns out, and they are all subtly different. Trying to tell which you should use when requires a bunch of test logic that in turn requires branches and changes to the CPU state, which is difficult to do in a trampoline where you're trying to avoid altering that state as much as possible until you've got the real stack online to push things into. I kept in touch with Philip Guenther and Mike Larkin from the OpenBSD project throughout the weeks that followed. In one of the discussions we had, we talked about the NMI/MCE handlers and the fact that their handling currently on OpenBSD neglected some nasty corner-cases around interrupting an existing trap handler. A big part of the solution to those issues was to use a feature called IST, which allows you to unconditionally change stacks when you take an interrupt. Traditionally, x86 only changes the stack pointer (%rsp on AMD64) while taking an interrupt when there is a privilege level change. If you take an interrupt while already in the kernel, the CPU does not change the stack pointer, and simply pushes the interrupt stack frame onto the stack you're already using. IST makes the change of stack pointer unconditional. If used unwisely, this is a bad idea: if you stay on that stack and turn interrupts back on, you could take another interrupt and clobber the frame you're already in. However, in it I saw a possible way to simplify the KPTI trampoline logic and avoid having to deal with %gs. A few weeks into the project, John Levon joined us at work. He had previously worked on a bunch of Xen-related stuff as well as other parts of the kernel very close to where we were, so he quickly got up to speed with the KPTI work as well. He and I drafted out a "crazy idea" on the whiteboard one afternoon where we would use IST for all interrupts on the system, and put the "stack" they used in the KPTI page on the end of the cpu_t. Then, they could easily use stack-relative addresses to get the page table to change to, then pivot their stack to the real kernel stack memory, and throw away (almost) all the conditional logic. A few days later, we had convinced each other that this was the way to go. Two of the most annoying x86 issues we had to work around were related to the SYSENTER instruction. This instruction is used to make "fast" system calls in 32-bit userland. It has a couple of unfortunate properties: firstly, it doesn't save or restore RFLAGS, so the kernel code has to take care of this (and be very careful not to clobber any of it before saving or after restoring it). Secondly, if you execute SYSENTER with the TF ("trap"/single-step flag) set by a debugger, the resulting debug trap's frame points at kernel code instead of the user code where it actually happened. The first one requires some careful gymnastics on the entry and return trampolines specifically for SYSENTER, while the second is a nasty case that is incidentally made easier by using IST. With IST, we can simply make the debug trap trampoline check for whether we took the trap in another trampoline's code, and reset %cr3 and the destination stack. This works for single-stepping into any of the handlers, not just the one for SYSENTER. To make debugging easier, we decided that traps like the debug/single-step trap (as well as faults like page faults, #GP, etc.) would push their interrupt frame in a different part of the KPTI state page to normal interrupts. We applied this change to all the traps that can interrupt another trampoline (based on the instructions we used). These "paranoid" traps also set a flag in the KPTI struct to mark it busy (and jump to the double-fault handler if it is), to work around some bugs where double-faults are not correctly generated. It's been a long and busy two months, with lots of time spent building, testing, and validating the code. We've run it on as many kinds of machines as we could get our hands on, to try to make sure we catch issues. The time we've spent on this has been validated several times in the process by finding bugs that could have been nasty in production. One great example: our patches on Westmere-EP Xeons were causing busy machines to throw a lot of L0 I-cache parity errors. This seemed very mysterious at first, and it took us a few times seeing it to believe that it was actually our fault. This was actually caused by the accidental activation of a CPU errata for Westmere (B52, "Memory Aliasing of Code Pages May Cause Unpredictable System Behaviour") -- it turned out we had made a typo and put the "cacheable" flag into a variable named flags instead of attrs where it belonged when setting up the page tables. This was causing performance degradation on other machines, but on Westmere it causes cache parity errors as well. This is a great example of the surprising consequences that small mistakes in this kind of code can end up having. In the end, I'm glad that that erratum existed, otherwise it may have been a long time before we caught that bug. As of this week, Mike and Philip have committed the OpenBSD patches for KPTI to their repository, and the patches for illumos are out for review. It's a nice kind of symmetry that the two projects who started on the work together after the public disclosure at the same time are both almost ready to ship at the same time at the other end. I'm feeling hopeful, and looking forward to further future collaborations like this with our cousins, the BSDs. The IllumOS work has since landed, on March 12th (https://github.com/joyent/illumos-joyent/commit/d85fbfe15cf9925f83722b6d62da49d549af615c) *** OpenBSD Email Service (https://github.com/vedetta-com/caesonia) Features Efficient: configured to run on min. 512MB RAM and 20GB SSD, a KVM (cloud) VPS for around $2.50/mo 15GB+ uncompressed Maildir, rivals top free-email providers (grow by upgrading SSD) Email messages are gzip compressed, at least 1/3 more space with level 6 default Server side full text search (headers and body) can be enabled (to use the extra space) Mobile data friendly: IMAPS connections are compressed Subaddress (+tag) support, to filter and monitor email addresses Virtual domains, aliases, and credentials in files, Berkeley DB, or SQLite3 Naive Bayes rspamd filtering with supervised learning: the lowest false positive spam detection rates Carefree automated Spam/ and Trash/ cleaning service (default: older than 30 days) Automated quota management, gently assists when over quota Easy backup MX setup: using the same configuration, install in minutes on a different host Worry-free automated master/master replication with backup MX, prevents accidental loss of email messages Resilient: the backup MX can be used as primary, even when the primary is not down, both perfect replicas Flexible: switching roles is easy, making the process of changing VPS hosts a breeze (no downtime) DMARC (with DKIM and SPF) email-validation system, to detect and prevent email spoofing Daily (spartan) stats, to keep track of things Your sieve scripts and managesieve configuration, let's get started Considerations By design, email message headers need to be public, for exchanges to happen. The body of the message can be encrypted by the user, if desired. Moreover, there is no way to prevent the host from having access to the virtual machine. Therefore, full disk encryption (at rest) may not be necessary. Given our low memory requirements, and the single-purpose concept of email service, Roundcube or other web-based IMAP email clients should be on a different VPS. Antivirus software users (usually) have the service running on their devices. ClamAV can easily be incorporated into this configuration, if affected by the types of malware it protects against, but will require around 1GB additional RAM (or another VPS). Every email message is important, if properly delivered, for Bayes classification. At least 200 ham and 200 spam messages are required to learn what one considers junk. By default (change to use case), a rspamd score above 50% will send the message to Spam/. Moving messages in and out of Spam/ changes this score. After 95%, the message is flagged as "seen" and can be safely ignored. Spamd is effective at greylisting and stopping high volume spam, if it becomes a problem. It will be an option when IPv6 is supported, along with bgp-spamd. System mail is delivered to an alias mapped to a virtual user served by the service. This way, messages are guaranteed to be delivered via encrypted connection. It is not possible for real users to alias, nor mail an external mail address with the default configuration. e.g. puffy@mercury.example.com is wheel, with an alias mapped to (virtual) puffy@example.com, and user (puffy) can be different for each. Interview - Ryan Zezeski - rpz@joyent.com (mailto:rpz@joyent.com) / @rzezeski (https://twitter.com/rzezeski) News Roundup John Carmack's programming retreat to hermit coding with OpenBSD (https://www.facebook.com/permalink.php?story_fbid=2110408722526967&id=100006735798590) After a several year gap, I finally took another week-long programming retreat, where I could work in hermit mode, away from the normal press of work. My wife has been generously offering it to me the last few years, but I'm generally bad at taking vacations from work. As a change of pace from my current Oculus work, I wanted to write some from-scratch-in-C++ neural network implementations, and I wanted to do it with a strictly base OpenBSD system. Someone remarked that is a pretty random pairing, but it worked out ok. Despite not having actually used it, I have always been fond of the idea of OpenBSD — a relatively minimal and opinionated system with a cohesive vision and an emphasis on quality and craftsmanship. Linux is a lot of things, but cohesive isn't one of them. I'm not a Unix geek. I get around ok, but I am most comfortable developing in Visual Studio on Windows. I thought a week of full immersion work in the old school Unix style would be interesting, even if it meant working at a slower pace. It was sort of an adventure in retro computing — this was fvwm and vi. Not vim, actual BSD vi. In the end, I didn't really explore the system all that much, with 95% of my time in just the basic vi / make / gdb operations. I appreciated the good man pages, as I tried to do everything within the self contained system, without resorting to internet searches. Seeing references to 30+ year old things like Tektronix terminals was amusing. I was a little surprised that the C++ support wasn't very good. G++ didn't support C++11, and LLVM C++ didn't play nicely with gdb. Gdb crashed on me a lot as well, I suspect due to C++ issues. I know you can get more recent versions through ports, but I stuck with using the base system. In hindsight, I should have just gone full retro and done everything in ANSI C. I do have plenty of days where, like many older programmers, I think “Maybe C++ isn't as much of a net positive as we assume...”. There is still much that I like, but it isn't a hardship for me to build small projects in plain C. Maybe next time I do this I will try to go full emacs, another major culture that I don't have much exposure to. I have a decent overview understanding of most machine learning algorithms, and I have done some linear classifier and decision tree work, but for some reason I have avoided neural networks. On some level, I suspect that Deep Learning being so trendy tweaked a little bit of contrarian in me, and I still have a little bit of a reflexive bias against “throw everything at the NN and let it sort it out!” In the spirit of my retro theme, I had printed out several of Yann LeCun's old papers and was considering doing everything completely off line, as if I was actually in a mountain cabin somewhere, but I wound up watching a lot of the Stanford CS231N lectures on YouTube, and found them really valuable. Watching lecture videos is something that I very rarely do — it is normally hard for me to feel the time is justified, but on retreat it was great! I don't think I have anything particularly insightful to add about neural networks, but it was a very productive week for me, solidifying “book knowledge” into real experience. I used a common pattern for me: get first results with hacky code, then write a brand new and clean implementation with the lessons learned, so they both exist and can be cross checked. I initially got backprop wrong both times, comparison with numerical differentiation was critical! It is interesting that things still train even when various parts are pretty wrong — as long as the sign is right most of the time, progress is often made. I was pretty happy with my multi-layer neural net code; it wound up in a form that I can just drop it into future efforts. Yes, for anything serious I should use an established library, but there are a lot of times when just having a single .cpp and .h file that you wrote ever line of is convenient. My conv net code just got to the hacky but working phase, I could have used another day or two to make a clean and flexible implementation. One thing I found interesting was that when testing on MNIST with my initial NN before adding any convolutions, I was getting significantly better results than the non-convolutional NN reported for comparison in LeCun ‘98 — right around 2% error on the test set with a single 100 node hidden layer, versus 3% for both wider and deeper nets back then. I attribute this to the modern best practices —ReLU, Softmax, and better initialization. This is one of the most fascinating things about NN work — it is all so simple, and the breakthrough advances are often things that can be expressed with just a few lines of code. It feels like there are some similarities with ray tracing in the graphics world, where you can implement a physically based light transport ray tracer quite quickly, and produce state of the art images if you have the data and enough runtime patience. I got a much better gut-level understanding of overtraining / generalization / regularization by exploring a bunch of training parameters. On the last night before I had to head home, I froze the architecture and just played with hyperparameters. “Training!” Is definitely worse than “Compiling!” for staying focused. Now I get to keep my eyes open for a work opportunity to use the new skills! I am dreading what my email and workspace are going to look like when I get into the office tomorrow. Stack-register Checking (https://undeadly.org/cgi?action=article;sid=20180310000858) Recently, Theo de Raadt (deraadt@) described a new type of mitigation he has been working on together with Stefan Kempf (stefan@): How about we add another new permission! This is not a hardware permission, but a software permission. It is opportunistically enforced by the kernel. The permission is MAP_STACK. If you want to use memory as a stack, you must mmap it with that flag bit. The kernel does so automatically for the stack region of a process's stack. Two other types of stack occur: thread stacks, and alternate signal stacks. Those are handled in clever ways. When a system call happens, we check if the stack-pointer register points to such a page. If it doesn't, the program is killed. We have tightened the ABI. You may no longer point your stack register at non-stack memory. You'll be killed. This checking code is MI, so it works for all platforms. For more detail, see Theo's original message (https://marc.info/?l=openbsd-tech&m=152035796722258&w=2). This is now available in snapshots, and people are finding the first problems in the ports tree already. So far, few issues have been uncovered, but as Theo points out, more testing is necessary: Fairly good results. A total of 4 problems have been found so far. go, SBCL, and two cases in src/regress which failed the new page-alignment requirement. The SBCL and go ones were found at buildtime, since they use themselves to complete build. But more page-alignment violations may be found in ports at runtime. This is something I worry about a bit. So please everyone out there can help: Use snapshots which contain the stack-check diff, update to new packages, and test all possible packages. Really need a lot of testing for this, so please help out. So, everybody, install the latest snapshot and try all your favorite ports. This is the time to report issues you find, so there is a good chance this additional security feature is present in 6.3 (and works with third party software from packages). NomadBSD 1.0 has been released (https://freeshell.de/~mk/projects/nomadbsd.html) NomadBSD is a live system for flash drives, based on FreeBSD® 11.1 (amd64) Change Log The setup process has been improved. Support for optional geli encryption of the home partition has been added Auto-detection of NVIDIA graphics cards and their corresponding driver has been added. (Thanks to holgerw and lme from BSDForen.de) An rc script to start the GEOM disk scheduler on the root device has been added. More software has been added: accessibility/redshift (starts automatically) audio/cantata audio/musicpd audio/ncmpc ftp/filezilla games/bsdtris mail/neomutt math/galculator net-p2p/transmission-qt5 security/fpm2 sysutils/bsdstats x11/metalock x11/xbindkeys Several smaller improvements and bugfixes. Screenshots https://freeshell.de/~mk/projects/nomadbsd-ss1.png https://freeshell.de/~mk/projects/nomadbsd-ss2.png https://freeshell.de/~mk/projects/nomadbsd-ss3.png https://freeshell.de/~mk/projects/nomadbsd-ss4.png https://freeshell.de/~mk/projects/nomadbsd-ss5.png https://freeshell.de/~mk/projects/nomadbsd-ss6.png Beastie Bits KnoxBug - Nagios (http://knoxbug.org/2018-03-27) vBSDcon videos landing (https://www.youtube.com/playlist?list=PLfJr0tWo35bc9FG_reSki2S5S0G8imqB4) AsiaBSDCon 2017 videos (https://www.youtube.com/playlist?list=PLnTFqpZk5ebBTyXedudGm6CwedJGsE2Py) DragonFlyBSD Adds New "Ptr_Restrict" Security Option (https://www.phoronix.com/scan.php?page=news_item&px=DragonFlyBSD-Ptr-Restrict) A Dexter needs your help (https://twitter.com/michaeldexter/status/975603855407788032) Mike Larkin at bhyvecon 2018: OpenBSD vmm(4) update (https://undeadly.org/cgi?action=article;sid=20180309064801) [HEADS UP] - OFED/RDMA stack update (https://lists.freebsd.org/pipermail/freebsd-arch/2018-March/018900.html) *** Feedback/Questions Ron - Interview someone using DragonflyBSD (http://dpaste.com/3BM6GSW#wrap) Brad - Gaming and all (http://dpaste.com/3X4ZZK2#wrap) Mohammad - Sockets vs TCP (http://dpaste.com/0PJMKRD#wrap) Paul - All or at least most of Bryan Cantrill's Talks (http://dpaste.com/2WXVR1X#wrap) ***

university new year interview guide moving training system unity built watching os mobile mac trash windows checking developers intel ram meltdown talks gp sf queensland nvidia hardware tumblr storage how to stack open source stable spam tutorials traditionally linux automated pie spectre oculus abi cpu gpu desktops mx deep learning spf fairly tf efl vps antivirus os x unix cpus cli matt dillon nn digital ocean tcp visual studio ipv6 lumina compiling john carmack bsd bayes sparc dmarc xen alex wilson freebsd yann lecun vlan kvm dkim imap 1gb opengl gdb geom openbsd x11 tektronix anders fogh raadt mnist clamav netbsd qml isrs matthew dillon allan jude bsds aslr ixsystems paul all qt5 trueos tarsnap kpti dragonflybsd smartos vbsdcon

No.70 - 인공지능 딥러닝 기술 소개 및 사례

Software engineering digest

Play Episode Listen Later Jul 10, 2017

안녕하세요. 오늘은 지난 시간에 이어, 몇 년사이에 큰 이슈가 되고 있는 인공지능 딥러닝에 대한 이야기를 간략히 소개해 보겠습니다. 이 내용은 BIM학회에 '인공지능 딥러닝 기술 동향 및 구현 사례'로 소개된 글의 일부 내용입니다. CNN, RNN, LSTM, GAN과 같은 대표적인 딥러닝 신경망 종류 및 3차원 스캔과 관련된 CNN활용 사례 등을 소개합니다.No.70 Podcast 방송 - 인공지능 딥러닝 기술 소개 및 사례소개된 부분에 대한 좀 더 상세한 내용은 다음 레퍼런스에서 확인할 수 있습니다. 1. Daddy maker, 텐서플로우 최신버전 설치 및 개념2. Daddy maker, 텐서플로우 MNIST 딥러닝 및 텐서보드 그래프 가시화3. BIM principle, 2017, 페이스북 딥러닝 CNN 기반 오픈소스 번역 기술 CSSL4. 딥러닝 Hello World - MNIST, CIFAR-10 데이터베이스 구조와 이미지넷5. Konstantin Lackner, 2016, Composing a melody with long-short term memory (LSTM) Recurrent Neural Networks, Institute for DataProcessing Technische Universitat Munchen6. Patrick Hebron, 2016, Unsupervised Learning, patrickhebron.com7. Alec Radford & Luke Metz, Soumith Chintala, 2016, Unsupervised Representation Learning with Deep Convolutional GenerativeAdversarial Network8. Adam Santoro외, 2017, A simple neural network module for relational reasoning, DeepMind9. Microsoft, Going Deep: Convolutional Neural Networks10. The Asimov Institute, 2016, THE NEURAL NETWORK ZOO11. Adit Deshpande, 2016, The 9 Deep Learning Papers You Need To Know About CNN12. Adrian Rosebrock, 2016, My Top 9 Favorite Python Deep Learning Libraries13. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, 2017.5, A novel approach to neural machine translation,FAIR(Facebook AI Research Group)14. Adit Deshpande, A Beginner's Guide To Understanding Convolutional Neural Networks, UCLA15. Rob Verger, 2017.5, Facebook created a faster, more accurate translation system using artificial intelligence, Popular Science16. James Vincent, 2017.5.9, Facebook says its prototype translation technique is nine times faster than rivals, THE VERGE17. Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin, 2017.5, Convolutional Sequence to Sequence Learning,FAIR(Facebook AI Research Group)18. Charles R. Qi, Hao Su, Kaichun Mo, Leonidas J. Guibas, 2016, PointNet: Deep Learning on Point Sets for 3D Classification andSegmentation19. Sepp H., Jurgen S, 1997, Long short-term memory, Neural Computation. 9 (8) ,pp.1735–178020. Goodfellow. Ian J, Pouget-Abadie. J, Mirza. M, Xu. Bing, Warde-Farley. David, Ozair. Sherjil, Courville. Aaron, Bengio. Yoshua, 2014,Generative Adversarial Networks

apple data microsoft institute cnn software daddy engineering ucla iot bing verge object digest kang arial helvetica qi composing gan deepmind bim segmentation dauphin my top xu tae mirza tahoma popular science goodfellow charles r wook james vincent courville unsupervised learning generative adversarial networks rnn articlebody yoshua lstm cifar mnist neural computation adrian rosebrock rob verger

Dynamic Behavior of Coupled Neuron System from a Game-Theoretic Standpoint