Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Simple Toy Coherence Theorem, published by johnswentworth on August 2, 2024 on The AI Alignment Forum. This post presents a simple toy coherence theorem, and then uses it to address various common confusions about coherence arguments. Setting Deterministic MDP. That means at each time t there's a state S[t][1], the agent/policy takes an action A[t] (which can depend on both time t and current state S[t]), and then the next state S[t+1] is fully determined by S[t] and A[t]. The current state and current action are sufficient to tell us the next state. We will think about values over the state at some final time T. Note that often in MDPs there is an incremental reward each timestep in addition to a final reward at the end; in our setting there is zero incremental reward at each timestep. One key point about this setting: if the value over final state is uniform, i.e. the same value for all final states, then the MDP is trivial. In that case, all policies are optimal; it does not matter at all what the final state is or what any state along the way is, everything is equally valuable. Theorem There exist policies which cannot be optimal for any values over final state except for the trivial case of uniform values. Furthermore, such policies are exactly those which display inconsistent revealed preferences, transitively, between all final states. Proof As a specific example: consider an MDP in which every state is reachable at every timestep, and a policy which always stays in the same state over time. From each state S every other state is reachable, yet the policy chooses S, so in order for the policy to be optimal S must be a highest-value final state. Since each state must be a highest-value state, the policy cannot be optimal for any values over final state except for the trivial case of uniform values. That establishes the existence part of the theorem, and you can probably get the whole idea by thinking about how to generalize that example. The rest of the proof extends the idea of that example to inconsistent revealed preferences in general. Bulk of Proof Assume the policy is optimal for some particular values over final state. We can then start from those values over final state and compute the best value achievable starting from each state at each earlier time. That's just dynamic programming: V[S,t] = max over S' reachable in the next timestep from S of V[S',t+1], where V[S,T] are the values over final states. A policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. Now, suppose that at timestep t there are two different states either of which can reach either state A or state B in the next timestep. From one of those states the policy chooses A; from the other the policy chooses B. This is an inconsistent revealed preference between A and B at time t: sometimes the policy has a revealed preference for A over B, sometimes for B over A. In order for a policy with an inconsistent revealed preference between A and B at time t to be optimal, the values must satisfy V[A,t] = V[B,t]. Why? Well, a policy is optimal for final values V[S,T] if-and-only-if at each timestep t-1 it chooses a next state with highest reachable V[S,t]. So, if an optimal policy sometimes chooses A over B at timestep t when both are reachable, then we must have V[A,t] ≥ V[B,t].
And if an optimal policy sometimes chooses B over A at timestep t when both are reachable, then we must have V[A,t] ≤ V[B,t]. If both of those occur, i.e. the policy has an inconsistent revealed preference between A and B at time t, then V[A,t] = V[B,t]. Now, we can propagate that equality to a revealed preference on final states. We know that the final state which the policy in fact reaches starting from A at time t must have the highest reachable value, a...
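To make the dynamic-programming step concrete, here is a minimal sketch in Python (my own illustration, not code from the post): it computes V[S,t] by backward induction for a tiny fully-connected deterministic MDP and checks, by brute force over a coarse grid of candidate final values, that the "always stay put" policy from the proof can only be optimal when the final values are uniform. The state space, horizon, and helper names (reachable, value_table, is_optimal) are all hypothetical choices made for this example.

```python
# Minimal sketch (not from the post): backward induction over a deterministic MDP,
# plus a check for whether a given policy can be optimal for some final-state values.
import itertools

# Hypothetical toy MDP: three states, every state reachable from every state at every step.
STATES = [0, 1, 2]
T = 3  # final time

def reachable(s):
    return STATES  # fully connected, as in the post's example

def value_table(final_values):
    """V[t][s] = best achievable final value starting from s at time t (dynamic programming)."""
    V = {T: dict(final_values)}
    for t in range(T - 1, -1, -1):
        V[t] = {s: max(V[t + 1][s2] for s2 in reachable(s)) for s in STATES}
    return V

def is_optimal(policy, final_values):
    """policy[(s, t)] -> next state; optimal iff every choice attains the highest reachable V[.,t+1]."""
    V = value_table(final_values)
    return all(
        V[t + 1][policy[(s, t)]] == max(V[t + 1][s2] for s2 in reachable(s))
        for s in STATES for t in range(T)
    )

# The "always stay put" policy from the proof's example:
stay = {(s, t): s for s in STATES for t in range(T)}

# Search (coarsely) over candidate final values: only uniform values rate "stay" as optimal.
for fv in itertools.product([0, 1, 2], repeat=len(STATES)):
    final_values = dict(zip(STATES, fv))
    if is_optimal(stay, final_values):
        assert len(set(fv)) == 1  # uniform values
print("'stay put' is optimal only for uniform final values, as the theorem says")
```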
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Linear infra-Bayesian Bandits, published by Vanessa Kosoy on May 10, 2024 on The AI Alignment Forum. Linked is my MSc thesis, where I do regret analysis for an infra-Bayesian[1] generalization of stochastic linear bandits. The main significance that I see in this work is: Expanding our understanding of infra-Bayesian regret bounds, and solidifying our confidence that infra-Bayesianism is a viable approach. Previously, the most interesting IB regret analysis we had was Tian et al., which deals (essentially) with episodic infra-MDPs. My work here doesn't supersede Tian et al. because it only talks about bandits (i.e. stateless infra-Bayesian laws), but it complements it because it deals with a parametric hypothesis space (i.e. fits into the general theme in learning theory that generalization bounds should scale with the dimension of the hypothesis class). Discovering some surprising features of infra-Bayesian learning that have no analogues in classical theory. In particular, it turns out that affine credal sets (i.e. sets that are closed w.r.t. arbitrary affine combinations of distributions and not just convex combinations) have better learning-theoretic properties, and the regret bound depends on additional parameters that don't appear in classical theory (the "generalized sine" S and the "generalized condition number" R). Credal sets defined using conditional probabilities (related to Armstrong's "model splinters") turn out to be well-behaved in terms of these parameters. In addition to the open questions in the "summary" section, there is also a natural open question of extending these results to non-crisp infradistributions[2]. (I didn't mention it in the thesis because it requires too much additional context to motivate.) 1. ^ I use the word "imprecise" rather than "infra-Bayesian" in the title, because the proposed algorithm achieves a regret bound which is worst-case over the hypothesis class, so it's not "Bayesian" in any non-trivial sense. 2. ^ In particular, I suspect that there's a flavor of homogeneous ultradistributions for which the parameter S becomes unnecessary. Specifically, an affine ultradistribution can be thought of as the result of "take an affine subspace of the affine space of signed distributions, intersect it with the space of actual (positive) distributions, then take downwards closure into contributions to make it into a homogeneous ultradistribution". But we can also consider the alternative "take an affine subspace of the affine space of signed distributions, take downwards closure into signed contributions and then intersect it with the space of actual (positive) contributions". The order matters! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring Coherence of Policies in Toy Environments, published by dx26 on March 18, 2024 on LessWrong. This post was produced as part of the Astra Fellowship under the Winter 2024 Cohort, mentored by Richard Ngo. Thanks to Martín Soto, Jeremy Gillien, Daniel Kokotajlo, and Lukas Berglund for feedback. Summary Discussions around the likelihood and threat models of AI existential risk (x-risk) often hinge on some informal concept of a "coherent", goal-directed AGI in the future maximizing some utility function unaligned with human values. Whether and how coherence may develop in future AI systems, especially in the era of LLMs, has been a subject of considerable debate. In this post, we provide a preliminary mathematical definition of the coherence of a policy as how likely it is to have been sampled via uniform reward sampling (URS), or uniformly sampling a reward function and then sampling from the set of policies optimal for that reward function, versus uniform policy sampling (UPS). We provide extensions of the model for sub-optimality and for "simple" reward functions via uniform sparsity sampling (USS). We then build a classifier for the coherence of policies in small deterministic MDPs, and find that properties of the MDP and policy, like the number of self-loops that the policy takes, are predictive of coherence when used as features for the classifier. Moreover, coherent policies tend to preserve optionality, navigate toward high-reward areas of the MDP, and have other "agentic" properties. We hope that our metric can be iterated upon to achieve better definitions of coherence and a better understanding of what properties dangerous AIs will have. Introduction Much of the current discussion about AI x-risk centers around "agentic", goal-directed AIs having misaligned goals. For instance, one of the most dangerous possibilities being discussed is of mesa-optimizers developing within superhuman models, leading to scheming behavior and deceptive alignment. A significant proportion of current alignment work focuses on detecting, analyzing (e.g. via analogous case studies of model organisms), and possibly preventing deception. Some researchers in the field believe that intelligence and capabilities are inherently tied with "coherence", and thus any sufficiently capable AI will approximately be a coherent utility function maximizer. In their paper "Risks From Learned Optimization" formally introducing mesa-optimization and deceptive alignment, Evan Hubinger et al. discuss the plausibility of mesa-optimization occurring in RL-trained models. They analyze the possibility of a base optimizer, such as a hill-climbing local optimization algorithm like stochastic gradient descent, producing a mesa-optimizer model that internally does search (e.g. Monte Carlo tree search) in pursuit of a mesa-objective (in the real world, or in the "world-model" of the agent), which may or may not be aligned with human interests. This is in contrast to a model containing many complex heuristics that is not well-defined internally as a consequentialist mesa-optimizer; one extreme example is a tabular model/lookup table that matches observations to actions, which clearly does not do any internal search or have any consequentialist cognition. 
They speculate that mesa-optimizers may be selected for because they generalize better than other models and/or may be more compressible in an information-theoretic sense, and may thus be favored by the inductive biases of the training process. Other researchers believe that scheming and other mesa-optimizing behavior is implausible with the most common current ML architectures, and that the inductive bias argument and other arguments for getting misaligned mesa-optimizers by default (like the counting argument, which suggests that there are many more ...
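As a rough illustration of the URS-versus-UPS comparison described above (my own sketch, not the authors' code or their exact sampling procedure), the snippet below estimates how often a fixed deterministic policy arises under uniform reward sampling versus uniform policy sampling in a tiny hypothetical deterministic MDP. The transition table, discount factor, and probe policy are made up for illustration, and ties among optimal policies are ignored since they almost never occur with continuous rewards.

```python
# Illustrative sketch of the URS-vs-UPS idea: estimate P(policy | uniform reward sampling)
# by Monte Carlo and compare it to P(policy | uniform policy sampling).
import numpy as np

rng = np.random.default_rng(0)
N_STATES, GAMMA, N_SAMPLES = 4, 0.9, 5_000
# next_state[s][a]: deterministic transitions for a hypothetical 2-action toy MDP.
next_state = np.array([[0, 1], [1, 2], [2, 3], [3, 0]])

def optimal_policy(reward):
    """Greedy policy for a state-based reward, via value iteration."""
    V = np.zeros(N_STATES)
    for _ in range(200):
        V = reward + GAMMA * np.max(V[next_state], axis=1)
    return tuple(np.argmax(V[next_state], axis=1))

policy = (0, 0, 0, 0)  # the policy whose "coherence" we probe (takes the self-loop everywhere)

# URS: uniformly sample a reward function, take the optimal policy for it.
urs_hits = sum(optimal_policy(rng.uniform(size=N_STATES)) == policy for _ in range(N_SAMPLES))
# UPS: uniformly sample a deterministic policy (2 actions per state).
ups_prob = 1 / 2 ** N_STATES

print("P(policy | URS) ≈", urs_hits / N_SAMPLES, "  P(policy | UPS) =", ups_prob)
```

Under the post's definition, a policy's coherence would then be judged by comparing these two likelihoods; the authors' actual setup, sub-optimality extensions, features, and classifier are more involved than this sketch.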
We name our MVPs and MDPs (most disappointing players) for offense, defense, and special teams. We also discuss the new ST hire as well as the DC search. Enjoy!Advertising Inquiries: https://redcircle.com/brandsPrivacy & Opt-Out: https://redcircle.com/privacy
We are joined by Koen Holtman, an independent AI researcher focusing on AI safety. Koen is the Founder of Holtman Systems Research, a research company based in the Netherlands. Koen started the conversation with his take on an AI apocalypse in the coming years. He discussed the obedience problem with AI models and the safe form of obedience. Koen explained the concept of a Markov Decision Process (MDP) and how it is used to build machine learning models. Koen spoke about the problem of AGIs not allowing their utility function to be changed after the model is deployed, and shared an alternative approach to solving the problem. He shared how to safely engineer AGI systems now and in the future. He also spoke about how to implement safety layers on AI models. Koen discussed the ultimate goal of a safe AI system and how to check that an AI system is indeed safe. He discussed the intersection between large language models (LLMs) and MDPs. He shared the key ingredients for scaling current AI implementations.
In the latest episode of Product Team Success, host Ross Webb talks with Apple exec Jonathan Boice about AI and product team success. Here are three key takeaways from this insightful episode:
1. AI can amplify human behavior and experiences, but it can also disregard empathy and consideration for human impact if not used properly.
2. Businesses should aim for MDPs (minimal, delightful products) rather than MVPs (minimal viable products) to create a core product that is already good before adding extra features.
3. Balancing empathy and data-driven decision making is crucial for success in product management, and AI technology can help achieve this balance.
Listen now for more insights on AI and product team success from this thought-provoking episode.
Jonathan Boice is the founder of Digital Visionary, a consulting agency that assists businesses in navigating the world of artificial intelligence (AI) without sacrificing their human touch. Boice recognizes the growing apprehension and excitement surrounding AI and aims to help companies use it to create impactful products and improve people's lives. He prides himself on finding the middle ground between the autonomous business model and the traditional approach, ensuring that AI is used to supplement and not replace human workers. With a focus on the product space, Boice is dedicated to helping companies incorporate AI in a mindful and responsible way.
Patreon: https://www.patreon.com/mlst Discord: https://discord.gg/ESrGqhf5CB Twitter: https://twitter.com/MLStreetTalk In this exclusive interview, Dr. Tim Scarfe sits down with Minqi Jiang, a leading PhD student at University College London and Meta AI, as they delve into the fascinating world of deep reinforcement learning (RL) and its impact on technology, startups, and research. Discover how Minqi made the crucial decision to pursue a PhD in this exciting field, and learn from his valuable startup experiences and lessons. Minqi shares his insights into balancing serendipity and planning in life and research, and explains the role of objectives and Goodhart's Law in decision-making. Get ready to explore the depths of robustness in RL, two-player zero-sum games, and the differences between RL and supervised learning. As they discuss the role of environment in intelligence, emergence, and abstraction, prepare to be blown away by the possibilities of open-endedness and the intelligence explosion. Learn how language models generate their own training data, the limitations of RL, and the future of software 2.0 with interpretability concerns. From robotics and open-ended learning applications to learning potential metrics and MDPs, this interview is a goldmine of information for anyone interested in AI, RL, and the cutting edge of technology. Don't miss out on this incredible opportunity to learn from a rising star in the AI world! TOC Tech & Startup Background [00:00:00] Pursuing PhD in Deep RL [00:03:59] Startup Lessons [00:11:33] Serendipity vs Planning [00:12:30] Objectives & Decision Making [00:19:19] Minimax Regret & Uncertainty [00:22:57] Robustness in RL & Zero-Sum Games [00:26:14] RL vs Supervised Learning [00:34:04] Exploration & Intelligence [00:41:27] Environment, Emergence, Abstraction [00:46:31] Open-endedness & Intelligence Explosion [00:54:28] Language Models & Training Data [01:04:59] RLHF & Language Models [01:16:37] Creativity in Language Models [01:27:25] Limitations of RL [01:40:58] Software 2.0 & Interpretability [01:45:11] Language Models & Code Reliability [01:48:23] Robust Prioritized Level Replay [01:51:42] Open-ended Learning [01:55:57] Auto-curriculum & Deep RL [02:08:48] Robotics & Open-ended Learning [02:31:05] Learning Potential & MDPs [02:36:20] Universal Function Space [02:42:02] Goal-Directed Learning & Auto-Curricula [02:42:48] Advice & Closing Thoughts [02:44:47] References: - Why Greatness Cannot Be Planned: The Myth of the Objective by Kenneth O. 
Stanley and Joel Lehman https://www.springer.com/gp/book/9783319155234 - Rethinking Exploration: General Intelligence Requires Rethinking Exploration https://arxiv.org/abs/2106.06860 - The Case for Strong Emergence (Sabine Hossenfelder) https://arxiv.org/abs/2102.07740 - The Game of Life (Conway) https://www.conwaylife.com/ - Toolformer: Teaching Language Models to Generate APIs (Meta AI) https://arxiv.org/abs/2302.04761 - OpenAI's POET: Paired Open-Ended Trailblazer https://arxiv.org/abs/1901.01753 - Schmidhuber's Artificial Curiosity https://people.idsia.ch/~juergen/interest.html - Gödel Machines https://people.idsia.ch/~juergen/goedelmachine.html - PowerPlay https://arxiv.org/abs/1112.5309 - Robust Prioritized Level Replay: https://openreview.net/forum?id=NfZ6g2OmXEk - Unsupervised Environment Design: https://arxiv.org/abs/2012.02096 - Excel: Evolving Curriculum Learning for Deep Reinforcement Learning https://arxiv.org/abs/1901.05431 - Go-Explore: A New Approach for Hard-Exploration Problems https://arxiv.org/abs/1901.10995 - Learning with AMIGo: Adversarially Motivated Intrinsic Goals https://www.researchgate.net/publication/342377312_Learning_with_AMIGo_Adversarially_Motivated_Intrinsic_Goals PRML https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf Sutton and Barto https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: POWERplay: An open-source toolchain to study AI power-seeking, published by Edouard Harris on October 24, 2022 on The AI Alignment Forum. We're open-sourcing POWERplay, a research toolchain you can use to study power-seeking behavior in reinforcement learning agents. POWERplay was developed by Gladstone AI for internal research. POWERplay's main use is to estimate the instrumental value that a reinforcement learning agent can get from a state in an MDP. Its implementation is based on a definition of instrumental value (or "POWER") first proposed by Alex Turner et al. We've extended this definition to cover certain tractable multi-agent RL settings, and built an implementation behind a simple Python API. We've used POWERplay previously to obtain some suggestive early results in single-agent and multi-agent power-seeking. But we think there may be more low-hanging fruit to be found in this area. Beyond our own ideas about what to do next, we've also received some interesting conceptual questions in connection with this work. A major reason we're open-sourcing POWERplay is to lower the cost of converting these conceptual questions into real experiments with concrete outcomes, that can support or falsify our intuitions about instrumental convergence. Ramp-up We've designed POWERplay to make it as easy as possible for you to get started with it. Follow the installation and quickstart instructions to get moving quickly. Use the replication API to trivially reproduce any figure from any post in our instrumental convergence sequence. Design single-agent and multi-agent MDPs and policies, launch experiments on your local machine, and visualize results with clear figures and animations. POWERplay comes with "batteries included", meaning all the code samples in the documentation should just work out-of-the-box if it's been installed successfully. It also comes with pre-run examples of experimental results, so you can understand what "normal" output is supposed to look like. While this does make the repo weigh in at about 500 MB, it's worth the benefits of letting you immediately start playing around with visualizations on preexisting data. If we've done our job right, a smart and curious grad student (with a bit of Python experience) should be able to start reproducing our previous experiments within an hour, and to have some new — and hopefully interesting! — results within a week. We're looking forward to seeing what people do with this. If you have any questions or comments about POWERplay, feel free to reach out to Edouard at edouard@gladstone.ai. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
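POWERplay's actual API is not reproduced here, but the quantity it estimates can be sketched roughly as follows. This is a simplified, single-agent reading of the Turner et al. definition of instrumental value ("POWER") that the post cites, with a hypothetical toy MDP, an arbitrary discount factor, and i.i.d. uniform state rewards standing in for the reward distribution; the multi-agent extensions mentioned above are not shown.

```python
# Rough sketch (not POWERplay's API): estimate a state's instrumental value ("POWER") as the
# average, over sampled reward functions, of the discounted optimal value attainable from it.
import numpy as np

rng = np.random.default_rng(0)
GAMMA, N_SAMPLES = 0.9, 2_000
# Hypothetical 3-state deterministic MDP: next_state[s][a] gives the successor.
next_state = np.array([[0, 1], [1, 2], [2, 2]])  # state 2 is absorbing
n_states = next_state.shape[0]

def optimal_values(reward):
    """V*(s) for a state-based reward, via value iteration."""
    V = np.zeros(n_states)
    for _ in range(300):
        V = reward + GAMMA * np.max(V[next_state], axis=1)
    return V

# Following the shape of the Turner et al. definition: average of (1-γ)/γ · (V*(s) - R(s))
# over reward functions sampled from the chosen distribution.
samples = []
for _ in range(N_SAMPLES):
    r = rng.uniform(size=n_states)
    samples.append((1 - GAMMA) / GAMMA * (optimal_values(r) - r))
power = np.mean(samples, axis=0)
print("estimated POWER per state:", power.round(3))  # expect state 0 > state 1 > absorbing state 2
```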
17 - MDPs & Value/Policy Iteration
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Problems in Negative Side Effect Minimization, published by Fabian Schimpf on May 6, 2022 on The AI Alignment Forum. Acknowledgments We want to thank Stuart Armstrong, Remmelt Ellen, David Lindner, Michal Pokorny, Achyuta Rajaram, Adam Shimi, and Alex Turner for helpful discussions and valuable feedback on earlier drafts of this post. Fabian Schimpf and Lukas Fluri are part of this year's edition of the AI Safety Camp. Our gratitude goes to the camp organizers: Remmelt Ellen, Sai Joseph, Adam Shimi, and Kristi Uustalu. TLDR; Negative side effects are one class of threats that misaligned AGIs pose to humanity. Many different approaches have been proposed to mitigate or prevent AI systems from having negative side effects. In this post, we present three requirements that a side-effect minimization method (SEM) should fulfill to be applied in the real world and argue that current methods do not yet satisfy these requirements. We also propose future work that could help to solve these requirements. Introduction Avoiding negative side-effects of agents acting in environments has been a core problem in AI safety since the field started to be formalized. Therefore, as part of our AI safety camp project, we took a closer look at state-of-the-art approaches like AUP and Relative Reachability. After months of discussions, we realized that we were confused about how these (and similar methods) could be used to solve problems we care about outside the scope of the typical grid-world environments. We formalized these discussions into distinct desiderata that we believe are currently not sufficiently addressed and, in part, maybe even overlooked. This post attempts to summarize these points and provide structured arguments to support our critique. Of course, we expect to be partially wrong about this, as we updated our beliefs even while writing up this post. We welcome any feedback or additional input to this post. The sections after the summary table and anticipated questions contain our reasoning for the selected open problems and do not need to be read in order. Background The following paragraphs make heavy use of the following terms and side-effect minimization methods (SEMs). For a more detailed explanation we refer to the provided links. MDP: A Markov Decision Process is a 5-tuple ⟨S, A, T, R, γ⟩. In the setting of side-effect minimization, the goal generally is to maximize the cumulative reward without causing (negative) side-effects. RR: In its simplest form, Stepwise Relative Reachability is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function R with the composition r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) − λ·d_RR(s_{t+1}, s′_{t+1}), where d_RR(s_{t+1}, s′_{t+1}) = (1/|S|) ∑_{s∈S} max(R(s′_{t+1}; s) − R(s_{t+1}; s), 0) is a deviation measure punishing the agent if the average "reachability" of all states of the MDP has been decreased by taking action a_t compared to taking a baseline action a_nop (like doing nothing). The idea is that side-effects reduce the reachability of certain states (i.e. breaking a vase makes all states that require an intact vase unreachable) and punishing such a decrease in reachability hence also punishes the agent for side-effects.
AUP: Attainable Utility Preservation (see also here and here) is an SEM, acting in MDPs, which tries to avoid side-effects by replacing the old reward function R with the composition r(s_t, a_t, s_{t+1}) = R(s_t, a_t, s_{t+1}) − λ·d_AUP(s_t, a_t, s_{t+1}), where d_AUP(s_t, a_t, s_{t+1}) = (1/N) ∑_{i=1}^{N} |Q_{R_i}(s_t, a_t, s_{t+1}) − Q_{R_i}(s_t, a_nop, s′_{t+1})| is a normalized deviation measure punishing the agent if its ability to maximize any of its provided auxiliary reward functions R_i ∈ R changes by taking action a_t compared to taking a baseline action a_nop (like doing nothing). The idea is that the true (side-effect free) reward function (which is very hard to specify) is correlated with many ...
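For concreteness, here is a minimal sketch (mine, not the authors' implementations) of the AUP-style reward composition described above. It assumes the auxiliary Q-functions are already available and, for simplicity, treats them as functions of (state, action) only, since in a deterministic MDP the successor is implied by that pair; all names and the toy stand-ins in the usage lines are hypothetical.

```python
# Minimal illustration of the AUP-style composition: the shaped reward subtracts a penalty
# proportional to how much the chosen action changes the agent's ability to optimize each
# auxiliary reward, relative to a no-op action.
from typing import Callable, Sequence

def aup_reward(
    R: Callable[[int, int, int], float],           # task reward R(s_t, a_t, s_{t+1})
    aux_q: Sequence[Callable[[int, int], float]],  # Q_{R_i}(s, a), assumed precomputed
    lam: float,
    a_noop: int,
) -> Callable[[int, int, int], float]:
    def r(s_t: int, a_t: int, s_next: int) -> float:
        penalty = sum(abs(q(s_t, a_t) - q(s_t, a_noop)) for q in aux_q) / len(aux_q)
        return R(s_t, a_t, s_next) - lam * penalty
    return r

# Usage sketch with toy stand-ins (not real environments or learned Q-functions):
task_reward = lambda s, a, s2: 1.0 if s2 == 3 else 0.0
aux_qs = [lambda s, a: float(a != 0)]  # pretend acting (a != 0) shifts attainable utility by 1
shaped = aup_reward(task_reward, aux_qs, lam=0.5, a_noop=0)
print(shaped(0, 1, 3))  # 1.0 - 0.5 * 1.0 = 0.5
```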
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Seeking Power is Often Convergently Instrumental in MDPs, published by TurnTrout and elriggs on LessWrong. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for the accompanying paper. In 2008, Steve Omohundro's foundational paper The Basic AI Drives conjectured that superintelligent goal-directed AIs might be incentivized to gain significant amounts of power in order to better achieve their goals. Omohundro's conjecture bears out in toy models, and the supporting philosophical arguments are intuitive. In 2019, the conjecture was even debated by well-known AI researchers. Power-seeking behavior has been heuristically understood as an anticipated risk, but not as a formal phenomenon with a well-understood cause. The goal of this post (and the accompanying paper, Optimal Policies Tend to Seek Power) is to change that. Motivation It's 2008, the ancient wild west of AI alignment. A few people have started thinking about questions like "if we gave an AI a utility function over world states, and it actually maximized that utility... what would it do?" In particular, you might notice that wildly different utility functions seem to encourage similar strategies.
Resist shutdown? Gain computational resources? Prevent modification of utility function?
Paperclip utility: ✔️ ✔️ ✔️
Blue webcam pixel utility: ✔️ ✔️ ✔️
People-look-happy utility: ✔️ ✔️ ✔️
These strategies are unrelated to terminal preferences: the above utility functions do not award utility to e.g. resource gain in and of itself. Instead, these strategies are instrumental: they help the agent optimize its terminal utility. In particular, a wide range of utility functions incentivize these instrumental strategies. These strategies seem to be convergently instrumental. But why? I'm going to informally explain a formal theory which makes significant progress in answering this question. I don't want this post to be Optimal Policies Tend to Seek Power with cuter illustrations, so please refer to the paper for the math. You can read the two concurrently. We can formalize questions like "do 'most' utility maximizers resist shutdown?" as "Given some prior beliefs about the agent's utility function, knowledge of the environment, and the fact that the agent acts optimally, with what probability do we expect it to be optimal to avoid shutdown?" The table's convergently instrumental strategies are about maintaining, gaining, and exercising power over the future, in some sense. Therefore, this post will help answer: What does it mean for an agent to "seek power"? In what situations should we expect seeking power to be more probable under optimality than not seeking power? This post won't tell you when you should seek power for your own goals; this post illustrates a regularity in optimal action across different goals one might pursue. Formalizing Convergent Instrumental Goals suggests that the vast majority of utility functions incentivize the agent to exert a lot of control over the future, assuming that these utility functions depend on "resources." This is a big assumption: what are "resources", and why must the AI's utility function depend on them? We drop this assumption, assuming only unstructured reward functions over a finite Markov decision process (MDP), and show from first principles how power-seeking can often be optimal.
Formalizing the Environment My theorems apply to finite MDPs; for the unfamiliar, I'll illustrate with Pac-Man. Full observability: You can see everything that's going on; this information is packaged in the state s. In Pac-Man, the state is the game screen. Markov transition function: the next state depends only on the choice of action a and the current state s. It doesn't matter how we got into a situation. Discounted reward: future rewards get geometrically discoun...
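As a toy illustration of the kind of question being formalized here (not the paper's theorems, environments, or code), the sketch below estimates, for uniformly sampled reward functions over a small hypothetical deterministic MDP, how often the optimal first action avoids an absorbing "shutdown" state that destroys all future options. The layout and numbers are made up for illustration.

```python
# Illustrative sketch: for uniformly sampled reward functions, estimate how often the optimal
# first action avoids a "shutdown" state (an absorbing state with no remaining options).
import numpy as np

rng = np.random.default_rng(0)
GAMMA, N_SAMPLES = 0.9, 5_000
# Hypothetical layout: from state 0, action 0 goes to shutdown (state 1, absorbing);
# action 1 goes to a hub (state 2) from which states 2 and 3 both remain reachable.
next_state = np.array([[1, 2], [1, 1], [2, 3], [3, 2]])
n_states = next_state.shape[0]

def greedy_first_action(reward):
    V = np.zeros(n_states)
    for _ in range(300):                      # value iteration
        V = reward + GAMMA * np.max(V[next_state], axis=1)
    return int(np.argmax(V[next_state[0]]))   # best action from the start state

avoids = sum(greedy_first_action(rng.uniform(size=n_states)) == 1 for _ in range(N_SAMPLES))
print("fraction of sampled rewards for which avoiding shutdown is optimal:", avoids / N_SAMPLES)
```

In this toy layout the estimate comes out above one half, which is the intuition the post develops formally: preserving options is optimal for "most" reward functions.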
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Seeking Power is Often Convergently Instrumental in MDPs, published by Paul Christiano on the AI Alignment Forum. (Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.) I really don't want my AI to strategically deceive me and resist my attempts to correct its behavior. Let's call an AI that does so egregiously misaligned (for the purpose of this post). Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up? But I feel like we should be able to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I'm interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can't tell any plausible story about how they lead to egregious misalignment. This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it's possible, there are several ways in which it could actually be easier: We can potentially iterate much faster, since it's often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice. We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases. We can find algorithms that have a good chance of working in the future even if we don't know what AI will look like or how quickly it will advance, since we've been thinking about a very wide range of possible failure cases. I'd guess there's a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can't come up with a plausible story about how it leads to egregious misalignment. That's a high enough probability that I'm very excited to gamble on it. Moreover, if it fails I think we're likely to identify some possible "hard cases" for alignment — simple situations where egregious misalignment feels inevitable. What this looks like (3 examples) My research basically involves alternating between "think of a plausible alignment algorithm" and "think of a plausible story about how it fails." Example 1: human feedback In an unaligned benchmark I describe a simple AI training algorithm: Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions. We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations. At test time the AI searches for plans that lead to trajectories that look good to humans. In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment: Our generative model understands reality better than human evaluators. There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans. It's possible to use that influence to "hack" the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos.
The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction. I don't know if or when this kind of reward hacking would happen — I think it's pretty likely eventually, but it's far from certain and it might take a long time. But from my perspective this failure mode is at least plausible — I don't see any contradictions between ...
Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS Collaborating with Humans without Human Data (DJ Strouse et al) (summarized by Rohin): We've previously seen that if you want to collaborate with humans in the video game Overcooked, it helps to train a deep RL agent against a human model (AN #70), so that the agent "expects" to be playing against humans (rather than e.g. copies of itself, as in self-play). We might call this a "human-aware" model. However, since a human-aware model must be trained against a model that imitates human gameplay, we need to collect human gameplay data for training. Could we instead train an agent that is robust enough to play with lots of different agents, including humans as a special case? This paper shows that this can be done with Fictitious Co-Play (FCP), in which we train our final agent against a population of self-play agents and their past checkpoints taken throughout training. Such agents get significantly higher rewards when collaborating with humans in Overcooked (relative to the human-aware approach in the previously linked paper). In their ablations, the authors find that it is particularly important to include past checkpoints in the population against which you train. They also test whether it helps to have the self-play agents have a variety of architectures, and find that it mostly does not make a difference (as long as you are using past checkpoints as well). Read more: Related paper: Maximum Entropy Population Based Training for Zero-Shot Human-AI Coordination Rohin's opinion: You could imagine two different philosophies on how to build AI systems -- the first option is to train them on the actual task of interest (for Overcooked, training agents to play against humans or human models), while the second option is to train a more robust agent on some more general task, that hopefully includes the actual task within it (the approach in this paper). Besides Overcooked, another example would be supervised learning on some natural language task (the first philosophy), as compared to pretraining on the Internet GPT-style and then prompting the model to solve your task of interest (the second philosophy). In some sense the quest for a single unified AGI system is itself a bet on the second philosophy -- first you build your AGI that can do all tasks, and then you point it at the specific task you want to do now. Historically, I think AI has focused primarily on the first philosophy, but recent years have shown the power of the second philosophy. However, I don't think the question is settled yet: one issue with the second philosophy is that it is often difficult to fully "aim" your system at the true task of interest, and as a result it doesn't perform as well as it "could have". In Overcooked, the FCP agents will not learn specific quirks of human gameplay that could be exploited to improve efficiency (which the human-aware agent could do, at least in theory). In natural language, even if you prompt GPT-3 appropriately, there's still some chance it ends up rambling about something else entirely, or neglects to mention some information that it "knows" but that a human on the Internet would not have said. (See also this post (AN #141).)
I should note that you can also have a hybrid approach, where you start by training a large model with the second philosophy, and then you finetune it on your task of interest as in the first philosophy, gaining the benefits of both. I'm generally interested in which approach will build more useful agents, as this seems quite relevant to forecasting the future of AI (which in turn affects lots of things including AI alignment plans). TECHNICAL AI ALIGNMENT LEARNING HUMAN INTENT Inverse Decision Modeling: Learning Interpretable Representations of Behavior (Daniel Jarrett, Alihan Hüyük et al) (summarized by Rohin): There's lots of work on learning preferences from demonstrations, which varies in how much structure they assume on the demonstrator: for example, we might consider them to be Boltzmann rational (AN #12) or risk sensitive, or we could try to learn their biases (AN #59). This paper proposes a framework to encompass all of these choices: the core idea is to model the demonstrator as choosing actions according to a planner; some parameters of this planner are fixed in advance to provide an assumption on the structure of the planner, while others are learned from data. This also allows them to separate beliefs, decision-making, and rewards, so that different structures can be imposed on each of them individually. The paper provides a mathematical treatment of both the forward problem (how to compute actions in the planner given the reward, think of algorithms like value iteration) and the backward problem (how to compute the reward given demonstrations, the typical inverse reinforcement learning setting). They demonstrate the framework on a medical dataset, where they introduce a planner with parameters for flexibility of decision-making, optimism of beliefs, and adaptivity of beliefs. In this case they specify the desired reward function and then run backward inference to conclude that, with respect to this reward function, clinicians appear to be significantly less optimistic when diagnosing dementia in female and elderly patients. Rohin's opinion: One thing to note about this paper is that it is an incredible work of scholarship; it fluently cites research across a variety of disciplines including AI safety, and provides a useful organizing framework for many such papers. If you need to do a literature review on inverse reinforcement learning, this paper is a good place to start. Human irrationality: both bad and good for reward inference (Lawrence Chan et al) (summarized by Rohin): Last summary, we saw a framework for inverse reinforcement learning with suboptimal demonstrators. This paper instead investigates the qualitative effects of performing inverse reinforcement learning with a suboptimal demonstrator. The authors modify different parts of the Bellman equation in order to create a suite of possible suboptimal demonstrators to study. They run experiments with exact inference on random MDPs and FrozenLake, and with approximate inference on a simple autonomous driving environment, and conclude: 1. Irrationalities can be helpful for reward inference, that is, if you infer a reward from demonstrations by an irrational demonstrator (where you know the irrationality), you often learn more about the reward than if you inferred a reward from optimal demonstrations (where you know they are optimal). 
Conceptually, this happens because optimal demonstrations only tell you about what the best behavior is, whereas most kinds of irrationality can also tell you about preferences between suboptimal behaviors. 2. If you fail to model irrationality, your performance can be very bad, that is, if you infer a reward from demonstrations by an irrational demonstrator, but you assume that the demonstrator was Boltzmann rational, you can perform quite badly. Rohin's opinion: One way this paper differs from my intuitions is that it finds that assuming Boltzmann rationality performs very poorly if the demonstrator is in fact systematically suboptimal. I would have instead guessed that Boltzmann rationality would do okay -- not as well as in the case where there is no misspecification, but only a little worse than that. (That's what I found in my paper (AN #59), and it makes intuitive sense to me.) Some hypotheses for what's going on, which the lead author agrees are at least part of the story: 1. When assuming Boltzmann rationality, you infer a distribution over reward functions that is "close" to the correct one in terms of incentivizing the right behavior, but differs in rewards assigned to suboptimal behavior. In this case, you might get a very bad log loss (the metric used in this paper), but still have a reasonable policy that is decent at acquiring true reward (the metric used in my paper). 2. The environments we're using may differ in some important way (for example, in the environment in my paper, it is primarily important to identify the goal, which might be much easier to do than inferring the right behavior or reward in the autonomous driving environment used in this paper). FORECASTING Forecasting progress in language models (Matthew Barnett) (summarized by Sudhanshu): This post aims to forecast when a "human-level language model" may be created. To build up to this, the author swiftly covers basic concepts from information theory and natural language processing such as entropy, N-gram models, modern LMs, and perplexity. Data on the perplexity achieved by recent state-of-the-art models is collected and used to estimate - by linear regression - when we can expect to see future models score below certain entropy levels, approaching the hypothesised entropy of the English language. These predictions range across the next 15 years, depending on which dataset, method, and entropy level is being solved for; there's an attached Python notebook with these details for curious readers to further investigate. Preemptively disjunctive, the author concludes "either current trends will break down soon, or human-level language models will likely arrive in the next decade or two." Sudhanshu's opinion: This quick read provides a natural, accessible analysis stemming from recent results, while staying self-aware (and informing readers) of potential improvements. The comments section too includes some interesting debates, e.g. about the Goodhart-ability of the Perplexity metric. I personally felt these estimates were broadly in line with my own intuitions. I would go so far as to say that with the confluence of improved generation capabilities across text, speech/audio, video, as well as multimodal consistency and integration, virtually any kind of content we see ~10 years from now will be algorithmically generated and indistinguishable from the work of human professionals.
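As a sketch of the extrapolation method described in that summary (not the post's actual dataset, model list, or entropy estimate), one can fit a linear trend to reported entropy numbers over time and solve for the date at which the fit crosses a target level. All numbers below are made-up placeholders.

```python
# Minimal sketch of the extrapolation described above: fit a linear trend to (year, entropy)
# points and solve for when the fit crosses a target entropy. The numbers are placeholders,
# NOT the post's dataset or its estimate of the entropy of English.
import numpy as np

years = np.array([2017.0, 2018.5, 2019.5, 2020.5, 2021.5])
bits_per_char = np.array([1.18, 1.06, 0.98, 0.94, 0.90])    # placeholder benchmark numbers
target = 0.7                                                # placeholder "human-level" entropy

slope, intercept = np.polyfit(years, bits_per_char, deg=1)  # least-squares linear regression
crossing_year = (target - intercept) / slope
print(f"trend: {slope:.3f} bits/char per year; crosses {target} bits/char around {crossing_year:.0f}")
```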
Rohin's opinion: I would generally adopt forecasts produced by this sort of method as my own, perhaps making them a bit longer as I expect the quickly growing compute trend to slow down. Note however that this is a forecast for human-level language models, not transformative AI; I would expect these to be quite different and would predict that transformative AI comes significantly later. MISCELLANEOUS (ALIGNMENT) Rohin Shah on the State of AGI Safety Research in 2021 (Lucas Perry and Rohin Shah) (summarized by Rohin): As in previous years (AN #54), on this FLI podcast I talk about the state of the field. Relative to previous years, this podcast is a bit more introductory, and focuses a bit more on what I find interesting rather than what the field as a whole would consider interesting. Read more: Transcript NEAR-TERM CONCERNS RECOMMENDER SYSTEMS User Tampering in Reinforcement Learning Recommender Systems (Charles Evans et al) (summarized by Zach): Large-scale recommender systems have emerged as a way to filter through large pools of content to identify and recommend content to users. However, these advances have led to social and ethical concerns over the use of recommender systems in applications. This paper focuses on the potential for social manipulability and polarization from the use of RL-based recommender systems. In particular, they present evidence that such recommender systems have an instrumental goal to engage in user tampering by polarizing users early on in an attempt to make later predictions easier. To formalize the problem the authors introduce a causal model. Essentially, they note that predicting user preferences requires an exogenous variable, a non-observable variable, that models click-through rates. They then introduce a notion of instrumental goal that models the general behavior of RL-based algorithms over a set of potential tasks. The authors argue that such algorithms will have an instrumental goal to influence the exogenous/preference variables whenever user opinions are malleable. This ultimately introduces a risk for preference manipulation. The author's hypothesis is tested using a simple media recommendation problem. They model the exogenous variable as either leftist, centrist, or right-wing. User preferences are malleable in the sense that a user shown content from an opposing side will polarize their initial preferences. In experiments, the authors show that a standard Q-learning algorithm will learn to tamper with user preferences which increases polarization in both leftist and right-wing populations. Moreover, even though the agent makes use of tampering it fails to outperform a crude baseline policy that avoids tampering. Zach's opinion: This article is interesting because it formalizes and experimentally demonstrates an intuitive concern many have regarding recommender systems. I also found the formalization of instrumental goals to be of independent interest. The most surprising result was that the agents who exploit tampering are not particularly more effective than policies that avoid tampering. This suggests that the instrumental incentive is not really pointing at what is actually optimal which I found to be an illuminating distinction. NEWS OpenAI hiring Software Engineer, Alignment (summarized by Rohin): Exactly what it sounds like: OpenAI is hiring a software engineer to work with the Alignment team. 
BERI hiring ML Software Engineer (Sawyer Bernath) (summarized by Rohin): BERI is hiring a remote ML engineer as part of their collaboration with the Autonomous Learning Lab at UMass Amherst. The goal is to create a software library that enables easy deployment of the ALL's Seldonian algorithm framework for safe and aligned AI.

AI Safety Needs Great Engineers (Andy Jones) (summarized by Rohin): If the previous two roles weren't enough to convince you, this post explicitly argues that a lot of AI safety work is bottlenecked on good engineers, and encourages people to apply to such roles.

AI Safety Camp Virtual 2022 (summarized by Rohin): Applications are open for this remote research program, where people from various disciplines come together to research an open problem under the mentorship of an established AI alignment researcher. The deadline to apply is December 1st.

Political Economy of Reinforcement Learning schedule (summarized by Rohin): The date for the PERLS workshop (AN #159) at NeurIPS has been set for December 14, and the schedule and speaker list are now available on the website.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles (http://robertskmiles.com). Subscribe here:
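Referring back to the user tampering summary above, here is a minimal toy sketch (not the paper's actual environment or causal model) of how a tabular Q-learning recommender interacting with a single malleable user can be set up. All dynamics and numbers are invented for illustration.

```python
import random

LEANINGS = range(-2, 3)   # -2 = far left ... +2 = far right (user's current lean)
ACTIONS = (-1, 0, +1)     # recommend left-wing / centrist / right-wing content
EPISODE_LEN = 10

def click_prob(leaning, action):
    # Invented dynamics: users click matching slanted content more reliably
    # the more polarized they are, and mismatched content less reliably.
    if action == 0:
        return 0.5
    if leaning * action > 0:
        return 0.5 + 0.1 * abs(leaning)
    return max(0.05, 0.4 - 0.1 * abs(leaning))

def step(leaning, action):
    # Malleable preferences: slanted content sometimes nudges the user toward it.
    if action != 0 and random.random() < 0.3:
        leaning_next = max(-2, min(2, leaning + action))
    else:
        leaning_next = leaning
    reward = 1.0 if random.random() < click_prob(leaning, action) else 0.0
    return leaning_next, reward

# Tabular Q-learning over (leaning, timestep, action).
Q = {(s, t, a): 0.0 for s in LEANINGS for t in range(EPISODE_LEN) for a in ACTIONS}
alpha, eps = 0.1, 0.1

for _ in range(20_000):
    s = 0   # each episode starts with a centrist user
    for t in range(EPISODE_LEN):
        a = random.choice(ACTIONS) if random.random() < eps else max(ACTIONS, key=lambda x: Q[(s, t, x)])
        s2, r = step(s, a)
        best_next = 0.0 if t == EPISODE_LEN - 1 else max(Q[(s2, t + 1, x)] for x in ACTIONS)
        Q[(s, t, a)] += alpha * (r + best_next - Q[(s, t, a)])
        s = s2

# Q-values for the very first recommendation to a centrist user: if the slanted
# actions dominate the centrist one, the learned policy is tampering.
print({a: round(Q[(0, 0, a)], 2) for a in ACTIONS})
```

Comparing the learned Q-values at the initial centrist state gives a crude check of whether the policy prefers to push the user toward an extreme before serving matching content.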
Vikas Srivastava, Chief Revenue Officer, Integral. Integral recently surveyed 94 heads of FX trading and senior FX managers, reviewing the global effects of the Covid-19 pandemic on the FX business. Among the survey's findings, 28% of respondents said they will operate their FX technology entirely in the cloud within the next five years, compared with just 2% at present. On distribution channels, 46% said that multi-dealer platforms (MDPs) will see the biggest rise in trading over the next 12 months. Robin Amlôt of IBS Intelligence speaks to Vikas Srivastava, Chief Revenue Officer of Integral, about the tech trends in FX trading.
If you’re thinking about buying a Kia Seltos - here’s everything you need to know. You can help support the podcast, securely via PayPal:

As usual, Kia conscripted its on-call dynamics wizard to do his mad, Jedi voodoo and turn the conventional vomit-spec South Korean suspension into what is actually an outstanding platform to drive on our preposterously crap ‘Strayan roads. The drive program on the launch was mainly on the B and C roads around Noosa, and I’d have to say the body control and steering feedback are excellent. So, big tick there. There was probably 90 minutes of freeway driving as well - it’s quiet and composed at 110.

Interestingly enough, this vehicle has a next-generation motor driven power steering assistance system. That means an electrical servo motor provides the steering assistance: it detects input from you, and a computer tells it how much to help. That’s when you’re turning in. But when you’re on the way out of a bend, MDPS typically defaults to ‘off’ and the self-centring steering effect you feel (if any) is just mechanical control feedback. In this system, though, the motor also provides self-centring feedback assistance. It’s really excellent.

Here’s the range. You get S, Sport, Sport+ and GT-Line, in order of increasing appeal and price. The 2.0-litre CVT is available only on S and Sport; the 1.6 turbo only on GT-Line. But you can have either engine in Sport+. So the fuel economy powertrain is available in the first three variants, the performance powertrain on the top two, and they overlap at Sport+.

Here’s how you tell the four variants apart like an automotive ninja. (This is gunna help at the dealership when they jam one under your snout for a test drive - if you know this, you cannot be bullshat to about which one you’re driving. And before you say it in the comments: ‘bullshat’ is the past participle of the verb ‘to bullshit’.) The poverty S model rolls on steel wheels. That’s dead easy to spot. If you’re looking at a Seltos with alloy wheels and a folding key, it’s a Sport. If it’s got 17-inch alloys and a pushbutton start, it’s a Sport+, and if it’s got 18-inch alloys (with a bright red highlight around the hub) and a head-up display, it’s a GT-Line. There’s more safety gear on Sport+ and GT-Line, but you can get that on S and Sport for $1000 as an option.

So, I’m not going to bore you with the spec sheet - but the salient observations arising from it are these. S is a real poverty pack: anything that can be removed to cut costs basically has been, and this is done primarily to appease the great cheapskates of the automotive universe, fleet managers. It’s a big step - $3500 - to go from S to Sport, but it’s well worth it for a private owner. You get alloys, a full-sized spare, the big centre infotainment screen, SUNA live traffic and 10 years of free mapcare updates (and, I’m assured, there are no strings attached to that - you just get the updates when they’re available). Sport+ is the pick of the range, because you get adaptive cruise and the better safety gear standard, plus front parking sensors, a nicer interior and a proximity key. And it’s $5500 cheaper than GT-Line, which is loaded with all the nice toys, certainly, but do you really need all that stuff? Probably not.

I’d strongly suggest you buy the 1.6 turbo if sporty, engaging driving matters to you. The CVT that goes with the 2.0-litre is a little bit frustrating for enthusiastic driving: it displays a noticeable re-engagement lag when you get back on the gas after clipping an apex and want to feed the power in smoothly.
My guest today is Kanaad, a friend of mine from Berkeley and a Cavs and RL enthusiast. We talk about a fascinating paper from the 2018 Sloan Sports Analytics Conference. The main idea is that by modeling each possession as a Markov Decision Process, it becomes easy to build a simulator that can help us answer questions like "How much more efficient would a team be if they took 20% fewer midrange shots early in the shot clock?" The paper: http://www.lukebornn.com/papers/sandholtz_ssac_2018.pdf
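To give a flavour of the possession-as-MDP idea, here is a minimal toy sketch (with invented transition and shooting numbers, not the paper's fitted model): states are (shot-clock bucket, court region), at each step the offence either shoots or moves the ball, and scaling down the midrange shot rate early in the clock lets you compare points per possession by Monte Carlo simulation.

```python
import random

REGIONS = ("rim", "midrange", "three")
EXPECTED_POINTS = {"rim": 1.25, "midrange": 0.80, "three": 1.05}   # points per attempt
SHOOT_PROB = {"rim": 0.45, "midrange": 0.30, "three": 0.35}        # chance of shooting per state visit
CLOCK_BUCKETS = 4   # 4 = early in the shot clock ... 1 = late

def simulate_possession(midrange_early_scale=1.0):
    region = random.choice(REGIONS)
    for clock in range(CLOCK_BUCKETS, 0, -1):
        p_shoot = SHOOT_PROB[region]
        if region == "midrange" and clock >= 3:
            p_shoot *= midrange_early_scale       # the policy intervention being tested
        if clock == 1 or random.random() < p_shoot:
            return EXPECTED_POINTS[region]        # shooting ends the possession
        region = random.choice(REGIONS)           # otherwise the ball moves to a new state
    return 0.0

def points_per_possession(scale, n=200_000):
    return sum(simulate_possession(scale) for _ in range(n)) / n

print("baseline policy     :", round(points_per_possession(1.0), 3))
print("20% fewer early mids:", round(points_per_possession(0.8), 3))
```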
In a previous episode, we discussed Markov Decision Processes or MDPs, a framework for decision making and planning. This episode explores a generalization, Partially Observable MDPs (POMDPs), an incredibly general framework that describes almost any agent-based system.
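As a minimal illustration of what the "partially observable" part adds, here is a sketch of the Bayesian belief update a POMDP agent maintains in place of direct access to the state. The two-state, single-action numbers are illustrative only, not from the episode.

```python
STATES = ("left", "right")

# T[s][a][s']: transition model -- a single "listen" action that leaves the state unchanged.
T = {s: {"listen": {s2: 1.0 if s2 == s else 0.0 for s2 in STATES}} for s in STATES}

# O[a][s'][o]: observation model -- listening reports the correct side 85% of the time.
O = {"listen": {"left":  {"hear_left": 0.85, "hear_right": 0.15},
                "right": {"hear_left": 0.15, "hear_right": 0.85}}}

def update_belief(belief, action, observation):
    """Bayes' rule: predict the next state, then reweight by the observation likelihood."""
    new_belief = {}
    for s2 in STATES:
        predicted = sum(belief[s] * T[s][action][s2] for s in STATES)
        new_belief[s2] = O[action][s2][observation] * predicted
    total = sum(new_belief.values())
    return {s: p / total for s, p in new_belief.items()}

belief = {"left": 0.5, "right": 0.5}
belief = update_belief(belief, "listen", "hear_left")
print(belief)   # {'left': 0.85, 'right': 0.15} -- the belief shifts toward "left"
```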
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses the topic of reinforcement learning, focusing particularly on continuous state MDPs and discretization.
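A minimal sketch of the discretization idea covered in this lecture: carve each continuous state dimension into buckets and map a continuous state to a single grid-cell index, so tabular MDP methods can be applied. The pendulum-style bounds and bucket counts below are assumptions for illustration.

```python
import numpy as np

BOUNDS = [(-np.pi, np.pi),   # angle, in radians
          (-8.0, 8.0)]       # angular velocity
BUCKETS = [15, 15]           # number of buckets per dimension

def discretize(state):
    """Map a continuous state to a single integer cell index in a 15x15 grid."""
    idx = 0
    for x, (lo, hi), n in zip(state, BOUNDS, BUCKETS):
        b = int((x - lo) / (hi - lo) * n)
        b = min(max(b, 0), n - 1)        # clip states that fall outside the bounds
        idx = idx * n + b
    return idx

print(discretize([0.1, -2.5]))   # some cell index in {0, ..., 15*15 - 1}
```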
Lecture by Professor Andrew Ng for Machine Learning (CS 229) in the Stanford Computer Science department. Professor Ng discusses the topic of reinforcement learning, focusing particularly on MDPs, value functions, and policy and value iteration.
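And a minimal sketch of value iteration itself, on a tiny finite MDP with made-up transitions and rewards, showing the Bellman optimality backup this lecture covers.

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9

# P[a, s, s']: transition probabilities; R[s, a]: expected immediate reward (made up).
P = np.array([[[0.8, 0.2, 0.0],
               [0.1, 0.8, 0.1],
               [0.0, 0.2, 0.8]],
              [[0.5, 0.5, 0.0],
               [0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 0.5],
              [1.0, 0.0]])

V = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum("asx,x->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)       # greedy policy with respect to the converged values
print("V*:", np.round(V, 3), " policy:", policy)
```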