Podcasts about iterated

  • 22 PODCASTS
  • 73 EPISODES
  • 27m AVG DURATION
  • ? INFREQUENT EPISODES
  • Feb 11, 2025 LATEST

POPULARITY (trend chart, 2017-2024)



Latest podcast episodes about iterated

Love Music More (with Scoobert Doobert)
From Glam to Hair Metal to MTV - The History of Rock Music (Part 8)

Feb 11, 2025 · 27:40


Why did rock hair get so big? What was the secret behind Eddie Van Halen's guitar technique? And why does Scoobert Doobert like performing on the Sunset Strip? Let's take a journey from David Bowie to Guns N' Roses to find out. (Check out the first, second, third, fourth, fifth, sixth, and seventh podcast episodes and/or parts one, two, three, four, five, six, seven, eight, nine, ten, eleven, and twelve of the accompanying Substack posts, with music examples!) For 30% off your first year with DistroKid to share your music with the world, click DistroKid.com/vip/lovemusicmore. Want to hear my music? For all things links, visit ScoobertDoobert.pizza. Subscribe to this pod's blog on Substack to receive deeper dives on the regular.

Where It Happens
How to build a $400k/mo AI App using ASO, Organic Growth, and Paid Ads

Oct 9, 2024 · 48:14


Join me as I chat with Romain Torres, Co-Founder of Arcads AI, as we discuss his frameworks and strategies for finding and building profitable AI Apps. Learn his step-by-step formula for building and growth-hacking apps.

Episode Timestamps
0:00 Intro
00:49 Introduction to AI Apps and User Acquisition
10:38 Product-Led Growth
13:30 ASO (Apple Search Optimization)
21:13 Organic Social Growth Hacks
33:51 Paid Ads Strategy
40:30 Creating Ads at Scale with AI Tools

1) Three main strategies to grow mobile apps:
• TikTok/Instagram organic growth
• Paid ads (FB, TikTok, Google)
• Apple Search Optimization (ASO)
• Each has its pros and cons.

2) ASO (Apple Search Optimization) 101:
• Find keywords with high volume, low competition
• Put keywords in app title
• Optimize app store page conversion rate
• Get reviews & high initial download velocity
• Pro tip: Use AppFigures to research keywords & opportunities

3) Organic social growth hack:
• Create native-looking content (not ads)
• Integrate app naturally into videos
• Post consistently across platforms
• Leverage micro-influencers for reach
• Example: Travel buddy app crushing it with wanderlust reels

4) Paid ads strategy:
• Test THOUSANDS of ad variations
• Use AI to scale ad creation (tools like @ArcAdsAI)
• Iterate on winning formats
• Localize ads for different markets
• Key insight: AI makes testing 1000s of ads feasible for small teams

5) Case study: 17yo dev makes $400k/mo with AI calorie tracking app
• Found trend on Reddit/TikTok
• Built simple AI-powered app
• Leveraged micro-influencers for organic growth
• Iterated on viral formats
• Lesson: Sometimes "overdone" ideas + AI + great marketing = $$$

6) Monetization hack: Niche down existing app ideas
• Examples:
a) Dating apps for specific demographics
b) Bible app for women
c) Plant identifier for specific countries
• Smaller audience, but higher conversion rates & less competition

7) AI is leveling the playing field:
• Easier to build MVP with ChatGPT
• Faster ad creation & testing
• Quick localization for global markets
• You don't need VC $ to compete with big players anymore!

8) Key takeaway: Success = Product-Market Fit + Execution
• Find proven demand (ASO, trends)
• Add AI "sizzle" to existing concepts
• Execute relentlessly on product & marketing
• Test, iterate, scale

Want more free ideas? I collect the best ideas from the pod and give them to you for free in a database. Most of them cost $0 to start (my fav). Get access: https://www.gregisenberg.com/30startupideas

MULTIVERSES
20 | Simon Kirby — Language Evolution & Emergence of Structure

Dec 7, 2023 · 93:55


Language is the ultimate Lego. With it, we can take simple elements and construct them into an edifice of meaning. Its power is not only in mapping signs to concepts but in that individual words can be composed into larger structures. How did this systematicity arise in language?

Simon Kirby is the head of Linguistics and English Language at The University of Edinburgh and one of the founders of the Centre for Language Evolution and Change. Over several decades he and his collaborators have run many elegant experiments that show that this property of language emerges inexorably as a system of communication is passed from generation to generation. Experiments with computer simulations, humans, and even baboons demonstrate that as a language is learned, mistakes are made, much like the mutations in genes. Crucially, the mistakes that better match the language to the structure of the world (as conceived by the learner) are the ones that are most likely to be passed on.

Links
Simon's website with art, music, and talks on language evolution
Simon's academic homepage
Simon on X
Multiverses Podcast home

Outline
(00:00) Introduction
(2:45) What makes language special?
(5:30) Language extends our biological bounds
(7:55) Language makes culture, culture makes language
(9:30) John Searle: world to word and word to world
(13:30) Compositionality: the expressivity of language is based on its Lego-like combinations
(16:30) Could unique genes explain the fact of language compositionality?
(17:20) … Not fully, though they might make our brains able to support compositional language
(18:20) Using simulations to model language learning and search for the emergence of structure
(19:35) Compositionality emerges from the transmission of representations across generations
(20:18) The learners need to make mistakes, but not random mistakes
(21:35) Just like biological evolution, we need variation
(27:00) When, by chance, linguistic features echo the structure of the world, these are more memorable
(33:45) Language experiments with humans (Hannah Cornish)
(36:32) Sign language experiments in the lab (Yasamin Motamedi)
(38:45) Spontaneous emergence of sign language in populations
(41:18) Communication is key to making language efficient, while transmission gives structure
(47:10) Without intentional design, these processes produce optimized systems
(50:39) We need to perceive similarity in states of the world for linguistic structure to emerge
(57:05) Why isn't language ubiquitous in nature …
(58:00) … why do only humans have cultural transmission?
(59:56) Over-imitation: Victoria Horner & Andrew Whiten, humans love to copy each other
(1:06:00) Is language a spandrel?
(1:07:10) How much of language is about information transfer? Partner-swapping conversations (Gareth Roberts)
(1:08:49) Language learning = play?
(1:12:25) Iterated learning experiments with baboons (& Tetris!)
(1:17:50) Endogenous rewards for copying
(1:20:30) Art as another angle on the same problems
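
The iterated-learning effect discussed in the episode can be illustrated with a toy transmission-chain simulation. This is a minimal sketch, not code from the episode or from Kirby's experiments; the meaning space, syllable inventory, and bottleneck size are invented. Each generation memorizes the signals it happens to observe and generalizes to unseen meanings by reusing syllables associated with each feature, so compositional structure tends to ratchet up across generations.

```python
import random
from collections import Counter

COLORS = ["red", "blue", "green", "gold"]
SHAPES = ["cube", "ball", "cone", "ring"]
MEANINGS = [(c, s) for c in COLORS for s in SHAPES]
SYLLABLES = list("abcdefgh")

def random_language():
    # Holistic starting point: every meaning gets an arbitrary two-syllable word.
    return {m: random.choice(SYLLABLES) + random.choice(SYLLABLES) for m in MEANINGS}

def learn(observed):
    # Memorize what was seen; for unseen meanings, reuse the syllable most often
    # paired with each feature value in the sample. These "mistakes" happen to
    # mirror the structure of the meaning space, so they survive transmission.
    color_votes = {c: Counter() for c in COLORS}
    shape_votes = {s: Counter() for s in SHAPES}
    for (color, shape), word in observed.items():
        color_votes[color][word[0]] += 1
        shape_votes[shape][word[1]] += 1
    pick = lambda votes: votes.most_common(1)[0][0] if votes else random.choice(SYLLABLES)
    return {
        (c, s): observed.get((c, s), pick(color_votes[c]) + pick(shape_votes[s]))
        for c, s in MEANINGS
    }

def structure(lang):
    # Fraction of meaning pairs sharing a feature that also share the matching syllable.
    hits, total = 0, 0
    for i, m1 in enumerate(MEANINGS):
        for m2 in MEANINGS[i + 1:]:
            for k in (0, 1):
                if m1[k] == m2[k]:
                    total += 1
                    hits += lang[m1][k] == lang[m2][k]
    return hits / total

lang = random_language()
for gen in range(10):
    sample = dict(random.sample(sorted(lang.items()), 8))  # transmission bottleneck
    lang = learn(sample)
    print(f"generation {gen}: structure = {structure(lang):.2f}")
```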

The Nonlinear Library
LW - A free to enter, 240 character, open-source iterated prisoner's dilemma tournament by Isaac King

Nov 9, 2023 · 0:43


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A free to enter, 240 character, open-source iterated prisoner's dilemma tournament, published by Isaac King on November 9, 2023 on LessWrong. I'm running an iterated prisoner's dilemma tournament where all programs are restricted to 240 characters maximum. The exact rules are posted in the Manifold Markets link; I figured I'd cross-post the contest here to reach more potentially-interested people. (You don't need a Manifold account to participate, you can just put your program in the comments on LessWrong or PM me.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
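
The tournament's actual submission format and scoring are specified in the linked rules, so the snippet below is only a generic illustration of an iterated prisoner's dilemma strategy, not the contest's interface: a tit-for-tat player and a tiny match loop, using an invented (my_moves, their_moves) -> move convention.

```python
def tit_for_tat(my_moves, their_moves):
    # Cooperate on the first round, then mirror the opponent's last move.
    # "C" = cooperate, "D" = defect. (Hypothetical interface; the real
    # tournament defines its own I/O format and 240-character limit.)
    return "C" if not their_moves else their_moves[-1]

def always_defect(my_moves, their_moves):
    return "D"

def play(strategy_a, strategy_b, rounds=10):
    a_hist, b_hist = [], []
    for _ in range(rounds):
        a = strategy_a(a_hist, b_hist)
        b = strategy_b(b_hist, a_hist)
        a_hist.append(a)
        b_hist.append(b)
    return a_hist, b_hist

print(play(tit_for_tat, always_defect))  # tit-for-tat starts defecting from round 2
```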

The Nonlinear Library: LessWrong
LW - A free to enter, 240 character, open-source iterated prisoner's dilemma tournament by Isaac King

Nov 9, 2023 · 0:43


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A free to enter, 240 character, open-source iterated prisoner's dilemma tournament, published by Isaac King on November 9, 2023 on LessWrong. I'm running an iterated prisoner's dilemma tournament where all programs are restricted to 240 characters maximum. The exact rules are posted in the Manifold Markets link; I figured I'd cross-post the contest here to reach more potentially-interested people. (You don't need a Manifold account to participate, you can just put your program in the comments on LessWrong or PM me.) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org

The Nonlinear Library
AF - An LLM-based “exemplary actor” by Roman Leventov

May 29, 2023 · 24:24


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An LLM-based “exemplary actor”, published by Roman Leventov on May 29, 2023 on The AI Alignment Forum.

Intro and summary

This post is the second section of "Aligning an H-JEPA agent via training on the outputs of an LLM-based 'exemplary actor'", posted separately because I think it could warrant a separate discussion, largely independent of the discussion of the H-JEPA agent with GFlowNet actors. Here's the summary of this post, copied from the "Overview" section of the main article:

In section 2, I describe the “exemplary actor”, an LMCA (language model cognitive architecture) that takes a simple, “brute force” approach to alignment: a powerful LLM (think GPT-5/6 level, with a vast, or quasi-unlimited, context) is given a list of “approved” textbooks on methodological and scientific disciplines: epistemology, rationality, ethics, physics, etc. Also, the LLM is given tools: narrow AIs (such as for protein folding or for predicting properties of materials, or for formal scientific modelling). Finally, the LLM is given a compute engine such as Wolfram and a knowledge base such as Wikidata or Wolfram Knowledgebase. The exemplary actor creates plans or predictions for given situations (described in language and fed to the LLM underlying the exemplary actor as prompts) and iteratively critiques and refines its own plans and predictions while putting different textbooks into the LLM context (first with the textbook on rationality, then epistemology, then physics, etc., with potentially dozens of different textbooks relevant to a plan or prediction that is being criticised), for many iterations, until convergence.

In section 2.1, I note that the type of alignment that the exemplary actor's architecture tries to ensure is called (world) model alignment, and that it is stronger and also more essential than goal alignment. Then, I discuss the properties of the exemplary actor.

In section 2.2, I discuss what I see as likely non-issues or straightforwardly addressable issues: the “divergent reasoning nature” of LLMs, the lack of grounded common sense reasoning, and the bias of the quick reactive network (“System 1”), if it is added to the architecture to make it more practically usable in lower-stakes reasoning settings.

In section 2.3, I discuss the outstanding technical issues and risks of the exemplary actor's architecture:
The risk of direct access to the underlying LLM (section 2.3.1).
The exemplary actor's reasoning could still be partially directed by “alien” thinking patterns (i.e., the world model) of the underlying LLM even though these influences won't surface in the explanations of the plan (section 2.3.2).
Iterated critique and refinement probably won't make plans strictly conforming to the theories described in the textbooks (section 2.3.3).

In section 2.3.4, I discuss the alignment tax of the exemplary actor (compared with the baseline of a bare, minimally fine-tuned LLM) and conclude that the main source of alignment tax might happen to be the theory of ethics, which may force the exemplary actor to refuse to participate in “games” (i.e., real-world situations and environments) where it doesn't see ethical ways of “winning”, and thus will consider inaction (or some form of palliative action) the only ethical way forward.

This is not a technical problem with the exemplary actor per se, but rather a problem with a higher-level system, i.e., the current economic, social, and political structure of the world. I mention this and other kinds of “higher-level” risks of the plans to build and deploy the exemplary actor (i.e., roughly the plans that OpenAI and Anthropic are betting on, as it seems to me) in section 2.4.

2. An LLM-based “exemplary actor”

Let's assume that we have three things: First, a very powerful auto-regressive LLM (think GPT-5/6 level) with the ability to effe...
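
The iterative critique-and-refinement loop summarized above can be sketched schematically. This is only an illustrative skeleton, not code from the post: call_llm is a toy stand-in so the sketch runs (in the proposal it would be a powerful LLM with textbooks, narrow-AI tools, a compute engine, and a knowledge base available), and the convergence test is deliberately crude.

```python
# Illustrative skeleton of the critique-and-refine loop (not from the post).

TEXTBOOKS = ["rationality", "epistemology", "ethics", "physics"]

def call_llm(prompt: str) -> str:
    # Toy stand-in: a real system would call an actual model here.
    if prompt.startswith("CRITIQUE"):
        return "no further issues" if "revised" in prompt else "be more specific"
    if prompt.startswith("REVISE"):
        return prompt.split("PLAN:")[1].strip() + " (revised)"
    return "initial plan for: " + prompt

def exemplary_actor(situation: str, max_rounds: int = 10) -> str:
    plan = call_llm(situation)
    for _ in range(max_rounds):
        previous = plan
        for textbook in TEXTBOOKS:
            critique = call_llm(f"CRITIQUE using {textbook} textbook PLAN: {plan}")
            if critique != "no further issues":
                plan = call_llm(f"REVISE according to: {critique} PLAN: {plan}")
        if plan == previous:   # crude convergence check
            break
    return plan

print(exemplary_actor("help a city reduce traffic"))
```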

The Nonlinear Library
LW - Notes on OpenAI's alignment plan by Alex Flint

Dec 8, 2022 · 9:57


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on OpenAI's alignment plan, published by Alex Flint on December 8, 2022 on LessWrong.

This work was supported by the Monastic Academy for the Preservation of Life on Earth.

OpenAI has published a plan for producing aligned AI systems. Very roughly, their plan is to build language models capable of producing human-level alignment research. The output of these models might be natural language essays about alignment, or code that directly implements machine learning systems. In either case, human researchers would spend their time reviewing machine-generated alignment research. Their research agenda goes as follows.

Step 1: Train language models to solve useful problems using reinforcement learning. Rather than having a human review each attempt the model makes, which would be very time-consuming, train a second model that watches a few human reviews and learns to predict what reward the human would have given. Then train the main reinforcement learning model using this second model to provide a lot of feedback. You still need a human to review this second model every so often to make sure it has learned the right thing, but overall it requires less human effort than having a human review every attempt that the reinforcement learning model makes.

Step 2: When tasks become so complex that humans struggle to assess the outputs of the model, use some combination of the following:

Train models that critique their own outputs or the outputs of other models, possibly requiring the initial model to expand its reasoning around certain points in its original output. ("debate")

Train auxiliary models that help humans work out what reward to give to the primary model by solving more tractable sub-problems. In assessing the validity of a philosophy essay, you might train an auxiliary model to identify which paragraphs contain the key reasoning steps, and then use this model as a tool when assessing the philosophy essays. It might be that the sub-problems are themselves very difficult to provide training data for, in which case you can build models to solve sub-sub-problems that help humans to assess the output of the models targeting the sub-problems, and so on recursively. ("Recursive reward modeling")

Train a model that is general-purpose enough that it can itself do useful work to assist a human generating training data for the model's own training. Each time you train a new version of the model, give that model to humans to use as a tool in generating more training data, and then retrain the model on that improved training data. In training an internet-search assistant, for example, you might have humans use the previous version of the assistant to answer sub-questions that are relevant to assessing a top-level answer or answering a top-level question. If this produces better training data than in the previous iteration then you may also be able to train a better internet-search assistant using this improved training data. You can repeat this process for as long as the thing keeps improving. ("Iterated amplification")

Step 3: Use the above to build a language model capable of producing human-level alignment research. Presumably we would put in prompts like "what are the governing dynamics of agency in intelligent systems?" or "how can we avoid mesa-optimization in supervised learning?" and the thing would produce essays containing insights that clarify the alignment problem for us.

Discussion

There is a bit of a leap from "fine-tune a language model" to "produce human-level alignment research". Let's say we're training a language model to write insightful essays about alignment. We begin with some language model that has been pre-trained using, say, text from the internet. We give it a prompt such as "what would it mean to measure degrees of agency in arbitrary computer p...
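
The reward-modeling idea in Step 1 can be reduced to a toy script. This is an illustration only, with invented features and ratings, and is not OpenAI's implementation: fit a small reward predictor on a handful of human ratings, then use it to supply dense feedback when scoring many candidate outputs without a human in the loop.

```python
# Toy illustration of Step 1 above (invented data; not OpenAI's code).
import random

def features(text):
    # Invented stand-in features for an output: length and keyword counts.
    return [len(text) / 100, text.count("because"), text.count("therefore")]

# A few human-reviewed (output, reward) pairs.
human_reviews = [
    ("short answer", 0.1),
    ("a longer answer because it explains its reasoning", 0.8),
    ("a rambling answer with no structure at all whatsoever", 0.3),
    ("a careful answer, therefore it justifies each step because it must", 0.9),
]

# Fit a linear reward model with plain stochastic gradient descent.
w = [0.0, 0.0, 0.0]
for _ in range(2000):
    text, y = random.choice(human_reviews)
    x = features(text)
    err = sum(wi * xi for wi, xi in zip(w, x)) - y
    w = [wi - 0.01 * err * xi for wi, xi in zip(w, x)]

def reward_model(text):
    return sum(wi * xi for wi, xi in zip(w, features(text)))

# Stand-in for the RL step: among many sampled candidates, prefer the ones the
# reward model scores highest; a human only occasionally audits these scores.
candidates = ["answer", "answer because reasons", "answer, therefore conclusion because evidence"]
print(max(candidates, key=reward_model))
```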

The Nonlinear Library: LessWrong
LW - Notes on OpenAI's alignment plan by Alex Flint

Dec 8, 2022 · 9:57


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Notes on OpenAI's alignment plan, published by Alex Flint on December 8, 2022 on LessWrong.

This work was supported by the Monastic Academy for the Preservation of Life on Earth.

OpenAI has published a plan for producing aligned AI systems. Very roughly, their plan is to build language models capable of producing human-level alignment research. The output of these models might be natural language essays about alignment, or code that directly implements machine learning systems. In either case, human researchers would spend their time reviewing machine-generated alignment research. Their research agenda goes as follows.

Step 1: Train language models to solve useful problems using reinforcement learning. Rather than having a human review each attempt the model makes, which would be very time-consuming, train a second model that watches a few human reviews and learns to predict what reward the human would have given. Then train the main reinforcement learning model using this second model to provide a lot of feedback. You still need a human to review this second model every so often to make sure it has learned the right thing, but overall it requires less human effort than having a human review every attempt that the reinforcement learning model makes.

Step 2: When tasks become so complex that humans struggle to assess the outputs of the model, use some combination of the following:

Train models that critique their own outputs or the outputs of other models, possibly requiring the initial model to expand its reasoning around certain points in its original output. ("debate")

Train auxiliary models that help humans work out what reward to give to the primary model by solving more tractable sub-problems. In assessing the validity of a philosophy essay, you might train an auxiliary model to identify which paragraphs contain the key reasoning steps, and then use this model as a tool when assessing the philosophy essays. It might be that the sub-problems are themselves very difficult to provide training data for, in which case you can build models to solve sub-sub-problems that help humans to assess the output of the models targeting the sub-problems, and so on recursively. ("Recursive reward modeling")

Train a model that is general-purpose enough that it can itself do useful work to assist a human generating training data for the model's own training. Each time you train a new version of the model, give that model to humans to use as a tool in generating more training data, and then retrain the model on that improved training data. In training an internet-search assistant, for example, you might have humans use the previous version of the assistant to answer sub-questions that are relevant to assessing a top-level answer or answering a top-level question. If this produces better training data than in the previous iteration then you may also be able to train a better internet-search assistant using this improved training data. You can repeat this process for as long as the thing keeps improving. ("Iterated amplification")

Step 3: Use the above to build a language model capable of producing human-level alignment research. Presumably we would put in prompts like "what are the governing dynamics of agency in intelligent systems?" or "how can we avoid mesa-optimization in supervised learning?" and the thing would produce essays containing insights that clarify the alignment problem for us.

Discussion

There is a bit of a leap from "fine-tune a language model" to "produce human-level alignment research". Let's say we're training a language model to write insightful essays about alignment. We begin with some language model that has been pre-trained using, say, text from the internet. We give it a prompt such as "what would it mean to measure degrees of agency in arbitrary computer p...

The Nonlinear Library
LW - Humans Consulting HCH by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 2:48


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 8: Humans Consulting HCH, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(See also: strong HCH.)

Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine. That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh. Let's call this process HCH, for “Humans Consulting HCH.” I've talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for proposing using a recursive acronym.) HCH is easy to specify very precisely. For now, I think that HCH is our best way to precisely specify “a human's enlightened judgment.” It's got plenty of problems, but for now I don't know anything better.

Elaborations

We can define realizable variants of this inaccessible ideal:

For a particular prediction algorithm P, define HCHᴾ as: “P's prediction of what a human would say after consulting HCHᴾ”

For a reinforcement learning algorithm A, define max-HCHᴬ as: “A's output when maximizing the evaluation of a human after consulting max-HCHᴬ”

For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as: “the market's prediction of what a human will say after consulting HCHᵐᵃʳᵏᵉᵗ”

Note that e.g. HCHᴾ is totally different from “P's prediction of HCH.” HCHᴾ will generally make worse predictions, but it is easier to implement.

Hope

The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are:

As capable as the underlying predictor, reinforcement learner, or market participants.

Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH. (At least when the human is suitably prudent and wise.)

It is clear from the definitions that these systems can't be any more capable than the underlying predictor/learner/market. I honestly don't know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ probably can't. It is similarly unclear whether the system continues to reflect the human's judgment. In some sense this is in tension with the desire to be capable — the more guarded the human, the less capable the system but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
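
Because HCH is defined recursively, a few lines of code make the shape of the definition concrete. This is a toy rendering, not from the post: human_answer is a stand-in for Hugh, and a depth budget stands in for the infinite tower of consultations.

```python
# Toy rendering of the HCH recursion described above (not from the post).

def human_answer(question, ask):
    # Stand-in human: decompose one known question, answer a sub-question
    # directly, and give up otherwise. `ask` lets the human consult HCH.
    if question == "Is 12345 * 6789 even?":
        last_digit = ask("What is the last digit of 12345 * 6789?")
        return "yes" if last_digit in "02468" else "no"
    if question == "What is the last digit of 12345 * 6789?":
        return str((5 * 9) % 10)  # only the factors' last digits matter
    return "I don't know"

def hch(question, budget=10):
    if budget == 0:
        return "out of budget"
    # The machine answers by imitating the human, who may consult HCH again.
    return human_answer(question, lambda q: hch(q, budget - 1))

print(hch("Is 12345 * 6789 even?"))  # -> "no"
```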

The Nonlinear Library
LW - AlphaGo Zero and capability amplification by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 3:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 14: AlphaGo Zero and capability amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy.

How AlphaGo Zero works

AlphaGo Zero learns two functions (which take as input the current board):

A prior over moves p is trained to predict what AlphaGo will eventually decide to do.

A value function v is trained to predict which player will win (if AlphaGo plays both sides).

Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by using 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target.

Iterated capability amplification

In the simplest form of iterated capability amplification, we train one function:

A “weak” policy A, which is trained to predict what the agent will eventually decide to do in a given situation.

Just like AlphaGo doesn't use the prior p directly to pick moves, we don't use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target.

In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.) Outside of Go, A might be a question-answering system, which can be applied several times in order to first break a question down into pieces and then separately answer each component. Or it might be a policy that updates a cognitive workspace, which can be applied many times in order to “think longer” about an issue.

The significance

Reinforcement learners take a reward function and optimize it; unfortunately, it's not clear where to get a reward function that faithfully tracks what we care about. That's a key source of safety concerns. By contrast, AlphaGo Zero takes a policy-improvement-operator (like MCTS) and converges towards a fixed point of that operator. If we can find a way to improve a policy while preserving its alignment, then we can apply the same algorithm in order to get very powerful but aligned strategies.

Using MCTS to achieve a simple goal in the real world wouldn't preserve alignment, so it doesn't fit the bill. But “think longer” might. As long as we start with a policy that is close enough to being aligned — a policy that “wants” to be aligned, in some sense — allowing it to think longer may make it both smarter and more aligned. I think designing alignment-preserving policy amplification is a tractable problem today, which can be studied either in the context of existing ML or human coordination. So I think it's an exciting direction in AI alignment.
A candidate solution could be incorporated directly into the AlphaGo Zero architecture, so we can already get empirical feedback on what works. If by good fortune powerful AI systems look like AlphaGo Zero, then that might get us much of the way to an aligned AI. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
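
The amplify-then-distill loop has a simple schematic form. The sketch below is illustrative only, not from the post: the "task" is computing 1 + 2 + ... + n, amplification decomposes a question into a subquestion answered by the current fast policy, and distillation just memorizes the amplified answers, so each round the fast policy can handle slightly harder questions, chasing the moving target.

```python
# Illustrative skeleton of the amplify/distill loop described above.

def amplify(policy):
    # A slow but stronger solver: decompose f(n) into f(n-1) + n, consulting
    # the fast policy for the subproblem (stand-in for "call A many times").
    def amplified(n):
        if n == 0:
            return 0
        sub = policy(n - 1)
        return None if sub is None else sub + n
    return amplified

def distill(amplified, questions):
    # "Train" a fast policy on the amplified policy's answers.
    # A lookup table stands in for supervised learning.
    table = {n: amplified(n) for n in questions}
    return lambda n: table.get(n)  # None means "don't know"

policy = lambda n: 0 if n == 0 else None   # weak initial policy
for round_ in range(5):
    policy = distill(amplify(policy), questions=range(10))
    known = [n for n in range(10) if policy(n) is not None]
    print(f"round {round_}: can solve n up to {max(known)}")
print(policy(4))  # -> 10
```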

The Nonlinear Library
LW - Supervising strong learners by amplifying weak experts by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 1:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 13: Supervising strong learners by amplifying weak experts, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost for the paper of the same name.

Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017b), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors.

Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library
LW - Benign model-free RL by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 12:45


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 11: Benign model-free RL, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

In my last post, I described three research areas in AI control that I see as central: reward learning, robustness, and deliberation. In this post I argue that these three pieces may be sufficient to get a benign and competitive version of model-free reinforcement learning. I think this is an important intermediate goal of solving AI control. This post doesn't discuss benign model-based RL at all, which I think is another key obstacle for prosaic AI control.

(This post overlaps extensively with my post on ALBA, but I hope this one will be much clearer. Technically, ALBA is an implementation of the general strategy outlined in this post. I think the general strategy is much more important than that particular implementation.)

Ingredients

Reward learning and robustness

Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution). Schematically, we can think of reward learning + robustness as a widget which takes a slow, benign process H and produces a fast, benign process A.

A's capabilities should be roughly the “intersection” of H's capabilities and our RL algorithms' competence. That is, A should be able to perform a task whenever both H can perform that task and our RL algorithms can learn to perform that task. In these pictures, the vertical axis corresponds intuitively to “capability,” with higher agents being more capable. But in reality I'm thinking of the possible capabilities as forming a complete lattice. That is, a generic pair of levels of capabilities is incomparable, with neither strictly dominating the other.

Amplification

If we iteratively apply reward learning and robustness, we will obtain a sequence of weaker and weaker agents. To get anywhere, we need some mechanism that lets us produce a stronger agent. The capability amplification problem is to start with a weak agent A and a human expert H, and to produce a significantly more capable agent Hᴬ. The more capable agent can take a lot longer to think; all we care about is that it eventually arrives at better decisions than A. The key challenge is ensuring that Hᴬ remains benign, i.e. that the system doesn't acquire new preferences as it becomes more capable.

An example approach is to provide A as an assistant to H. We can give H an hour to deliberate, and let it consult A thousands of times during that hour. Hᴬ's output is then whatever H outputs at the end of that process. Because H is consulting A a large number of times, we can hope that the resulting system will be much smarter than A. Of course, the resulting system will be thousands of times more computationally expensive than A, but that's fine. In general, meta-execution is my current preferred approach to capability amplification.

Schematically, we can think of amplification as a widget which takes a fast, benign process A and produces a slow, benign process Hᴬ.

Putting it together

With these two widgets in hand, we can iteratively produce a sequence of increasingly competent agents: That is, we start with our benign expert H. We then learn a reward function and train an agent A, which is less capable than H but can run much faster. By running many instances of A, we obtain a more powerful agent Hᴬ, which is approximately as expensive as H. We can then repeat the process, using Hᴬ to train an agent A⁺ which runs as fast as A but is more capable. By running A⁺ for a long time we obtain a still more capable agent Hᴬ⁺, and the cycle re...

The Nonlinear Library
LW - Iterated Distillation and Amplification by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 10:54


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 10: Iterated Distillation and Amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a guest post summarizing Paul Christiano's proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values, which I call Iterated Distillation and Amplification (IDA) here. IDA is notably similar to AlphaGoZero and expert iteration. The hope is that if we use IDA to train each learned component of an AI, then the overall AI will remain aligned with the user's interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA.

Motivation: The alignment/capabilities tradeoff

Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level — that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human. There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities:

Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards.

Broad inverse reinforcement learning: A attempts to infer our deep long-term values from our actions, perhaps using a sophisticated model of human psychology and irrationality to select which of many possible extrapolations is correct.

However, it is difficult to specify a broad objective that captures everything we care about, so in practice A will be optimizing for some proxy that is not completely aligned with our interests. Even if this proxy objective is “almost” right, its optimum could be disastrous according to our true values.

On the other end, there are techniques that try to narrowly emulate human judgments:

Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A's actions from the human expert's actions.

Narrow inverse reinforcement learning: We could train A to infer our near-term instrumental values from our actions, with the presumption that our actions are roughly optimal according to those values.

Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards.

Using these techniques, the risk of misalignment is reduced significantly (though not eliminated) by restricting agents to the range of known human behavior — but this introduces severe limitations on capability. This tradeoff between allowing for novel capabilities and reducing misalignment risk applies across different learning schemes (with imitation learning generally being narrowest and lowest risk) as well as within a single scheme. The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans?

Core concept: Analogy to AlphaGoZero

The core idea of Paul's scheme is simila...

The Nonlinear Library
LW - Corrigibility by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 10:38


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 9: Corrigibility, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Warning: rambling.)

I would like to build AI systems which help me:

Figure out whether I built the right AI and correct any mistakes I made
Remain informed about the AI's behavior and avoid unpleasant surprises
Make better decisions and clarify my preferences
Acquire resources and remain in effective control of them
Ensure that my AI systems continue to do all of these nice things
...and so on

We say an agent is corrigible (article on Arbital) if it has these properties. I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles; it has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense.

In this post I claim:

A benign act-based agent will be robustly corrigible if we want it to be.
A sufficiently corrigible agent will tend to become more corrigible and benign over time.
Corrigibility marks out a broad basin of attraction towards acceptable outcomes.

As a consequence, we shouldn't think about alignment as a narrow target which we need to implement exactly and preserve precisely. We're aiming for a broad basin, and trying to avoid problems that could kick out of that basin. This view is an important part of my overall optimism about alignment, and an important background assumption in some of my writing.

1. Benign act-based agents can be corrigible

A benign agent optimizes in accordance with our preferences. An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible. If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences. This kind of corrigibility doesn't require any special machinery. An act-based agent turns off when the overseer presses the “off” button not because it has received new evidence, or because of delicately balanced incentives. It turns off because that's what the overseer prefers.

Contrast with the usual futurist perspective

Omohundro's The Basic AI Drives argues that “almost all systems [will] protect their utility functions from modification,” and Soares, Fallenstein, Yudkowsky, and Armstrong cite as: “almost all [rational] agents are instrumentally motivated to preserve their preferences.” This motivates them to consider modifications to an agent to remove this default incentive. Act-based agents are generally an exception to these arguments, since the overseer has preferences about whether the agent protects its utility function from modification. Omohundro presents the preferences-about-your-utility-function case as a somewhat pathological exception, but I suspect that it will be the typical state of affairs for powerful AI (as for humans) and it does not appear to be unstable. It's also very easy to implement in 2017.

Is act-based corrigibility robust?

How is corrigibility affected if an agent is ignorant or mistaken about the overseer's preferences? I think you don't need particularly accurate models of a human's preferences before you can predict that they want their robot to turn off when they press the off button or that they don't want to be lied to.
In the concrete case of an approval-directed agent, “human preferences” are represented by human responses to questions of the form “how happy would you be if I did a?” If the agent is considering the action a precisely because it is manipulative or would thwart the user's attempts to correct the system, then it doesn't seem hard to predict that the overseer will object to a. Eliezer has suggested that this is a very anthropocentric judgment of “...

The Nonlinear Library
LW - An unaligned benchmark by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 15:32


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 4: An unaligned benchmark, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

My goal is to design AI systems that are aligned with human interests and competitive with unaligned AI. I find it useful to have a particular AI algorithm in mind. Then I can think about how that algorithm could cause trouble, and try to find a safer variant. I think of the possibly-unaligned AIs as a benchmark: it's what AI alignment researchers need to compete with. The further we fall short of the benchmark, the stronger the competitive pressures will be for everyone to give up on aligned AI and take their chances. I have a few standard benchmarks I keep in mind. This post describes one of those benchmarks. It also tries to lay out clearly why I think that benchmark is unsafe, and explains how I think my current research could make a safe version.

I. Model-based RL with MCTS

We train three systems in parallel:

A generative model to sample sequences of observations, conditioned on sequences of actions.
A reward function that takes as input a sequence of actions and predicted observations and produces a reward.
A policy and value function which take as input a sequence of observations and produce the next action and an estimate of the future return.

We train the policy and value function using (roughly) the AlphaZero algorithm: use MCTS to improve the current policy; update the policy at the root to predict the best move found by MCTS; update the value to predict its predicted value; use the generative model to sample environment transitions and the reward function (with a small discount rate) to score them.

We train an autoregressive generative model, to maximize the log probability assigned to the actual sequence of actions and observations produced by the AI (with each observation conditioned on the past actions). This isn't actually a good way to train the generative model, but it's not really central to the discussion.

We train the reward function by showing humans sequences of actions and predicted observations, asking them to assign scores, then predicting those scores with supervised learning. We show humans the sequences of actions that look most promising to the system.

There are plenty of details you'd need in order to make this work well, but that's the basic idea. When applied with very powerful networks, it's plausible that this system would be able to decisively outcompete humans. It would be capable of performing a large intelligent search over long sequences of actions to find those that would be rated highly.

II. What goes wrong?

There are two classes of problems.

Problem 1: Bad objective

The goal of the system is to produce (action, observation) sequences that look good to humans. I claim that optimizing this objective faithfully will lead to bad outcomes. As the system improves, the rationale of many individual actions will become incomprehensible to a human overseer. At this point the only option for a human is to evaluate sequences of observations based on whether the consequences look good. The observations present a narrow view of the world, and I strongly suspect that the AI will find sequences of actions that make that narrow view look good without actually being good.

Control vs. intrinsic goodness

I think there are two strategies for defining a reward function:

Reward worlds in which humans remain in control of the situation, in which they are able to get accurate information and correct course as needed.
Reward worlds in which intrinsically good things are happening.

Both of these strategies seem unworkable.

Strategy #1: maintaining control. This appears to be unworkable because determining if humans are actually in control is incredibly difficult — at best you can tell w...

The Nonlinear Library
LW - Approval-directed bootstrapping by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 2:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 7: Approval-directed bootstrapping, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Approval-directed behavior works best when the overseer is very smart. Where can we find a smart overseer? One approach is bootstrapping. By thinking for a long time, a weak agent can oversee an agent (slightly) smarter than itself. Now we have a slightly smarter agent, who can oversee an agent which is (slightly) smarter still. This process can go on, until the intelligence of the resulting agent is limited by technology rather than by the capability of the overseer. At this point we have reached the limits of our technology. This may sound exotic, but we can implement it in a surprisingly straightforward way. Suppose that we evaluate Hugh's approval by predicting what Hugh would say if we asked him; the rating of action a is what Hugh would say if, instead of taking action a, we asked Hugh, “How do you rate action a?” Now we get bootstrapping almost for free. In the process of evaluating a proposed action, Hugh can consult Arthur. This new instance of Arthur will, in turn, be overseen by Hugh—and in this new role Hugh can, in turn, be assisted by Arthur. In principle we have defined the entire infinite regress before Arthur takes his first action. We can even learn this function by examples — no elaborate definitions necessary. Each time Arthur proposes an action, we actually ask Hugh to evaluate the action with some probability, and we use our observations to train a model for Hugh's judgments. In practice, Arthur might not be such a useful assistant until he has acquired some training data. As Arthur acquires training data, the Hugh+Arthur system becomes more intelligent, and so Arthur acquires training data from a more intelligent overseer. The bootstrapping unfolds over time as Arthur adjusts to increasingly powerful overseers. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
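
The bootstrapping loop can be made concrete with a toy script. This is illustrative only, not code from the post: Arthur rates actions using a learned lookup table of Hugh's judgments, occasionally queries a stand-in Hugh for fresh training data, and Hugh can peek at Arthur's model while answering (a crude stand-in for consulting Arthur).

```python
# Toy rendering of approval-directed bootstrapping as described above
# (illustrative only; the real proposal involves learned models, not tables).
import random

class Arthur:
    def __init__(self, hugh):
        self.hugh = hugh
        self.model = {}          # learned approximation of Hugh's ratings

    def rate(self, action):
        # With some probability, actually ask Hugh (who may consult Arthur)
        # and keep the answer as training data; otherwise use the learned model.
        if action not in self.model or random.random() < 0.1:
            self.model[action] = self.hugh(action, assistant=self)
        return self.model[action]

    def act(self, options):
        return max(options, key=self.rate)

def hugh(action, assistant):
    # Stand-in overseer: rates actions by length and may consult Arthur's
    # model about a trivially related query.
    hint = assistant.model.get(action.upper(), 0)
    return len(action) + 0.01 * hint

arthur = Arthur(hugh)
print(arthur.act(["tidy desk", "answer email", "do nothing"]))  # -> "answer email"
```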

The Nonlinear Library
LW - Approval-directed agents by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 25:24


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 6: Approval-directed agents, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Note: This is the first post from part two (basic intuitions) of the sequence on iterated amplification. The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal.

Research in AI is steadily progressing towards more flexible, powerful, and autonomous goal-directed behavior. This progress is likely to have significant economic and humanitarian benefits: it helps make automation faster, cheaper, and more effective, and it allows us to automate deciding what to do. Many researchers expect goal-directed machines to predominate, and so have considered the long-term implications of this kind of automation. Some of these implications are worrying: if sophisticated artificial agents pursue their own objectives and are as smart as we are, then the future may be shaped as much by their goals as by ours. Most thinking about “AI safety” has focused on the possibility of goal-directed machines, and asked how we might ensure that their goals are agreeable to humans. But there are other possibilities. In this post I will flesh out one alternative to goal-directed behavior. I think this idea is particularly important from the perspective of AI safety.

Approval-directed agents

Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action:

Estimate the expected rating Hugh would give each action if he considered it at length.
Take the action with the highest expected rating.

I'll call this “approval-directed” behavior throughout this post, in contrast with goal-directed behavior. In this context I'll call Hugh an “overseer.”

Arthur's actions are rated more highly than those produced by any alternative procedure. That's comforting, but it doesn't mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can't anticipate those consequences himself. For example, if Arthur is playing chess he should make moves that are actually good—not moves that Hugh thinks are good. The quality of approval-directed decisions is limited by the minimum of Arthur's ability and Hugh's ability: Arthur makes a decision only if it looks good to both Arthur and Hugh.

So why would Hugh be interested in this proposal, rather than doing things himself?

Hugh doesn't actually rate actions, he just participates in a hypothetical rating process. So Hugh can oversee many agents like Arthur at once (and spend his actual time relaxing on the beach). In many cases, this is the whole point of automation.

Hugh can (hypothetically) think for a very long time about each decision—longer than would be practical or cost-effective if he had to actually make the decision himself. Similarly, Hugh can think about Arthur's decisions at a very low level of detail. For example, Hugh might rate a chess-playing AI's choices about how to explore the game tree, rather than rating its final choice of moves. If Arthur is making billions of small decisions each second, then Hugh can think in depth about each of them, and the resulting system can be much smarter than Hugh.

Hugh can (hypothetically) use additional resources in order to make his rating: powerful computers, the benefit of hindsight, many assistants, very long time periods. Hugh's capabilities can be gradually escalated as needed, and one approval-directed system can be used to bootstrap to a more effective successor. For example, Arthur could advise Hugh on how to define a better overseer; Arthur could offer advice in real-time to help Hugh be a better o...

The Nonlinear Library
LW - Prosaic AI alignment by paulfchristiano from Iterated Amplification

Dec 24, 2021 · 13:11


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 5: Prosaic AI alignment, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Related: a possible stance for AI control.)

It's conceivable that we will build “prosaic” AGI, which doesn't reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.” I think we wouldn't know how to align such an AGI; moreover, in the process of building it, we wouldn't necessarily learn anything that would make the alignment problem more approachable. So I think that understanding this case is a natural priority for research on AI alignment. In particular, I don't think it is reasonable to say “we'll know how to cross that bridge when we come to it,” or “it's impossible to do meaningful work without knowing more about what powerful AI will look like.” If you think that prosaic AGI is plausible, then we may already know what the bridge will look like when we get to it: if we can't do meaningful work now, then we have a problem.

1. Prosaic AGI

It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn't involve qualitatively new ideas about “how intelligence works:”

It's plausible that a large neural network can replicate “fast” human cognition, and that by coupling it to simple computational mechanisms — short and long-term memory, attention, etc. — we could obtain a human-level computational architecture.

It's plausible that a variant of RL can train this architecture to actually implement human-level cognition. This would likely involve some combination of ingredients like model-based RL, imitation learning, or hierarchical RL. There are a whole bunch of ideas currently on the table and being explored; if you can't imagine any of these ideas working out, then I feel that's a failure of imagination (unless you see something I don't).

We will certainly learn something by developing prosaic AGI. The very fact that there were no qualitatively new ideas is itself surprising. And beyond that, we'll get a few more bits of information about which particular approach works, fill in a whole bunch of extra details about how to design and train powerful models, and actually get some experimental data. But none of these developments seem to fundamentally change the alignment problem, and existing approaches to AI alignment are not bottlenecked on this kind of information. Actually having the AI in front of us may let us work several times more efficiently, but it's not going to move us from “we have no idea how to proceed” to “now we get it.”

2. Our current state

2a. The concern

If we build prosaic superhuman AGI, it seems most likely that it will be trained by reinforcement learning (extending other frameworks to superhuman performance would require new ideas). It's easy to imagine a prosaic RL system learning to play games with superhuman levels of competence and flexibility. But we don't have any shovel-ready approach to training an RL system to autonomously pursue our values. To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit.
If we had very powerful RL systems, such a DAO might be able to outcompete human organizations at a wide range of tasks — producing and selling cheaper widgets, but also influencing government policy, extorting/manipulating other actors, and so on. The shareholders of such a DAO may be able to capture the value it creates as long as they are able to retain effective control over its computing hardware / reward signal. Similarly, as long as such DAOs are weak enough to be effectively governed by existing laws and institutions, they are likely to benefit humanity even if they reinvest all of their profits. But a...

The Nonlinear Library
LW - The reward engineering problem by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 12:01


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 16: The reward engineering problem, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Today we usually train reinforcement learning agents to perform narrow tasks with simple goals. We may eventually want to train RL agents to behave “well” in open-ended environments where there is no simple goal. Suppose that we are trying to train an RL agent A. In each episode, A interacts with an environment, producing a transcript τ. We then evaluate that transcript, producing a reward r ∈ [0, 1]. A is trained to maximize its reward. We would like to set up the rewards so that A will learn to behave well — that is, such that if A learns to receive a high reward, then we will be happy with A's behavior. To make the problem feasible, we assume that we have access to another agent H which is “smarter” than A, and makes “good” decisions. In order to evaluate transcript τ, we allow ourselves to make any number of calls to H, and to use any other tools that are available. The question is: how do we carry out the evaluation, so that the optimal strategy for A is to also make “good” decisions? Following Daniel Dewey, I'll call this the reward engineering problem. Note that our evaluation process may be quite expensive, and actually implementing it may be infeasible. To build a working system, we would need to combine this evaluation with semi-supervised RL and learning with catastrophes. Possible approaches and remaining problems I know of 3 basic approaches to reward engineering: Direct supervision. Use H to evaluate A's behavior, and train A to maximize H's evaluations. In some contexts we could compare two behaviors instead of evaluating one in isolation. Imitation learning. Use H to generate a bunch of transcripts, and train A to produce similar-looking transcripts. For example, we could train a model to distinguish A's behavior from H's behavior, and reward A when it fools the distinguisher. Inverse reinforcement learning. Use H to generate a bunch of transcripts, and then infer a reward function which is being approximately optimized by H. Use this reward function to evaluate A's behavior. All of these approaches are promising but face significant challenges. I'll describe some of these problems in the next 3 sections. 1. Direct supervision In direct supervision, H looks at a transcript of A's behavior, and estimates how good that transcript is. To see the problem with this scheme, suppose that A has been asked to draw a picture, and A does it by copying an existing picture with some modifications. If originality is especially important, then this may be a very “bad” policy. But even if H is much smarter than A, it may be hard to tell that the picture is not original — creating a derivative work only requires looking at a single existing picture, while checking if a work is derivative requires considering every picture. More formally: in order for direct supervision to be effective, H needs to be better-informed than A about what is “good.” If this condition is satisfied, then from A's perspective, estimating H's estimate of goodness is equivalent to estimating actual goodness. This condition is superficially plausible — after all, we did assume that H is smarter than A. 
The problem is that when A picks an action, A is especially well-informed about that action — the computation which produced the action provides evidence about it, and H may not have access to that evidence. Transparency One response is to let H see how A computed its action. If H can understand that process, then H may be able to effectively evaluate the action. Sometimes this is straightforward: for example, if A uses an attention mechanism to look at a particular painting and copy it, we can simply tell H what A look...
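As a rough, self-contained illustration of the direct supervision approach described in this episode, here is a minimal Python sketch in which the overseer H scores each transcript and that score is used directly as A's reward. All of the names and the toy environment here (overseer_evaluate, run_episode, the toy dynamics) are hypothetical stand-ins invented for illustration, not code from the original post.

import random

def overseer_evaluate(transcript):
    # Stand-in for H: in the post this is an expensive evaluation that may involve
    # many calls to a smarter agent; here it is just a toy scoring function.
    return random.random()

def run_episode(policy, horizon=5):
    # Roll out one episode against a toy environment, recording the transcript
    # of (observation, action) pairs.
    transcript = []
    observation = 0
    for _ in range(horizon):
        action = policy(observation)
        transcript.append((observation, action))
        observation = observation + action  # toy dynamics
    return transcript

def direct_supervision_reward(transcript):
    # Direct supervision: H inspects the whole transcript and returns r in [0, 1];
    # A is then trained to maximize this reward.
    return overseer_evaluate(transcript)

if __name__ == "__main__":
    policy = lambda observation: random.choice([0, 1])
    tau = run_episode(policy)
    print("reward:", direct_supervision_reward(tau))

The worry raised in the episode applies directly to this loop: if H only sees the finished transcript, A may find behavior that H scores highly without it being genuinely good (the derivative-artwork example), which is what motivates letting H see how A computed its action.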

The Nonlinear Library
LW - The Steering Problem by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 11:53


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 2: The Steering Problem, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Most AI research focuses on reproducing human abilities: to learn, infer, and reason; to perceive, plan, and predict. There is a complementary problem which (understandably) receives much less attention: if you had these abilities, what would you do with them? The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities? This post explains what the steering problem is and why I think it's worth spending time on. Introduction A capable, well-motivated human can be extremely useful: they can work without oversight, produce results that need not be double-checked, and work towards goals that aren't precisely defined. These capabilities are critical in domains where decisions cannot be easily supervised, whether because they are too fast, too complex, or too numerous. In some sense “be as useful as possible” is just another task at which a machine might reach human-level performance. But it is different from the concrete capabilities normally considered in AI research. We can say clearly what it means to "predict well," "plan well," or "reason well." If we ignored computational limits, machines could achieve any of these goals today. And before the existing vision of AI is realized, we must necessarily achieve each of these goals. For now, "be as useful as possible" is in a different category. We can't say exactly what it means. We could not do it no matter how fast our computers could compute. And even if we resolved the most salient challenges in AI, we could remain in the dark about this one. Consider a capable AI tasked with running an academic conference. How should it use its capabilities to make decisions? We could try to specify exactly what makes a conference good or bad. But our requirements are complex and varied, and so specifying them exactly seems time-consuming or impossible. We could build an AI that imitates successful conference organizers. But this approach can never do any better than the humans we are imitating. Realistically, it won't even match human performance unless we somehow communicate what characteristics are important and why. We could ask an AI to maximize our satisfaction with the conference. But we'll get what we measure. An extensive evaluation would greatly increase the cost of the conference, while a superficial evaluation would leave us with a conference optimized for superficial metrics. Everyday experience with humans shows how hard delegation can be, and how much easier it is to assign a task to someone who actually cares about the outcome. Of course there is already pressure to write useful programs in addition to smart programs, and some AI research studies how to efficiently and robustly communicate desired behaviors. For now, available solutions apply only in limited domains or to weak agents. The steering problem is to close this gap. Motivation A system which "merely" predicted well would be extraordinarily useful. Why does it matter whether we know how to make a system which is “as useful as possible”? Our machines will probably do some things very effectively. 
We know what it means to "act well" in the service of a given goal. For example, using human cognitive abilities as a black box, we could probably design autonomous corporations which very effectively maximized growth. If the black box was cheaper than the real thing, such autonomous corporations could displace their conventional competitors. If machines can do everything equally well, then this would be great news. If not, society's direction may be profoundly influenced by what can and cannot...

The Nonlinear Library
LW - Preface to the sequence on iterated amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 4:22


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 1: Preface to the sequence on iterated amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This sequence describes iterated amplification, a possible strategy for building an AI that is actually trying to do what we want out of ML systems trained by gradient descent. Iterated amplification is not intended to be a silver bullet that resolves all of the possible problems with AI; it's an approach to the particular alignment problem posed by scaled-up versions of modern ML systems. Iterated amplification is based on a few key hopes: If you have an overseer who is smarter than the agent you are trying to train, you can safely use that overseer's judgment as an objective. We can train an RL system using very sparse feedback, so it's OK if that overseer is very computationally expensive. A team of aligned agents may be smarter than any individual agent, while remaining aligned. If all of these hopes panned out, then at every point in training “a team of the smartest agents we've been able to train so far” would be a suitable overseer for training a slightly smarter aligned successor. This could let us train very intelligent agents while preserving alignment (starting the induction from an aligned human). Iterated amplification is still in a preliminary state and is best understood as a research program rather than a worked out solution. Nevertheless, I think it is the most concrete existing framework for aligning powerful ML with human interests. Purpose and audience The purpose of this sequence is to communicate the basic intuitions motivating iterated amplification, to define iterated amplification, and to present some of the important open questions. I expect this sequence to be most useful for readers who would like to have a somewhat detailed understanding of iterated amplification, and are looking for something more structured than ai-alignment.com to help orient themselves. The sequence is intended to provide enough background to follow most public discussion about iterated amplification, and to be useful for building intuition and informing research about AI alignment even if you never think about amplification again. The sequence will be easier to understand if you have a working understanding of ML, statistics, and online learning, and if you are familiar with other work on AI alignment. But it would be reasonable to just dive in and skip over any detailed discussion that seems to depend on missing prerequisites. Outline and reading recommendations The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect. The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal. The core of the sequence is the third section. Benign model-free RL describes iterated amplification, as a general framework into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness. 
The first four posts all describe variants of this idea from different perspectives, and if you find that one of those descriptions is clearest for you then I recommend focusing on that one and skimming the others. The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches. The final section is an FAQ by Alex Zhu, included as append...

The Nonlinear Library
LW - Directions and desiderata for AI alignment by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 22:44


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 15: Directions and desiderata for AI alignment, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Note: This is the first post from part four: what needs doing of the sequence on iterated amplification. The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment. In the first half of this post, I'll discuss three research directions that I think are especially promising and relevant to AI alignment: Reliability and robustness. Building ML systems which behave acceptably in the worst case rather than only on the training distribution. Oversight / reward learning. Constructing objectives and training strategies which lead our policies to do what we intend. Deliberation and amplification. Surpassing human performance without simultaneously abandoning human preferences. I think that we have several angles of attack on each of these problems, and that solutions would significantly improve our ability to align AI. My current feeling is that these areas cover much of the key work that needs to be done. In the second half of the post, I'll discuss three desiderata that I think should guide research on alignment: Secure. Our solutions should work acceptably even when the environment itself is under the influence of an adversary. Competitive. Our solutions should impose minimal overhead, performance penalties, or restrictions compared to malign AI. Scalable. Our solutions should continue to work well even when the underlying learning systems improve significantly. I think that taking these requirements seriously leads us to substantially narrow our focus. It may turn out that these desiderata are impossible to meet, but if so I think that the first order of business should be understanding clearly why they are impossible. This would let us better target our work on alignment and better prepare for a future where we won't have a completely satisfying solution to alignment. (The ideas in this post are not novel. My claimed contribution is merely collecting these things together. I will link to my own writing on each topic in large part because that's what I know.) I. Research directions 1. Reliability and robustness Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution. I think this is bad news: Deploying ML systems will critically change their environment, in a way that is hard or impossible to simulate at training time. (The “treacherous turn” is a special case of this phenomenon.) Deployed ML systems are interconnected and exposed to the same world. So if conditions change in a way that causes one of them to fail, many systems may fail simultaneously. If ML systems are extremely powerful, or if they play a critical role in society, then a widespread failure may have catastrophic consequences. 
I'm aware of three basic approaches to reliability that seem to me like they could plausibly scale and be competitive: (ETA: this list is superseded by the list in Techniques for Optimizing Worst-Case Performance. I removed consensus and added interpretability and verification. I don't discuss “learning the right model,” which I still consider a long shot.) Adversarial training. At training time, attempt to construct inputs that induce problematic behavior and train on those. Eventually, we hope there will be no catastrophe-inducing inputs left. We don't yet know what is possible to achieve. (Szegedy 2014...
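To make the adversarial training idea above concrete, here is a hedged Python sketch of the basic loop: alternately search for inputs that induce problematic behavior and train on them. The function names (find_adversarial_input, is_problematic, train_on, sample_inputs) are hypothetical placeholders for whatever attack, judgment, and training procedures one actually has; real attacks would be far stronger than this naive search.

def find_adversarial_input(model, is_problematic, candidate_inputs):
    # Attacker step: look for an input on which the current model behaves
    # problematically. Here this is naive exhaustive search over candidates.
    for x in candidate_inputs:
        if is_problematic(x, model(x)):
            return x
    return None

def adversarial_training(model, train_on, is_problematic, sample_inputs, rounds=10):
    # Alternate between attacking the model and training it on the failures found,
    # hoping that eventually no catastrophe-inducing inputs remain.
    for _ in range(rounds):
        bad_input = find_adversarial_input(model, is_problematic, sample_inputs())
        if bad_input is None:
            break  # this (weak) attacker found no remaining failure
        model = train_on(model, bad_input)
    return model

Whether this converges, and whether the attacker is strong enough to surface the failures that actually matter, is exactly the open question the episode points at.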

The Nonlinear Library
LW - Clarifying "AI Alignment" by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 5:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 3: Clarifying "AI Alignment", published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do. The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean. In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed. Analogy Consider a human assistant who is trying their hardest to do what H wants. I'd say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I'd say we've solved the alignment problem. “Aligned” doesn't mean “perfect:” They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time. They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect. They may not know everything about H's preferences, and so fail to recognize that a particular side effect is bad. They may build an unaligned AI (while attempting to build an aligned AI). I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won't make them more aligned. (For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can't recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.) Clarifications The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I'd call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true. An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn't part of my definition of alignment except insofar as it's part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment. An aligned AI would also be trying to do what H wants with respect to clarifying H's preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. 
Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask. This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we've made significant progress. One reason the definition is imprecise is ...

The Nonlinear Library
LW - Learning with catastrophes by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 6:09


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 18: Learning with catastrophes, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian. Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions. In this post, I'll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment. Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack. Modeling catastrophes We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent's observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.) While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don't arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle. Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions: The agent made a catastrophically bad decision. The agent's observations are plausible: we have a right to expect the agent to be able to handle those observations. While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic. Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure. We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic. Traditionally in RL the goal is to get as much reward as the best policy from some class C. We'll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision. Batch learning I've described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes. In the batch version, there is no penalty for catastrophes at training time, and we don't care about training error. The two performance parameters are test-time performance and test-time catastrophe probability. 
The oracle This definition depends on an oracle who determines which transcripts are catastrophic. For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn't cut it. In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem. Approach Learning with catastro...
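Here is a minimal Python sketch of the training/deployment split described in this episode, with the catastrophe oracle available freely at training time but not during interaction with nature. Everything here (propose_transcript, oracle_is_catastrophic, train_away, run_episode) is a hypothetical placeholder used to show the shape of the model, not an algorithm from the post.

def train_with_catastrophe_oracle(agent, propose_transcript, oracle_is_catastrophic,
                                  train_away, n_queries=1000):
    # Training: the oracle may be queried on arbitrary transcripts, including
    # synthetic ones that never arise from a real episode.
    for _ in range(n_queries):
        tau = propose_transcript(agent)       # e.g. produced by an adversary or a simulator
        if oracle_is_catastrophic(tau):
            agent = train_away(agent, tau)    # update the agent to avoid this behavior
    return agent

def deploy(agent, run_episode, oracle_is_catastrophic, n_episodes=100):
    # Deployment: the oracle cannot be consulted before acting, and a single
    # catastrophic transcript counts as immediate failure. The two performance
    # parameters are the failure indicator and the total reward absent failure.
    total_reward = 0.0
    for _ in range(n_episodes):
        tau, r = run_episode(agent)           # reward r in [0, 1]
        if oracle_is_catastrophic(tau):       # judged only in hindsight here
            return total_reward, True
        total_reward += r
    return total_reward, False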

The Nonlinear Library
LW - Techniques for optimizing worst-case performance by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 13:57


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 20: Techniques for optimizing worst-case performance, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it's not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly. The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation). In this post I want to explain my view of the problem and enumerate some possible angles of attack. My goal is to communicate why I have hope that worst-case guarantees are achievable. None of these are novel proposals. The intention of this post is to explain my view, not to make a new contribution. I don't currently work in any of these areas, and so this post should be understood as an outsider looking in, rather than coming from the trenches. Malign vs. benign failures and corrigibility I want to distinguish two kinds of failures: “Benign” failures, where our system encounters a novel situation, doesn't know how to handle it, and so performs poorly. The resulting behavior may simply be erratic, or may serve an external attacker. Their effect is similar to physical or cybersecurity vulnerabilities — they create an opportunity for destructive conflict but don't systematically disfavor human values. They may pose an existential risk when combined with high-stakes situations, in the same way that human incompetence may pose an existential risk. Although these failures are important, I don't think it is necessary or possible to eliminate them in the worst case. “Malign” failures, where our system continues to behave competently but applies its intelligence in the service of an unintended goal. These failures systematically favor whatever goals AI systems tend to pursue in failure scenarios, at the expense of human values. They constitute an existential risk independent of any other destructive technology or dangerous situation. Fortunately, they seem both less likely and potentially possible to avoid even in the worst case. I'm most interested in malign failures, and the narrower focus is important to my optimism. The distinction between malign and benign failures is not always crisp. For example, suppose we try to predict a human's preferences, then search over all strategies to find the one that best satisfies the predicted preferences. Guessing the preferences even a little bit wrong would create an adversarial optimizer incentivized to apply its intelligence to a purpose at odds with our real preferences. If we take this approach, incompetence does systematically disfavor human values. By aiming for corrigible rather than optimal behavior (see here or here) I'm optimistic that it is possible to create a sharper distinction between benign and malign failures, which can be leveraged by the techniques below. But for now, this hope is highly speculative. 
Amplification I believe that these techniques are much more likely to work if we have access to an overseer who is significantly smarter than the model that we are trying to train. I hope that amplification makes this possible. It seems realistic for a strong overseer to recognize an (input, output) pair as a malign failure mode (though it may require a solution to informed oversight). So now we have a concrete goal: find a model that never gives an output the overseer would diagnose as catastrophically bad. Historically researchers in the AI safety community have been extremely pessimistic about...

The Nonlinear Library
LW - Reliability amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 12:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 21: Reliability amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. In a recent post I talked about capability amplification, a putative procedure that turns a large number of fast weak agents into a slower, stronger agent. If we do this in a naive way, it will decrease reliability. For example, if our weak policy fails with probability 1%, and we implement a strong policy by combining 10 decisions made by weak agents (where if any of these 10 decisions is bad, then so is the combination), then the combination will be bad with roughly 10% probability. Although the combination can be more powerful than any individual decision, in this case it is much less reliable. If we repeat policy amplification several times, our failure probability could quickly approach 1, even if it started out being exponentially small. Complementary to capability amplification is reliability amplification: given a policy H that usually works, can we implement a policy that works significantly more reliably? To be slightly less imprecise (but still quite crude): Given a distribution A over policies that is ε-close to a benign policy for some ε ≪ 1, can we implement a distribution A⁺ over policies which is δ-close to a benign policy of similar capability, for some δ ≪ ε? When I say that A is ε-close to a benign policy, I mean that there is a benign distribution over policies which has total variation distance at most ε from A. We can view this as a simple mixture between a single aligned policy that is chosen with probability (1−ε) and an adversarial policy that is chosen with probability ε. I will move freely between talking about “policies” and “distributions over policies;” we can turn a distribution into a single policy by averaging. The notion of “implement” here is exactly the same as for capability amplification: we can use any number of implementations of A and whatever other resources are at our disposal, and must compute A⁺'s responses to a sequence of observations, within some time limit such as a day. Reliability amplification via deliberation I think there is one very natural way to do reliability amplification — if some variant of this idea doesn't work, I'm not sure what else to try. As a warm-up, let's consider the setting where the agents want to make a single decision rather than a sequence of decisions — we'll return to the sequential case soon. We sample three agents A, B, and C from the given distribution. We ask them to talk for a while and then each propose an action. Then we ask them to do approval voting amongst the three options, and randomly pick from the maximally-approved-of proposals. If all three agents are aligned, then this proposal is clearly going to produce an aligned action. If two of the agents behave adversarially then we are clearly out of luck, but that happens only with probability O(ε²), so we can basically ignore it. The interesting case is when there is one adversarial agent and two aligned agents. If we threw up our hands and picked randomly in this case, then we would have a failure probability of ε. So the question is: can we do better than that? The obvious hope is that aligned agents can do better-than-chance at recognizing each other / at recognizing good actions. 
If they can get any non-negligible advantage, then we can get a non-negligible reduction in error probability (and by iterating the procedure we can potentially get large reductions). The sequential case So far we have talked about making a single decision. The problem is more challenging when we need to make a sequence of decisions. We can try to simply generalize the voting approach, running a new vote for each action. To see the difficulty, suppose that the optimal policy looks as follo...
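The single-decision voting scheme described in this episode is simple enough to sketch directly. Below is a hedged Python rendering: sample three agents from the distribution, let each propose an action, run approval voting, and pick randomly among the maximally-approved proposals. The agent interface (sample_policy, propose, approves) is hypothetical, introduced only for illustration.

import random
from collections import Counter

def reliability_amplify(sample_policy, situation):
    # Sample three agents from the given distribution over policies.
    agents = [sample_policy() for _ in range(3)]

    # Each agent deliberates (elided here) and proposes an action.
    proposals = [agent.propose(situation) for agent in agents]

    # Approval voting: each agent approves whichever proposals it endorses.
    approvals = Counter()
    for agent in agents:
        for i, proposal in enumerate(proposals):
            if agent.approves(situation, proposal):
                approvals[i] += 1

    # Randomly pick from the maximally-approved-of proposals.
    top = max(approvals[i] for i in range(3))
    best = [i for i in range(3) if approvals[i] == top]
    return proposals[random.choice(best)]

If each sampled agent is adversarial independently with probability ε, then two or more adversaries appear with probability about 3ε², which is negligible for small ε; the scheme's value therefore hinges on whether two aligned agents can beat a lone adversary more often than chance, as the episode goes on to discuss.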

The Nonlinear Library
LW - Security amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 21:18


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 22: Security amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. An apparently aligned AI system may nevertheless behave badly with small probability or on rare “bad” inputs. The reliability amplification problem is to reduce the failure probability of an aligned AI. The analogous security amplification problem is to reduce the prevalence of bad inputs on which the failure probability is unacceptably high. We could measure the prevalence of bad inputs by looking at the probability that a random input is bad, but I think it is more meaningful to look at the difficulty of finding a bad input. If it is exponentially difficult to find a bad input, then in practice we won't encounter any. If we could transform a policy in a way that multiplicatively increases the difficulty of finding a bad input, then by interleaving that process with a distillation step like imitation or RL we could potentially train policies which are as secure as the learning algorithms themselves — eliminating any vulnerabilities introduced by the starting policy. For sophisticated AI systems, I currently believe that meta-execution is a plausible approach to security amplification. (ETA: I still think that this basic approach to security amplification is plausible, but it's now clear that meta-execution on its own can't work.) Motivation There are many inputs on which any particular implementation of “human judgment” will behave surprisingly badly, whether because of trickery, threats, bugs in the UI used to elicit the judgment, snow-crash-style weirdness, or whatever else. (The experience of computer security suggests that complicated systems typically have many vulnerabilities, both on the human side and the machine side.) If we aggressively optimize something to earn high approval from a human, it seems likely that we will zoom in on the unreasonable part of the space and get an unintended result. What's worse, this flaw seems to be inherited by any agent trained to imitate human behavior or optimize human approval. For example, inputs which cause humans to behave badly would also cause a competent human-imitator to behave badly. The point of security amplification is to remove these human-generated vulnerabilities. We can start with a human, use them to train a learning system (that inherits the human vulnerabilities), use security amplification to reduce these vulnerabilities, use the result to train a new learning system (that inherits the reduced set of vulnerabilities), apply security amplification to reduce those vulnerabilities further, and so on. The agents do not necessarily get more powerful over the course of this process — we are just winnowing away the idiosyncratic human vulnerabilities. This is important, if possible, because it (1) lets us train more secure systems, which is good in itself, and (2) allows us to use weak aligned agents as reward functions for an extensive search. I think that for now this is one of the most plausible paths to capturing the benefits of extensive search without compromising alignment. 
Security amplification would not be directly usable as a substitute for informed oversight, or to protect an overseer from the agent it is training, because informed oversight is needed for the distillation step which allows us to iterate security amplification without exponentially increasing costs. Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model's ability to represent/learn a secure policy. So we'll have to deal with those problems separately. Towards a definition The security amplif...

The Nonlinear Library
LW - Meta-execution by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 7:41


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 23: Meta-execution, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This post describes meta-execution, my current proposal for capability amplification and security amplification. (Meta-execution is annotated functional programming + strong HCH + a level of indirection. It is implemented in the amplify module of my ALBA repository.) Goal We are given an efficient agent A which competently pursues some values. We'd like to use a bunch of copies of A in order to implement a more powerful and robust agent Meta(A) with the same values. Outline Our basic plan is to build a machine out of copies of the agent; instead of asking the agent to make a decision directly, we ask it to implement the decision-making process by answering a sequence of questions of the form “what should happen next?” The basic object in meta-execution is a message, which consists of text along with pointers to other messages or to agents. Each box is a message. A is an agent who can respond to queries like “which of X and Y is larger?” We can represent arbitrarily large objects as giant trees of messages and agents. Meta-execution first forms a tree representing the question “what should be done?” It then asks the agent A to perform a sequence of operations on the tree that eventually lead to an answer. Then it executes that answer. The initial tree might look something like this: If you can answer this question, you can implement an agent. At any given time, an agent who is operating on this tree can only “see” a few messages: it can read the text of those messages, and see pointers like [red] and [blue]. Initially the agent can see only the root of the tree. If you are an agent tasked with processing a message, there are a few basic operations you can perform. You specify the “targets” of the action by specifying pointers you want to follow: Look at another part of the tree which is not currently visible. Spawn a new agent, and see a pointer to that agent. Send a message to an agent, and see its reply. You can compose a message by writing it out with sub-messages in parentheses (); for example, “What is the smallest element in (the list with first element [green] and remaining elements [purple]) according to [blue]” would produce the message in the first image above, if [green], [purple], [blue] had appropriate values. Terminate the current computation by composing a reply. This reply gets sent to the “parent” who initiated the current computation. In the case of the very first agent, who was created in order to answer the original question “what should an agent in state [red] do after receiving input [blue]?”, the reply specifies what the overall system should do. An example execution is illustrated here. I made a quick demo of the execution process, you can find it here. And that's basically it. We spawn a new agent, and hand it the “what should we do?” message. It can then take any of the basic actions listed above and see the result. We repeat that process until the agent returns a message indicating what should be done. We parse the message as an action and new state (see the section on parsing below), we execute the action, and we update the system's state. The details Hopefully for most purposes that outline tells you everything you need to know. 
If not, the easiest way to learn exactly how this works is probably just to look at the code. Meta-execution is implemented as lambda A : Meta(HCH(A, n)) in the package amplify.__init__, where n is the computational budget and A is the meta-executor. You can experience being the meta-executor by calling examples.meta.act("test") . The available commands are described in the README. Everything is immutable I assume that we have a digital implementation o...
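The post's actual implementation lives in the ALBA repository; the following is only a loose, simplified Python sketch of the message-and-operations picture described above, written for this summary. It folds the separate "spawn" and "send" operations into a single "ask" step, and the Message and executor interfaces are assumptions of this sketch rather than the repository's API.

from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Message:
    # A message is a piece of text plus pointers to sub-messages.
    text: str
    pointers: List["Message"] = field(default_factory=list)

# The meta-executor is modelled as a function from the currently visible message
# to an operation: ("look", i) to follow pointer i, ("ask", Message) to pose a
# sub-question to a fresh copy of the process, or ("reply", Message) to terminate.
Operation = Tuple[str, object]
MetaExecutor = Callable[[Message], Operation]

def meta_execute(executor: MetaExecutor, question: Message, budget: int) -> Message:
    visible = question
    for _ in range(budget):
        kind, argument = executor(visible)
        if kind == "look":                 # inspect a part of the tree not yet visible
            visible = visible.pointers[argument]
        elif kind == "ask":                # delegate a sub-question, see its answer
            visible = meta_execute(executor, argument, budget // 2)
        elif kind == "reply":              # compose the answer for the parent
            return argument
    return Message("budget exhausted")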

The Nonlinear Library
LW - Thoughts on reward engineering by paulfchristiano from Iterated Amplification

The Nonlinear Library

Play Episode Listen Later Dec 24, 2021 18:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 19: Thoughts on reward engineering, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Note: This is the first post from part five: possible approaches of the sequence on iterated amplification. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches. Suppose that I would like to train an RL agent to help me get what I want. If my preferences could be represented by an easily-evaluated utility function, then I could just use my utility function as the agent's reward function. But in the real world that's not what human preferences look like. So if we actually want to turn our preferences into a reward function suitable for training an RL agent, we have to do some work. This post is about the straightforward parts of reward engineering. I'm going to deliberately ignore what seem to me to be the hardest parts of the problem. Getting the straightforward parts out of the way seems useful for talking more clearly about the hard parts (and you never know what questions may turn out to be surprisingly subtle). The setting To simplify things even further, for now I'll focus on the special case where our agent is taking a single action a. All of the difficulties that arise in the single-shot case also arise in the sequential case, but the sequential case also has its own set of additional complications that deserve their own post. Throughout the post I will imagine myself in the position of an “overseer” who is trying to specify a reward function R(a) for an agent. You can imagine the overseer as the user themselves, or (more realistically) as a team of engineers and/or researchers who are implementing a reward function intended to express the user's preferences. I'll often talk about the overseer computing R(a) themselves. This is at odds with the usual situation in RL, where the overseer implements a very fast function for computing R(a) in general (“1 for a win, 0 for a draw, -1 for a loss”). Computing R(a) for a particular action a is strictly easier than producing a fast general implementation, so in some sense this is just another simplification. I talk about why it might not be a crazy simplification in section 6. Contents Long time horizons. How do we train RL agents when we care about the long-term effects of their actions? Inconsistency and unreliability. How do we handle the fact that we have only imperfect access to our preferences, and different querying strategies are not guaranteed to yield consistent or unbiased answers? Normative uncertainty. How do we train an agent to behave well in light of its uncertainty about our preferences? Widely varying reward. How do we handle rewards that may vary over many orders of magnitude? Sparse reward. What do we do when our preferences are very hard to satisfy, such that they don't provide any training signal? Complex reward. What do we do when evaluating our preferences is substantially more expensive than running the agent? Conclusion. Appendix: harder problems. 1. Long time horizons A single decision may have very long-term effects. For example, even if I only care about maximizing human happiness, I may instrumentally want my agent to help advance basic science that will one day improve cancer treatment. 
In principle this could fall out of an RL task with “human happiness” as the reward, so we might think that neglecting long-term effects is just a shortcoming of the single-shot problem. But even in theory there is no way that an RL agent can learn to handle arbitrarily long-term dependencies (imagine training an RL agent to handle 40 year time horizons), and so focusing on the sequential RL problem doesn't address this issue. I think that the only real approach is t...

The Nonlinear Library: LessWrong
LW - Humans Consulting HCH by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 2:48


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 8: Humans Consulting HCH, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (See also: strong HCH.) Consider a human Hugh who has access to a question-answering machine. Suppose the machine answers question Q by perfectly imitating how Hugh would answer question Q, if Hugh had access to the question-answering machine. That is, Hugh is able to consult a copy of Hugh, who is able to consult a copy of Hugh, who is able to consult a copy of Hugh. Let's call this process HCH, for “Humans Consulting HCH.” I've talked about many variants of this process before, but I find it easier to think about with a nice handle. (Credit to Eliezer for proposing using a recursive acronym.) HCH is easy to specify very precisely. For now, I think that HCH is our best way to precisely specify “a human's enlightened judgment.” It's got plenty of problems, but for now I don't know anything better. Elaborations We can define realizable variants of this inaccessible ideal: For a particular prediction algorithm P, define HCHᴾ as: “P's prediction of what a human would say after consulting HCHᴾ” For a reinforcement learning algorithm A, define max-HCHᴬ as: “A's output when maximizing the evaluation of a human after consulting max-HCHᴬ” For a given market structure and participants, define HCHᵐᵃʳᵏᵉᵗ as: “the market's prediction of what a human will say after consulting HCHᵐᵃʳᵏᵉᵗ” Note that e.g. HCHᴾ is totally different from “P's prediction of HCH.” HCHᴾ will generally make worse predictions, but it is easier to implement. Hope The best case is that HCHᴾ, max-HCHᴬ, and HCHᵐᵃʳᵏᵉᵗ are: As capable as the underlying predictor, reinforcement learner, or market participants. Aligned with the enlightened judgment of the human, e.g. as evaluated by HCH. (At least when the human is suitably prudent and wise.) It is clear from the definitions that these systems can't be any more capable than the underlying predictor/learner/market. I honestly don't know whether we should expect them to match the underlying capabilities. My intuition is that max-HCHᴬ probably can, but that HCHᴾ and HCHᵐᵃʳᵏᵉᵗ probably can't. It is similarly unclear whether the system continues to reflect the human's judgment. In some sense this is in tension with the desire to be capable — the more guarded the human, the less capable the system but the more likely it is to reflect their interests. The question is whether a prudent human can achieve both goals. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
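For readers who find the recursion easier to see in code, here is a tiny Python sketch of HCH with an explicit budget to keep the toy recursion finite (the ideal HCH has no such limit). The consult_human callable is a hypothetical stand-in for Hugh; swapping in a learned predictor of Hugh would give something in the spirit of the HCHᴾ variant mentioned above.

def hch(question, consult_human, budget):
    # consult_human(question, ask) should return Hugh's answer, where ask is a
    # function Hugh may call to pose sub-questions to another copy of this process.
    if budget <= 0:
        return consult_human(question, ask=None)   # Hugh answers unaided
    def ask(subquestion):
        return hch(subquestion, consult_human, budget - 1)
    return consult_human(question, ask=ask)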

The Nonlinear Library: LessWrong
LW - Supervising strong learners by amplifying weak experts by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 1:17


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 13: Supervising strong learners by amplifying weak experts, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a linkpost. Abstract: Many real world learning tasks involve complex or hard-to-specify objectives, and using an easier-to-specify proxy can lead to poor performance or misaligned behavior. One solution is to have humans provide a training signal by demonstrating or judging performance, but this approach fails if the task is too complicated for a human to directly evaluate. We propose Iterated Amplification, an alternative training strategy which progressively builds up a training signal for difficult problems by combining solutions to easier subproblems. Iterated Amplification is closely related to Expert Iteration (Anthony et al., 2017; Silver et al., 2017b), except that it uses no external reward function. We present results in algorithmic environments, showing that Iterated Amplification can efficiently learn complex behaviors. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library: LessWrong
LW - Benign model-free RL by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 12:45


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 11: Benign model-free RL, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. In my last post, I described three research areas in AI control that I see as central: reward learning, robustness, and deliberation. In this post I argue that these three pieces may be sufficient to get a benign and competitive version of model-free reinforcement learning. I think this is an important intermediate goal of solving AI control. This post doesn't discuss benign model-based RL at all, which I think is another key obstacle for prosaic AI control. (This post overlaps extensively with my post on ALBA, but I hope this one will be much clearer. Technically, ALBA is an implementation of the general strategy outlined in this post. I think the general strategy is much more important than that particular implementation.) Ingredients Reward learning and robustness Given a benign agent H, reward learning allows us to construct a reward function r that can be used to train a weaker benign agent A. If our training process is robust, the resulting agent A will remain benign off of the training distribution (though it may be incompetent off of the training distribution). Schematically, we can think of reward learning + robustness as a widget which takes a slow, benign process H and produces a fast, benign process A. A's capabilities should be roughly the “intersection” of H's capabilities and our RL algorithms' competence. That is, A should be able to perform a task whenever both H can perform that task and our RL algorithms can learn to perform that task. In these pictures, the vertical axis corresponds intuitively to “capability,” with higher agents being more capable. But in reality I'm thinking of the possible capabilities as forming a complete lattice. That is, a generic pair of levels of capabilities is incomparable, with neither strictly dominating the other. Amplification If we iteratively apply reward learning and robustness, we will obtain a sequence of weaker and weaker agents. To get anywhere, we need some mechanism that lets us produce a stronger agent. The capability amplification problem is to start with a weak agent A and a human expert H, and to produce a significantly more capable agent Hᴬ. The more capable agent can take a lot longer to think; all we care about is that it eventually arrives at better decisions than A. The key challenge is ensuring that Hᴬ remains benign, i.e. that the system doesn't acquire new preferences as it becomes more capable. An example approach is to provide A as an assistant to H. We can give H an hour to deliberate, and let it consult A thousands of times during that hour. Hᴬ's output is then whatever H outputs at the end of that process. Because H is consulting A a large number of times, we can hope that the resulting system will be much smarter than A. Of course, the resulting system will be thousands of times more computationally expensive than A, but that's fine. In general, meta-execution is my current preferred approach to capability amplification. 
Schematically, we can think of amplification as a widget which takes a fast, benign process A and produces a slow, benign process Hᴬ: Putting it together With these two widgets in hand, we can iteratively produce a sequence of increasingly competent agents: That is, we start with our benign expert H. We then learn a reward function and train an agent A, which is less capable than H but can run much faster. By running many instances of A, we obtain a more powerful agent Hᴬ, which is approximately as expensive as H. We can then repeat the process, using Hᴬ to train an agent A⁺ which runs as fast as A but is more capable. By running A⁺ for a long time we obtain a still more capable agent Hᴬ⁺, and the cycle re...
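Schematically, the iteration just described can be written as a short loop. The sketch below is a hedged Python rendering of that schematic; distill (reward learning + robust RL) and amplify (e.g. H consulting many copies of the fast agent) are named placeholders for the widgets discussed in the episode, not concrete algorithms.

def iterated_benign_rl(H, distill, amplify, rounds):
    # H: the initial slow, benign expert.
    # distill(overseer): reward learning + robustness, yielding a fast benign agent
    #   that is somewhat weaker than its overseer.
    # amplify(H, agent): a slower composite agent (e.g. H consulting the fast agent
    #   many times), hopefully more capable than the fast agent while remaining benign.
    overseer, fast_agent = H, None
    for _ in range(rounds):
        fast_agent = distill(overseer)       # A, then A+, then A++, ...
        overseer = amplify(H, fast_agent)    # H consulting A, then H consulting A+, ...
    return overseer, fast_agent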

The Nonlinear Library: LessWrong
LW - Iterated Distillation and Amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 10:54


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 10: Iterated Distillation and Amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is a guest post summarizing Paul Christiano's proposed scheme for training machine learning systems that can be robustly aligned to complex and fuzzy values, which I call Iterated Distillation and Amplification (IDA) here. IDA is notably similar to AlphaGoZero and expert iteration. The hope is that if we use IDA to train each learned component of an AI then the overall AI will remain aligned with the user's interests while achieving state of the art performance at runtime — provided that any non-learned components such as search or logic are also built to preserve alignment and maintain runtime performance. This document gives a high-level outline of IDA. Motivation: The alignment/capabilities tradeoff Assume that we want to train a learner A to perform some complex fuzzy task, e.g. “Be a good personal assistant.” Assume that A is capable of learning to perform the task at a superhuman level — that is, if we could perfectly specify a “personal assistant” objective function and trained A to maximize it, then A would become a far better personal assistant than any human. There is a spectrum of possibilities for how we might train A to do this task. On one end, there are techniques which allow the learner to discover powerful, novel policies that improve upon human capabilities: Broad reinforcement learning: As A takes actions in the world, we give it a relatively sparse reward signal based on how satisfied or dissatisfied we are with the eventual consequences. We then allow A to optimize for the expected sum of its future rewards. Broad inverse reinforcement learning: A attempts to infer our deep long-term values from our actions, perhaps using a sophisticated model of human psychology and irrationality to select which of many possible extrapolations is correct. However, it is difficult to specify a broad objective that captures everything we care about, so in practice A will be optimizing for some proxy that is not completely aligned with our interests. Even if this proxy objective is “almost” right, its optimum could be disastrous according to our true values. On the other end, there are techniques that try to narrowly emulate human judgments: Imitation learning: We could train A to exactly mimic how an expert would do the task, e.g. by training it to fool a discriminative model trying to tell apart A's actions from the human expert's actions. Narrow inverse reinforcement learning: We could train A to infer our near-term instrumental values from our actions, with the presumption that our actions are roughly optimal according to those values. Narrow reinforcement learning: As A takes actions in the world, we give it a dense reward signal based on how reasonable we judge its choices are (perhaps we directly reward state-action pairs themselves rather than outcomes in the world, as in TAMER). A optimizes for the expected sum of its future rewards. Using these techniques, the risk of misalignment is reduced significantly (though not eliminated) by restricting agents to the range of known human behavior — but this introduces severe limitations on capability. 
This tradeoff between allowing for novel capabilities and reducing misalignment risk applies across different learning schemes (with imitation learning generally being narrowest and lowest risk) as well as within a single scheme. The motivating problem that IDA attempts to solve: if we are only able to align agents that narrowly replicate human behavior, how can we build an AGI that is both aligned and ultimately much more capable than the best humans? Core concept: Analogy to AlphaGoZero. The core idea of Paul's scheme is simila...
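To make the amplify-and-distill loop summarized above concrete, here is a minimal toy sketch in Python. It assumes nothing beyond the structure described in the summary: a slow overseer, standing in for a human aided by calls to the current model, produces training targets, and a fast model is repeatedly distilled to imitate them. All names here (Model, amplify, human_judgment, the toy queries) are hypothetical illustrations, not code from the post.

```python
# Toy sketch of Iterated Distillation and Amplification (IDA).
# All names (Model, amplify, human_judgment, the toy queries) are
# hypothetical illustrations, not code from the post.

class Model:
    """A trivial 'learned' policy: answers queries from a lookup table."""
    def __init__(self):
        self.table = {}

    def answer(self, query):
        return self.table.get(query, 0.0)

    def train(self, examples):
        # Distillation: fit the fast model to the slow, amplified answers.
        for query, target in examples:
            self.table[query] = target


def human_judgment(query, subanswers):
    # Stand-in for the human overseer combining sub-answers into a judgment.
    return sum(subanswers) + (1.0 if query == "root" else 0.1)


def amplify(model, query, n_subqueries=3):
    # Amplification: the overseer answers a hard query by delegating
    # subqueries to the current model and aggregating the results.
    subqueries = [f"{query}/sub{i}" for i in range(n_subqueries)]
    subanswers = [model.answer(q) for q in subqueries]
    return human_judgment(query, subanswers)


def ida(rounds=5, queries=("root", "root/sub0", "root/sub1", "root/sub2")):
    model = Model()
    for _ in range(rounds):
        # Amplify: produce training targets with the slow overseer + model team.
        examples = [(q, amplify(model, q)) for q in queries]
        # Distill: train the fast model to reproduce those targets.
        model.train(examples)
    return model


if __name__ == "__main__":
    print(ida().answer("root"))
```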

The Nonlinear Library: LessWrong
LW - Corrigibility by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 10:38


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 9: Corrigibility, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (Warning: rambling.) I would like to build AI systems which help me: Figure out whether I built the right AI and correct any mistakes I made; Remain informed about the AI's behavior and avoid unpleasant surprises; Make better decisions and clarify my preferences; Acquire resources and remain in effective control of them; Ensure that my AI systems continue to do all of these nice things; and so on. We say an agent is corrigible (article on Arbital) if it has these properties. I believe this concept was introduced in the context of AI by Eliezer and named by Robert Miles; it has often been discussed in the context of narrow behaviors like respecting an off-switch, but here I am using it in the broadest possible sense. In this post I claim: A benign act-based agent will be robustly corrigible if we want it to be. A sufficiently corrigible agent will tend to become more corrigible and benign over time. Corrigibility marks out a broad basin of attraction towards acceptable outcomes. As a consequence, we shouldn't think about alignment as a narrow target which we need to implement exactly and preserve precisely. We're aiming for a broad basin, and trying to avoid problems that could kick out of that basin. This view is an important part of my overall optimism about alignment, and an important background assumption in some of my writing. 1. Benign act-based agents can be corrigible. A benign agent optimizes in accordance with our preferences. An act-based agent considers our short-term preferences, including (amongst others) our preference for the agent to be corrigible. If on average we are unhappy with the level of corrigibility of a benign act-based agent, then by construction it is mistaken about our short-term preferences. This kind of corrigibility doesn't require any special machinery. An act-based agent turns off when the overseer presses the “off” button not because it has received new evidence, or because of delicately balanced incentives. It turns off because that's what the overseer prefers. Contrast with the usual futurist perspective: Omohundro's The Basic AI Drives argues that “almost all systems [will] protect their utility functions from modification,” and Soares, Fallenstein, Yudkowsky, and Armstrong write that “almost all [rational] agents are instrumentally motivated to preserve their preferences.” This motivates them to consider modifications to an agent to remove this default incentive. Act-based agents are generally an exception to these arguments, since the overseer has preferences about whether the agent protects its utility function from modification. Omohundro presents the preferences-about-your-utility-function case as a somewhat pathological exception, but I suspect that it will be the typical state of affairs for powerful AI (as for humans) and it does not appear to be unstable. It's also very easy to implement in 2017. Is act-based corrigibility robust? How is corrigibility affected if an agent is ignorant or mistaken about the overseer's preferences?
I think you don't need particularly accurate models of a human's preferences before you can predict that they want their robot to turn off when they press the off button or that they don't want to be lied to. In the concrete case of an approval-directed agent, “human preferences” are represented by human responses to questions of the form “how happy would you be if I did a?” If the agent is considering the action a precisely because it is manipulative or would thwart the user's attempts to correct the system, then it doesn't seem hard to predict that the overseer will object to a. Eliezer has suggested that this is a very anthropocentric judgment of “...

The Nonlinear Library: LessWrong
LW - Prosaic AI alignment by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 13:11


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 5: Prosaic AI alignment, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. (Related: a possible stance for AI control.) It's conceivable that we will build “prosaic” AGI, which doesn't reveal any fundamentally new ideas about the nature of intelligence or turn up any “unknown unknowns.” I think we wouldn't know how to align such an AGI; moreover, in the process of building it, we wouldn't necessarily learn anything that would make the alignment problem more approachable. So I think that understanding this case is a natural priority for research on AI alignment. In particular, I don't think it is reasonable to say “we'll know how to cross that bridge when we come to it,” or “it's impossible to do meaningful work without knowing more about what powerful AI will look like.” If you think that prosaic AGI is plausible, then we may already know what the bridge will look like when we get to it: if we can't do meaningful work now, then we have a problem. 1. Prosaic AGI It now seems possible that we could build “prosaic” AGI, which can replicate human behavior but doesn't involve qualitatively new ideas about “how intelligence works:” It's plausible that a large neural network can replicate “fast” human cognition, and that by coupling it to simple computational mechanisms — short and long-term memory, attention, etc. — we could obtain a human-level computational architecture. It's plausible that a variant of RL can train this architecture to actually implement human-level cognition. This would likely involve some combination of ingredients like model-based RL, imitation learning, or hierarchical RL. There are a whole bunch of ideas currently on the table and being explored; if you can't imagine any of these ideas working out, then I feel that's a failure of imagination (unless you see something I don't). We will certainly learn something by developing prosaic AGI. The very fact that there were no qualitatively new ideas is itself surprising. And beyond that, we'll get a few more bits of information about which particular approach works, fill in a whole bunch of extra details about how to design and train powerful models, and actually get some experimental data. But none of these developments seem to fundamentally change the alignment problem, and existing approaches to AI alignment are not bottlenecked on this kind of information. Actually having the AI in front of us may let us work several times more efficiently, but it's not going to move us from “we have no idea how to proceed” to “now we get it.” 2. Our current state 2a. The concern If we build prosaic superhuman AGI, it seems most likely that it will be trained by reinforcement learning (extending other frameworks to superhuman performance would require new ideas). It's easy to imagine a prosaic RL system learning to play games with superhuman levels of competence and flexibility. But we don't have any shovel-ready approach to training an RL system to autonomously pursue our values. To illustrate how this can go wrong, imagine using RL to implement a decentralized autonomous organization (DAO) which maximizes its profit. 
If we had very powerful RL systems, such a DAO might be able to outcompete human organizations at a wide range of tasks — producing and selling cheaper widgets, but also influencing government policy, extorting/manipulating other actors, and so on. The shareholders of such a DAO may be able to capture the value it creates as long as they are able to retain effective control over its computing hardware / reward signal. Similarly, as long as such DAOs are weak enough to be effectively governed by existing laws and institutions, they are likely to benefit humanity even if they reinvest all of their profits. But a...

The Nonlinear Library: LessWrong
LW - Approval-directed bootstrapping by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 2:03


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 7: Approval-directed bootstrapping, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Approval-directed behavior works best when the overseer is very smart. Where can we find a smart overseer? One approach is bootstrapping. By thinking for a long time, a weak agent can oversee an agent (slightly) smarter than itself. Now we have a slightly smarter agent, who can oversee an agent which is (slightly) smarter still. This process can go on, until the intelligence of the resulting agent is limited by technology rather than by the capability of the overseer. At this point we have reached the limits of our technology. This may sound exotic, but we can implement it in a surprisingly straightforward way. Suppose that we evaluate Hugh's approval by predicting what Hugh would say if we asked him; the rating of action a is what Hugh would say if, instead of taking action a, we asked Hugh, “How do you rate action a?” Now we get bootstrapping almost for free. In the process of evaluating a proposed action, Hugh can consult Arthur. This new instance of Arthur will, in turn, be overseen by Hugh—and in this new role Hugh can, in turn, be assisted by Arthur. In principle we have defined the entire infinite regress before Arthur takes his first action. We can even learn this function by examples — no elaborate definitions necessary. Each time Arthur proposes an action, we actually ask Hugh to evaluate the action with some probability, and we use our observations to train a model for Hugh's judgments. In practice, Arthur might not be such a useful assistant until he has acquired some training data. As Arthur acquires training data, the Hugh+Arthur system becomes more intelligent, and so Arthur acquires training data from a more intelligent overseer. The bootstrapping unfolds over time as Arthur adjusts to increasingly powerful overseers. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
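The bootstrapping loop described above can be sketched in a few lines, with the caveat that everything here is a toy stand-in: HughModel is a crude learned predictor of Hugh's ratings, rate_with_hugh is a placeholder for the real (Arthur-assisted) evaluation, and the query probability plays the role of "we actually ask Hugh to evaluate the action with some probability."

```python
# Toy sketch of approval-directed bootstrapping. All names and numbers
# (HughModel, rate_with_hugh, the toy actions) are illustrative assumptions.

import random

class HughModel:
    """Predicts Hugh's rating of an action; trained from sampled queries."""
    def __init__(self):
        self.data = {}

    def predict(self, action):
        return self.data.get(action, 0.5)   # prior guess before any data

    def update(self, action, rating):
        self.data[action] = rating


def rate_with_hugh(action, arthur_model):
    # Hypothetical stand-in for Hugh answering "How do you rate action a?",
    # possibly consulting Arthur (arthur_model) while deliberating.
    hint = arthur_model.predict(action)
    true_quality = {"help": 0.9, "stall": 0.2}.get(action, 0.5)
    return 0.8 * true_quality + 0.2 * hint


def choose_action(actions, arthur_model, query_prob=0.3):
    # Arthur acts on his current model of Hugh's judgments.
    best = max(actions, key=arthur_model.predict)
    if random.random() < query_prob:
        # Occasionally ask the (amplified) overseer and train on the answer.
        rating = rate_with_hugh(best, arthur_model)
        arthur_model.update(best, rating)
    return best


if __name__ == "__main__":
    model = HughModel()
    for _ in range(20):
        choose_action(["help", "stall"], model)
    print({a: model.predict(a) for a in ["help", "stall"]})
```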

The Nonlinear Library: LessWrong
LW - Approval-directed agents by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 25:24


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 6: Approval-directed agents, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Note: This is the first post from part two: basic intuitions of the sequence on iterated amplification. The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal. Research in AI is steadily progressing towards more flexible, powerful, and autonomous goal-directed behavior. This progress is likely to have significant economic and humanitarian benefits: it helps make automation faster, cheaper, and more effective, and it allows us to automate deciding what to do. Many researchers expect goal-directed machines to predominate, and so have considered the long-term implications of this kind of automation. Some of these implications are worrying: if sophisticated artificial agents pursue their own objectives and are as smart as we are, then the future may be shaped as much by their goals as by ours. Most thinking about “AI safety” has focused on the possibility of goal-directed machines, and asked how we might ensure that their goals are agreeable to humans. But there are other possibilities. In this post I will flesh out one alternative to goal-directed behavior. I think this idea is particularly important from the perspective of AI safety. Approval-directed agents Consider a human Hugh, and an agent Arthur who uses the following procedure to choose each action: Estimate the expected rating Hugh would give each action if he considered it at length. Take the action with the highest expected rating. I'll call this “approval-directed” behavior throughout this post, in contrast with goal-directed behavior. In this context I'll call Hugh an “overseer.” Arthur's actions are rated more highly than those produced by any alternative procedure. That's comforting, but it doesn't mean that Arthur is optimal. An optimal agent may make decisions that have consequences Hugh would approve of, even if Hugh can't anticipate those consequences himself. For example, if Arthur is playing chess he should make moves that are actually good—not moves that Hugh thinks are good. The quality of approval-directed decisions is limited by the minimum of Arthur's ability and Hugh's ability: Arthur makes a decision only if it looks good to both Arthur and Hugh. So why would Hugh be interested in this proposal, rather than doing things himself? Hugh doesn't actually rate actions, he just participates in a hypothetical rating process. So Hugh can oversee many agents like Arthur at once (and spend his actual time relaxing on the beach). In many cases, this is the whole point of automation. Hugh can (hypothetically) think for a very long time about each decision—longer than would be practical or cost-effective if he had to actually make the decision himself. Similarly, Hugh can think about Arthur's decisions at a very low level of detail. For example, Hugh might rate a chess-playing AI's choices about how to explore the game tree, rather than rating its final choice of moves. 
If Arthur is making billions of small decisions each second, then Hugh can think in depth about each of them, and the resulting system can be much smarter than Hugh. Hugh can (hypothetically) use additional resources in order to make his rating: powerful computers, the benefit of hindsight, many assistants, very long time periods. Hugh's capabilities can be gradually escalated as needed, and one approval-directed system can be used to bootstrap to a more effective successor. For example, Arthur could advise Hugh on how to define a better overseer; Arthur could offer advice in real-time to help Hugh be a better o...
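The decision rule itself is very small; the sketch below is only meant to pin down the "estimate the expected rating, then take the argmax" step, with expected_hugh_rating standing in for whatever predictor of the overseer's judgment the system actually has. It is an illustrative assumption, not code from the post.

```python
# Minimal sketch of the approval-directed decision rule described above.

from typing import Callable, Iterable

def approval_directed_step(actions: Iterable[str],
                           expected_hugh_rating: Callable[[str], float]) -> str:
    # Estimate the rating Hugh would give each action on reflection,
    # then take the action with the highest expected rating.
    return max(actions, key=expected_hugh_rating)

if __name__ == "__main__":
    ratings = {"resign": 0.1, "develop knight": 0.7, "sacrifice queen": 0.4}
    print(approval_directed_step(ratings, lambda a: ratings[a]))
```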

The Nonlinear Library: LessWrong
LW - The reward engineering problem by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 12:01


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 16: The reward engineering problem, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Today we usually train reinforcement learning agents to perform narrow tasks with simple goals. We may eventually want to train RL agents to behave “well” in open-ended environments where there is no simple goal. Suppose that we are trying to train an RL agent A. In each episode, A interacts with an environment, producing a transcript τ. We then evaluate that transcript, producing a reward r ∈ [0, 1]. A is trained to maximize its reward. We would like to set up the rewards so that A will learn to behave well — that is, such that if A learns to receive a high reward, then we will be happy with A's behavior. To make the problem feasible, we assume that we have access to another agent H which is “smarter” than A, and makes “good” decisions. In order to evaluate transcript τ, we allow ourselves to make any number of calls to H, and to use any other tools that are available. The question is: how do we carry out the evaluation, so that the optimal strategy for A is to also make “good” decisions? Following Daniel Dewey, I'll call this the reward engineering problem. Note that our evaluation process may be quite expensive, and actually implementing it may be infeasible. To build a working system, we would need to combine this evaluation with semi-supervised RL and learning with catastrophes. Possible approaches and remaining problems. I know of 3 basic approaches to reward engineering: Direct supervision. Use H to evaluate A's behavior, and train A to maximize H's evaluations. In some contexts we could compare two behaviors instead of evaluating one in isolation. Imitation learning. Use H to generate a bunch of transcripts, and train A to produce similar-looking transcripts. For example, we could train a model to distinguish A's behavior from H's behavior, and reward A when it fools the distinguisher. Inverse reinforcement learning. Use H to generate a bunch of transcripts, and then infer a reward function which is being approximately optimized by H. Use this reward function to evaluate A's behavior. All of these approaches are promising but face significant challenges. I'll describe some of these problems in the next 3 sections. 1. Direct supervision. In direct supervision, H looks at a transcript of A's behavior, and estimates how good that transcript is. To see the problem with this scheme, suppose that A has been asked to draw a picture, and A does it by copying an existing picture with some modifications. If originality is especially important, then this may be a very “bad” policy. But even if H is much smarter than A, it may be hard to tell that the picture is not original — creating a derivative work only requires looking at a single existing picture, while checking if a work is derivative requires considering every picture. More formally: in order for direct supervision to be effective, H needs to be better-informed than A about what is “good.” If this condition is satisfied, then from A's perspective, estimating H's estimate of goodness is equivalent to estimating actual goodness. This condition is superficially plausible — after all, we did assume that H is smarter than A.
The problem is that when A picks an action, A is especially well-informed about that action — the computation which produced the action provides evidence about it, and H may not have access to that evidence. Transparency. One response is to let H see how A computed its action. If H can understand that process, then H may be able to effectively evaluate the action. Sometimes this is straightforward: for example, if A uses an attention mechanism to look at a particular painting and copy it, we can simply tell H what A look...
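A minimal sketch of the direct-supervision setup, under the assumption that each episode yields a transcript, a smarter overseer H scores it in [0, 1], and that score is used as A's reward; the optional internals argument gestures at the transparency idea of letting H see how A computed its action. The environment, overseer, and scores below are all placeholders, not the post's method.

```python
# Hedged sketch of direct supervision: H scores transcripts and the score
# becomes the RL reward. run_episode and evaluate_with_H are toy stand-ins.

def run_episode(policy, env_seed):
    # Placeholder environment rollout: returns a transcript of actions.
    return {"seed": env_seed, "actions": [policy(env_seed)]}

def evaluate_with_H(transcript, internals=None):
    # H may optionally be shown the agent's internal computation
    # ("transparency") to catch failures like copying an existing picture.
    score = 1.0 if transcript["actions"] == ["original_work"] else 0.3
    if internals and internals.get("copied_from"):
        score = 0.0   # H penalizes a derivative work it can now detect
    return score

def reward_for(policy, env_seed, internals=None):
    transcript = run_episode(policy, env_seed)
    return evaluate_with_H(transcript, internals)

if __name__ == "__main__":
    copier = lambda seed: "copy_of_existing_picture"
    artist = lambda seed: "original_work"
    print(reward_for(copier, 0, internals={"copied_from": "picture_123"}))
    print(reward_for(artist, 0))
```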

The Nonlinear Library: LessWrong
LW - An unaligned benchmark by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 15:32


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 4: An unaligned benchmark, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. My goal is to design AI systems that are aligned with human interests and competitive with unaligned AI. I find it useful to have a particular AI algorithm in mind. Then I can think about how that algorithm could cause trouble, and try to find a safer variant. I think of the possibly-unaligned AIs as a benchmark: it's what AI alignment researchers need to compete with. The further we fall short of the benchmark, the stronger the competitive pressures will be for everyone to give up on aligned AI and take their chances. I have a few standard benchmarks I keep in mind. This post describes one of those benchmarks. It also tries to lay out clearly why I think that benchmark is unsafe, and explains how I think my current research could make a safe version. I. Model-based RL with MCTS. We train three systems in parallel: A generative model to sample sequences of observations, conditioned on sequences of actions. A reward function that takes as input a sequence of actions and predicted observations and produces a reward. A policy and value function which take as input a sequence of observations and produce the next action and an estimate of the future return. We train the policy and value function using (roughly) the AlphaZero algorithm: Use MCTS to improve the current policy. Update the policy at the root to predict the best move found by MCTS, update the value to predict its predicted value. Use the generative model to sample environment transitions and the reward function (with a small discount rate) to score them. We train an autoregressive generative model, to maximize the log probability assigned to the actual sequence of actions and observations produced by the AI (with each observation conditioned on the past actions). This isn't actually a good way to train the generative model, but it's not really central to the discussion. We train the reward function by showing humans sequences of actions and predicted observations, asking them to assign scores, then predicting those scores with supervised learning. We show humans the sequences of actions that look most promising to the system. There are plenty of details you'd need in order to make this work well, but that's the basic idea. When applied with very powerful networks, it's plausible that this system would be able to decisively outcompete humans. It would be capable of performing a large intelligent search over long sequences of actions to find those that would be rated highly. II. What goes wrong? There are two classes of problems: Problem 1: Bad objective. The goal of the system is to produce (action, observation) sequences that look good to humans. I claim that optimizing this objective faithfully will lead to bad outcomes. As the system improves, the rationale of many individual actions will become incomprehensible to a human overseer. At this point the only option for a human is to evaluate sequences of observations based on whether the consequences look good. The observations present a narrow view of the world, and I strongly suspect that the AI will find sequences of actions that make that narrow view look good without actually being good. Control vs. intrinsic goodness.
I think there are two strategies for defining a reward function: Reward worlds in which humans remain in control of the situation, in which they are able to get accurate information and correct course as needed. Reward worlds in which intrinsically good things are happening. Both of these strategies seem unworkable. Strategy #1: maintaining control. This appears to be unworkable because determining if humans are actually in control is incredibly difficult — at best you can tell w...
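A skeletal sketch of the three components the benchmark trains, shown only to fix the structure: a generative model of observations, a reward model fit to human scores, and a policy/value pair improved by search against the model. The classes and the brute-force "search" below are toy placeholders, not a working AlphaZero-style implementation.

```python
# Structural sketch of the benchmark's three learned components. Everything
# here is a placeholder assumption; no real MCTS or neural networks.

import random

class GenerativeModel:
    def sample_observation(self, actions):
        # Predict the next observation given the action history.
        return f"obs_after_{len(actions)}_actions"

class RewardModel:
    def __init__(self):
        self.labels = []   # (sequence, human_score) pairs

    def fit(self, sequence, human_score):
        self.labels.append((sequence, human_score))

    def score(self, sequence):
        return random.random() if not self.labels else self.labels[-1][1]

class PolicyValue:
    def act(self, observations):
        return random.choice(["a", "b"])

def search(policy_value, gen_model, reward_model, depth=3, rollouts=8):
    # Stand-in for MCTS: roll out short action sequences under the model
    # and keep the one the learned reward model scores highest.
    best, best_score = None, float("-inf")
    for _ in range(rollouts):
        actions, obs = [], []
        for _ in range(depth):
            actions.append(policy_value.act(obs))
            obs.append(gen_model.sample_observation(actions))
        s = reward_model.score((tuple(actions), tuple(obs)))
        if s > best_score:
            best, best_score = actions, s
    return best

if __name__ == "__main__":
    print(search(PolicyValue(), GenerativeModel(), RewardModel()))
```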

The Nonlinear Library: LessWrong
LW - Clarifying "AI Alignment" by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 5:17


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 3: Clarifying "AI Alignment", published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do. The “alignment problem” is the problem of building powerful AI systems that are aligned with their operators. This is significantly narrower than some other definitions of the alignment problem, so it seems important to clarify what I mean. In particular, this is the problem of getting your AI to try to do the right thing, not the problem of figuring out which thing is right. An aligned AI would try to figure out which thing is right, and like a human it may or may not succeed. Analogy Consider a human assistant who is trying their hardest to do what H wants. I'd say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I'd say we've solved the alignment problem. “Aligned” doesn't mean “perfect:” They could misunderstand an instruction, or be wrong about what H wants at a particular moment in time. They may not know everything about the world, and so fail to recognize that an action has a particular bad side effect. They may not know everything about H's preferences, and so fail to recognize that a particular side effect is bad. They may build an unaligned AI (while attempting to build an aligned AI). I use alignment as a statement about the motives of the assistant, not about their knowledge or ability. Improving their knowledge or ability will make them a better assistant — for example, an assistant who knows everything there is to know about H is less likely to be mistaken about what H wants — but it won't make them more aligned. (For very low capabilities it becomes hard to talk about alignment. For example, if the assistant can't recognize or communicate with H, it may not be meaningful to ask whether they are aligned with H.) Clarifications The definition is intended de dicto rather than de re. An aligned A is trying to “do what H wants it to do.” Suppose A thinks that H likes apples, and so goes to the store to buy some apples, but H really prefers oranges. I'd call this behavior aligned because A is trying to do what H wants, even though the thing it is trying to do (“buy apples”) turns out not to be what H wants: the de re interpretation is false but the de dicto interpretation is true. An aligned AI can make errors, including moral or psychological errors, and fixing those errors isn't part of my definition of alignment except insofar as it's part of getting the AI to “try to do what H wants” de dicto. This is a critical difference between my definition and some other common definitions. I think that using a broader definition (or the de re reading) would also be defensible, but I like it less because it includes many subproblems that I think (a) are much less urgent, (b) are likely to involve totally different techniques than the urgent part of alignment. An aligned AI would also be trying to do what H wants with respect to clarifying H's preferences. For example, it should decide whether to ask if H prefers apples or oranges, based on its best guesses about how important the decision is to H, how confident it is in its current guess, how annoying it would be to ask, etc. 
Of course, it may also make a mistake at the meta level — for example, it may not understand when it is OK to interrupt H, and therefore avoid asking questions that it would have been better to ask. This definition of “alignment” is extremely imprecise. I expect it to correspond to some more precise concept that cleaves reality at the joints. But that might not become clear, one way or the other, until we've made significant progress. One reason the definition is imprecise is ...

The Nonlinear Library: LessWrong
LW - The Steering Problem by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 11:53


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 2: The Steering Problem, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Most AI research focuses on reproducing human abilities: to learn, infer, and reason; to perceive, plan, and predict. There is a complementary problem which (understandably) receives much less attention: if you had these abilities, what would you do with them? The steering problem: Using black-box access to human-level cognitive abilities, can we write a program that is as useful as a well-motivated human with those abilities? This post explains what the steering problem is and why I think it's worth spending time on. Introduction A capable, well-motivated human can be extremely useful: they can work without oversight, produce results that need not be double-checked, and work towards goals that aren't precisely defined. These capabilities are critical in domains where decisions cannot be easily supervised, whether because they are too fast, too complex, or too numerous. In some sense “be as useful as possible” is just another task at which a machine might reach human-level performance. But it is different from the concrete capabilities normally considered in AI research. We can say clearly what it means to "predict well," "plan well," or "reason well." If we ignored computational limits, machines could achieve any of these goals today. And before the existing vision of AI is realized, we must necessarily achieve each of these goals. For now, "be as useful as possible" is in a different category. We can't say exactly what it means. We could not do it no matter how fast our computers could compute. And even if we resolved the most salient challenges in AI, we could remain in the dark about this one. Consider a capable AI tasked with running an academic conference. How should it use its capabilities to make decisions? We could try to specify exactly what makes a conference good or bad. But our requirements are complex and varied, and so specifying them exactly seems time-consuming or impossible. We could build an AI that imitates successful conference organizers. But this approach can never do any better than the humans we are imitating. Realistically, it won't even match human performance unless we somehow communicate what characteristics are important and why. We could ask an AI to maximize our satisfaction with the conference. But we'll get what we measure. An extensive evaluation would greatly increase the cost of the conference, while a superficial evaluation would leave us with a conference optimized for superficial metrics. Everyday experience with humans shows how hard delegation can be, and how much easier it is to assign a task to someone who actually cares about the outcome. Of course there is already pressure to write useful programs in addition to smart programs, and some AI research studies how to efficiently and robustly communicate desired behaviors. For now, available solutions apply only in limited domains or to weak agents. The steering problem is to close this gap. Motivation A system which "merely" predicted well would be extraordinarily useful. Why does it matter whether we know how to make a system which is “as useful as possible”? Our machines will probably do some things very effectively. 
We know what it means to "act well" in the service of a given goal. For example, using human cognitive abilities as a black box, we could probably design autonomous corporations which very effectively maximized growth. If the black box was cheaper than the real thing, such autonomous corporations could displace their conventional competitors. If machines can do everything equally well, then this would be great news. If not, society's direction may be profoundly influenced by what can and cannot...

The Nonlinear Library: LessWrong
LW - Preface to the sequence on iterated amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 4:22


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 1: Preface to the sequence on iterated amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This sequence describes iterated amplification, a possible strategy for building an AI that is actually trying to do what we want out of ML systems trained by gradient descent. Iterated amplification is not intended to be a silver bullet that resolves all of the possible problems with AI; it's an approach to the particular alignment problem posed by scaled-up versions of modern ML systems. Iterated amplification is based on a few key hopes: If you have an overseer who is smarter than the agent you are trying to train, you can safely use that overseer's judgment as an objective. We can train an RL system using very sparse feedback, so it's OK if that overseer is very computationally expensive. A team of aligned agents may be smarter than any individual agent, while remaining aligned. If all of these hopes panned out, then at every point in training “a team of the smartest agents we've been able to train so far” would be a suitable overseer for training a slightly smarter aligned successor. This could let us train very intelligent agents while preserving alignment (starting the induction from an aligned human). Iterated amplification is still in a preliminary state and is best understood as a research program rather than a worked out solution. Nevertheless, I think it is the most concrete existing framework for aligning powerful ML with human interests. Purpose and audience. The purpose of this sequence is to communicate the basic intuitions motivating iterated amplification, to define iterated amplification, and to present some of the important open questions. I expect this sequence to be most useful for readers who would like to have a somewhat detailed understanding of iterated amplification, and are looking for something more structured than ai-alignment.com to help orient themselves. The sequence is intended to provide enough background to follow most public discussion about iterated amplification, and to be useful for building intuition and informing research about AI alignment even if you never think about amplification again. The sequence will be easier to understand if you have a working understanding of ML, statistics, and online learning, and if you are familiar with other work on AI alignment. But it would be reasonable to just dive in and skip over any detailed discussion that seems to depend on missing prerequisites. Outline and reading recommendations. The first part of this sequence clarifies the problem that iterated amplification is trying to solve, which is both narrower and broader than you might expect. The second part of the sequence outlines the basic intuitions that motivate iterated amplification. I think that these intuitions may be more important than the scheme itself, but they are considerably more informal. The core of the sequence is the third section. Benign model-free RL describes iterated amplification, as a general framework into which we can substitute arbitrary algorithms for reward learning, amplification, and robustness.
The first four posts all describe variants of this idea from different perspectives, and if you find that one of those descriptions is clearest for you then I recommend focusing on that one and skimming the others. The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches. The final section is an FAQ by Alex Zhu, included as append...

The Nonlinear Library: LessWrong
LW - Directions and desiderata for AI alignment by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 22:44


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 15: Directions and desiderata for AI alignment, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Note: This is the first post from part four: what needs doing of the sequence on iterated amplification. The fourth part of the sequence describes some of the black boxes in iterated amplification and discusses what we would need to do to fill in those boxes. I think these are some of the most important open questions in AI alignment. In the first half of this post, I'll discuss three research directions that I think are especially promising and relevant to AI alignment: Reliability and robustness. Building ML systems which behave acceptably in the worst case rather than only on the training distribution. Oversight / reward learning. Constructing objectives and training strategies which lead our policies to do what we intend. Deliberation and amplification. Surpassing human performance without simultaneously abandoning human preferences. I think that we have several angles of attack on each of these problems, and that solutions would significantly improve our ability to align AI. My current feeling is that these areas cover much of the key work that needs to be done. In the second half of the post, I'll discuss three desiderata that I think should guide research on alignment: Secure. Our solutions should work acceptably even when the environment itself is under the influence of an adversary. Competitive. Our solutions should impose minimal overhead, performance penalties, or restrictions compared to malign AI. Scalable. Our solutions should continue to work well even when the underlying learning systems improve significantly. I think that taking these requirements seriously leads us to substantially narrow our focus. It may turn out that these desiderata are impossible to meet, but if so I think that the first order of business should be understanding clearly why they are impossible. This would let us better target our work on alignment and better prepare for a future where we won't have a completely satisfying solution to alignment. (The ideas in this post are not novel. My claimed contribution is merely collecting these things together. I will link to my own writing on each topic in large part because that's what I know.) I. Research directions. 1. Reliability and robustness. Traditional ML algorithms optimize a model or policy to perform well on the training distribution. These models can behave arbitrarily badly when we move away from the training distribution. Similarly, they can behave arbitrarily badly on a small part of the training distribution. I think this is bad news: Deploying ML systems will critically change their environment, in a way that is hard or impossible to simulate at training time. (The “treacherous turn” is a special case of this phenomenon.) Deployed ML systems are interconnected and exposed to the same world. So if conditions change in a way that causes one of them to fail, many systems may fail simultaneously. If ML systems are extremely powerful, or if they play a critical role in society, then a widespread failure may have catastrophic consequences.
I'm aware of three basic approaches to reliability that seem to me like they could plausibly scale and be competitive: (ETA: this list is superseded by the list in Techniques for Optimizing Worst-Case Performance. I removed consensus and added interpretability and verification. I don't discuss “learning the right model,” which I still consider a long shot.) Adversarial training. At training time, attempt to construct inputs that induce problematic behavior and train on those. Eventually, we hope there will be no catastrophe-inducing inputs left. We don't yet know what is possible to achieve. (Szegedy 2014...
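The adversarial-training item above has a simple shape that a toy loop can illustrate: search for an input that makes the current model misbehave, train on it, and repeat until no such input is found. The "model", attacker, and update rule below are illustrative placeholders only, not a real training procedure.

```python
# Toy sketch of adversarial training for worst-case behavior. All pieces
# (misbehaves, find_bad_input, the parameter update) are assumptions.

def misbehaves(model_param, x):
    # Placeholder notion of "problematic behavior" on input x.
    return model_param * x < 0

def find_bad_input(model_param, candidates):
    # A weak attacker: brute-force search over a fixed pool of inputs.
    for x in candidates:
        if misbehaves(model_param, x):
            return x
    return None

def adversarial_training(model_param, candidates, step=0.5, max_rounds=20):
    for _ in range(max_rounds):
        x = find_bad_input(model_param, candidates)
        if x is None:
            break                     # no failure-inducing input found
        model_param += step * x       # "train on" the adversarial input (toy update)
    return model_param

if __name__ == "__main__":
    print(adversarial_training(-1.0, candidates=[1.0, 2.0, 3.0]))
```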

The Nonlinear Library: LessWrong
LW - AlphaGo Zero and capability amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 3:35


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 14: AlphaGo Zero and capability amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. AlphaGo Zero is an impressive demonstration of AI capabilities. It also happens to be a nice proof-of-concept of a promising alignment strategy. How AlphaGo Zero works. AlphaGo Zero learns two functions (which take as input the current board): A prior over moves p is trained to predict what AlphaGo will eventually decide to do. A value function v is trained to predict which player will win (if AlphaGo plays both sides). Both are trained with supervised learning. Once we have these two functions, AlphaGo actually picks its moves by using 1600 steps of Monte Carlo tree search (MCTS), using p and v to guide the search. It trains p to bypass this expensive search process and directly pick good moves. As p improves, the expensive search becomes more powerful, and p chases this moving target. Iterated capability amplification. In the simplest form of iterated capability amplification, we train one function: A “weak” policy A, which is trained to predict what the agent will eventually decide to do in a given situation. Just like AlphaGo doesn't use the prior p directly to pick moves, we don't use the weak policy A directly to pick actions. Instead, we use a capability amplification scheme: we call A many times in order to produce more intelligent judgments. We train A to bypass this expensive amplification process and directly make intelligent decisions. As A improves, the amplified policy becomes more powerful, and A chases this moving target. In the case of AlphaGo Zero, A is the prior over moves, and the amplification scheme is MCTS. (More precisely: A is the pair (p, v), and the amplification scheme is MCTS + using a rollout to see who wins.) Outside of Go, A might be a question-answering system, which can be applied several times in order to first break a question down into pieces and then separately answer each component. Or it might be a policy that updates a cognitive workspace, which can be applied many times in order to “think longer” about an issue. The significance. Reinforcement learners take a reward function and optimize it; unfortunately, it's not clear where to get a reward function that faithfully tracks what we care about. That's a key source of safety concerns. By contrast, AlphaGo Zero takes a policy-improvement-operator (like MCTS) and converges towards a fixed point of that operator. If we can find a way to improve a policy while preserving its alignment, then we can apply the same algorithm in order to get very powerful but aligned strategies. Using MCTS to achieve a simple goal in the real world wouldn't preserve alignment, so it doesn't fit the bill. But “think longer” might. As long as we start with a policy that is close enough to being aligned — a policy that “wants” to be aligned, in some sense — allowing it to think longer may make it both smarter and more aligned. I think designing alignment-preserving policy amplification is a tractable problem today, which can be studied either in the context of existing ML or human coordination. So I think it's an exciting direction in AI alignment.
A candidate solution could be incorporated directly into the AlphaGo Zero architecture, so we can already get empirical feedback on what works. If by good fortune powerful AI systems look like AlphaGo Zero, then that might get us much of the way to an aligned AI. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
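The pattern described above, training a fast function to match an improvement operator applied to itself until it reaches that operator's fixed point, can be illustrated with a deliberately tiny example. Here the "operator" is one step of lookahead on a toy chain environment, standing in for MCTS or "thinking longer"; only the analogy, not the specifics, is intended.

```python
# Toy illustration of converging to the fixed point of an improvement
# operator. The chain environment and lookahead operator are assumptions
# chosen only to make the pattern runnable.

GAMMA = 0.9
STATES = range(5)

def reward(s):
    return 1.0 if s == 4 else 0.0

def next_state(s):
    return min(s + 1, 4)

def improve(value_fn):
    # Improvement operator: one step of lookahead using the current values.
    return {s: reward(s) + GAMMA * value_fn[next_state(s)] for s in STATES}

def train_fixed_point(rounds=50):
    value_fn = {s: 0.0 for s in STATES}     # the fast, directly-queried function
    for _ in range(rounds):
        target = improve(value_fn)          # expensive "amplified" judgments
        value_fn = target                   # "distill" them into the fast function
    return value_fn

if __name__ == "__main__":
    print(train_fixed_point())
```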

The Nonlinear Library: LessWrong
LW - Learning with catastrophes by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 6:09


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 18: Learning with catastrophes, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. A catastrophe is an event so bad that we are not willing to let it happen even a single time. For example, we would be unhappy if our self-driving car ever accelerates to 65 mph in a residential area and hits a pedestrian. Catastrophes present a theoretical challenge for traditional machine learning — typically there is no way to reliably avoid catastrophic behavior without strong statistical assumptions. In this post, I'll lay out a very general model for catastrophes in which they are avoidable under much weaker statistical assumptions. I think this framework applies to the most important kinds of catastrophe, and will be especially relevant to AI alignment. Designing practical algorithms that work in this model is an open problem. In a subsequent post I describe what I currently see as the most promising angles of attack. Modeling catastrophes. We consider an agent A interacting with the environment over a sequence of episodes. Each episode produces a transcript τ, consisting of the agent's observations and actions, along with a reward r ∈ [0, 1]. Our primary goal is to quickly learn an agent which receives high reward. (Supervised learning is the special case where each transcript consists of a single input and a label for that input.) While training, we assume that we have an oracle which can determine whether a transcript τ is “catastrophic.” For example, we might show a transcript to a QA analyst and ask them if it looks catastrophic. This oracle can be applied to arbitrary sequences of observations and actions, including those that don't arise from an actual episode. So training can begin before the very first interaction with nature, using only calls to the oracle. Intuitively, a transcript should only be marked catastrophic if it satisfies two conditions: The agent made a catastrophically bad decision. The agent's observations are plausible: we have a right to expect the agent to be able to handle those observations. While actually interacting with the environment, the agent cannot query the oracle — there is no time to wait for a QA engineer to review a proposed action to check if it would be catastrophic. Moreover, if interaction with nature ever produces a catastrophic transcript, we immediately fail. The performance of an algorithm is characterized by two parameters: the probability of catastrophic failure, and the total reward assuming no catastrophic failure. We assume that there are some policies such that no matter what nature does, the resulting transcript is never catastrophic. Traditionally in RL the goal is to get as much reward as the best policy from some class C. We'll slightly weaken that goal, and instead aim to do as well as the best policy from C that never makes a catastrophic decision. Batch learning. I've described an online version of learning with catastrophes. We can also consider the batch version, where the learner is first given a large number of “training” episodes. In the batch version, there is no penalty for catastrophes at training time, and we don't care about training error. The two performance parameters are test-time performance and test-time catastrophe probability.
The oracle. This definition depends on an oracle who determines which transcripts are catastrophic. For weak AI systems, the oracle may be a human. But a powerful AI system might take actions which are catastrophic but which look inoffensive to a human judge, so this approach doesn't cut it. In general, the judge should be a human+AI team which is more competent than the system being trained, armed with an adequate solution to the informed oversight problem. Approach. Learning with catastro...
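A hedged sketch of the training-time part of this setup: the catastrophe oracle can be queried on arbitrary, possibly synthetic transcripts before any interaction with nature, and a policy is rejected if any such transcript is marked catastrophic. The oracle, policies, and observation space below are toys chosen to echo the self-driving-car example, not the post's algorithm.

```python
# Toy sketch of training-time auditing with a catastrophe oracle.
# oracle_is_catastrophic stands in for the QA analyst / human+AI judge.

import itertools

def oracle_is_catastrophic(transcript):
    # Placeholder judgment: flags the catastrophically bad action.
    return "speed_65_in_residential_zone" in transcript

def synthetic_transcripts(policy, observation_space, horizon=2):
    # Enumerate short observation sequences and record the policy's actions;
    # these transcripts need not come from real episodes.
    for obs_seq in itertools.product(observation_space, repeat=horizon):
        yield tuple(policy(o) for o in obs_seq)

def passes_training_audit(policy, observation_space):
    return not any(oracle_is_catastrophic(t)
                   for t in synthetic_transcripts(policy, observation_space))

if __name__ == "__main__":
    reckless = lambda obs: "speed_65_in_residential_zone" if obs == "clear_road" else "slow_down"
    careful = lambda obs: "slow_down"
    obs_space = ["clear_road", "pedestrian_ahead"]
    print(passes_training_audit(reckless, obs_space))  # False
    print(passes_training_audit(careful, obs_space))   # True
```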

The Nonlinear Library: LessWrong
LW - Techniques for optimizing worst-case performance by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 13:57


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 20: Techniques for optimizing worst-case performance, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. If powerful ML systems fail catastrophically, they may be able to quickly cause irreversible damage. To be safe, it's not enough to have an average-case performance guarantee on the training distribution — we need to ensure that even if our systems fail on new distributions or with small probability, they will never fail too badly. The difficulty of optimizing worst-case performance is one of the most likely reasons that I think prosaic AI alignment might turn out to be impossible (if combined with an unlucky empirical situation). In this post I want to explain my view of the problem and enumerate some possible angles of attack. My goal is to communicate why I have hope that worst-case guarantees are achievable. None of these are novel proposals. The intention of this post is to explain my view, not to make a new contribution. I don't currently work in any of these areas, and so this post should be understood as an outsider looking in, rather than coming from the trenches. Malign vs. benign failures and corrigibility I want to distinguish two kinds of failures: “Benign” failures, where our system encounters a novel situation, doesn't know how to handle it, and so performs poorly. The resulting behavior may simply be erratic, or may serve an external attacker. Their effect is similar to physical or cybersecurity vulnerabilities — they create an opportunity for destructive conflict but don't systematically disfavor human values. They may pose an existential risk when combined with high-stakes situations, in the same way that human incompetence may pose an existential risk. Although these failures are important, I don't think it is necessary or possible to eliminate them in the worst case. “Malign” failures, where our system continues to behave competently but applies its intelligence in the service of an unintended goal. These failures systematically favor whatever goals AI systems tend to pursue in failure scenarios, at the expense of human values. They constitute an existential risk independent of any other destructive technology or dangerous situation. Fortunately, they seem both less likely and potentially possible to avoid even in the worst case. I'm most interested in malign failures, and the narrower focus is important to my optimism. The distinction between malign and benign failures is not always crisp. For example, suppose we try to predict a human's preferences, then search over all strategies to find the one that best satisfies the predicted preferences. Guessing the preferences even a little bit wrong would create an adversarial optimizer incentivized to apply its intelligence to a purpose at odds with our real preferences. If we take this approach, incompetence does systematically disfavor human values. By aiming for corrigible rather than optimal behavior (see here or here) I'm optimistic that it is possible to create a sharper distinction between benign and malign failures, which can be leveraged by the techniques below. But for now, this hope is highly speculative. 
Amplification. I believe that these techniques are much more likely to work if we have access to an overseer who is significantly smarter than the model that we are trying to train. I hope that amplification makes this possible. It seems realistic for a strong overseer to recognize an (input, output) pair as a malign failure mode (though it may require a solution to informed oversight). So now we have a concrete goal: find a model that never gives an output the overseer would diagnose as catastrophically bad. Historically researchers in the AI safety community have been extremely pessimistic about...

The Nonlinear Library: LessWrong
LW - Reliability amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 12:17


Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 21: Reliability amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. In a recent post I talked about capability amplification, a putative procedure that turns a large number of fast weak agents into a slower, stronger agent. If we do this in a naive way, it will decrease reliability. For example, if our weak policy fails with probability 1%, if we implement a strong policy by combining 10 decisions made by weak agents, and if the combination is bad whenever any of these 10 decisions is bad, then the combination will be bad with roughly 10% probability. Although the combination can be more powerful than any individual decision, in this case it is much less reliable. If we repeat policy amplification several times, our failure probability could quickly approach 1, even if it started out being exponentially small. Complementary to capability amplification is reliability amplification: given a policy that usually works, can we implement a policy that works significantly more reliably? To be slightly less imprecise (but still quite crude): Given a distribution A over policies that is ε-close to a benign policy for some ε ≪ 1, can we implement a distribution A⁺ over policies which is δ-close to a benign policy of similar capability, for some δ ≪ ε? When I say that A is ε-close to a benign policy, I mean that there is a benign distribution over policies which has total variation distance at most ε from A. We can view this as a simple mixture between a single aligned policy that is chosen with probability (1−ε) and an adversarial policy that is chosen with probability ε. I will move freely between talking about “policies” and “distributions over policies;” we can turn a distribution into a single policy by averaging. The notion of “implement” here is exactly the same as for capability amplification: we can use any number of implementations of A and whatever other resources are at our disposal, and must compute A⁺'s responses to a sequence of observations, within some time limit such as a day. Reliability amplification via deliberation. I think there is one very natural way to do reliability amplification — if some variant of this idea doesn't work, I'm not sure what else to try. As a warm-up, let's consider the setting where the agents want to make a single decision rather than a sequence of decisions — we'll return to the sequential case soon. We sample three agents A, B, and C from the given distribution. We ask them to talk for a while and then each propose an action. Then we ask them to do approval voting amongst the three options, and randomly pick from the maximally-approved-of proposals. If all three agents are aligned, then this proposal is clearly going to produce an aligned action. If two of the agents behave adversarially then we are clearly out of luck, but that happens only with probability O(ε²), so we can basically ignore it. The interesting case is when there is one adversarial agent and two aligned agents. If we threw up our hands and picked randomly in this case, then we would have a failure probability of ε. So the question is: can we do better than that? The obvious hope is that aligned agents can do better-than-chance at recognizing each other / at recognizing good actions.
If they can get any non-negligible advantage, then we can get a non-negligible reduction in error probability (and by iterating the procedure we can potentially get large reductions). The sequential case. So far we have talked about making a single decision. The problem is more challenging when we need to make a sequence of decisions. We can try to simply generalize the voting approach, running a new vote for each action. To see the difficulty, suppose that the optimal policy looks as follo...
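To make the single-decision voting scheme concrete, here is a toy Monte Carlo sketch (mine, not from the post) of the failure model described above. The `advantage` parameter is a made-up stand-in for how much better than chance the two aligned agents are at steering the vote away from the adversarial proposal; with zero advantage the combined failure rate stays near ε, and with a large advantage it drops toward the O(ε²) floor.

```python
import random

def combined_failure_prob(eps, advantage, trials=200_000, seed=0):
    """Monte Carlo estimate of the failure probability of a 3-agent vote.

    eps       -- probability that any single sampled agent is adversarial
    advantage -- how much better than chance the two aligned agents are at
                 avoiding the adversarial proposal (0 = random pick, 1 = perfect)
    """
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        adversarial = sum(rng.random() < eps for _ in range(3))
        if adversarial >= 2:
            # two or more adversarial agents can always win the vote
            failures += 1
        elif adversarial == 1:
            # a random pick selects the bad proposal 1/3 of the time;
            # deliberation shrinks that by a factor of (1 - advantage)
            failures += rng.random() < (1 - advantage) / 3
    return failures / trials

if __name__ == "__main__":
    for adv in (0.0, 0.5, 0.9):
        print(adv, combined_failure_prob(eps=0.01, advantage=adv))
```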

The Nonlinear Library: LessWrong
LW - Security amplification by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 21:18


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 22: Security amplification, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. An apparently aligned AI system may nevertheless behave badly with small probability or on rare “bad” inputs. The reliability amplification problem is to reduce the failure probability of an aligned AI. The analogous security amplification problem is to reduce the prevalence of bad inputs on which the failure probability is unacceptably high. We could measure the prevalence of bad inputs by looking at the probability that a random input is bad, but I think it is more meaningful to look at the difficulty of finding a bad input. If it is exponentially difficult to find a bad input, then in practice we won't encounter any. If we could transform a policy in a way that multiplicatively increases the difficulty of finding a bad input, then by interleaving that process with a distillation step like imitation or RL we could potentially train policies which are as secure as the learning algorithms themselves — eliminating any vulnerabilities introduced by the starting policy. For sophisticated AI systems, I currently believe that meta-execution is a plausible approach to security amplification. (ETA: I still think that this basic approach to security amplification is plausible, but it's now clear that meta-execution on its own can't work.) Motivation. There are many inputs on which any particular implementation of “human judgment” will behave surprisingly badly, whether because of trickery, threats, bugs in the UI used to elicit the judgment, snow-crash-style weirdness, or whatever else. (The experience of computer security suggests that complicated systems typically have many vulnerabilities, both on the human side and the machine side.) If we aggressively optimize something to earn high approval from a human, it seems likely that we will zoom in on the unreasonable part of the space and get an unintended result. What's worse, this flaw seems to be inherited by any agent trained to imitate human behavior or optimize human approval. For example, inputs which cause humans to behave badly would also cause a competent human-imitator to behave badly. The point of security amplification is to remove these human-generated vulnerabilities. We can start with a human, use them to train a learning system (that inherits the human vulnerabilities), use security amplification to reduce these vulnerabilities, use the result to train a new learning system (that inherits the reduced set of vulnerabilities), apply security amplification to reduce those vulnerabilities further, and so on. The agents do not necessarily get more powerful over the course of this process — we are just winnowing away the idiosyncratic human vulnerabilities. This is important, if possible, because it (1) lets us train more secure systems, which is good in itself, and (2) allows us to use weak aligned agents as reward functions for an extensive search. I think that for now this is one of the most plausible paths to capturing the benefits of extensive search without compromising alignment.
Security amplification would not be directly usable as a substitute for informed oversight, or to protect an overseer from the agent it is training, because informed oversight is needed for the distillation step which allows us to iterate security amplification without exponentially increasing costs. Note that security amplification + distillation will only remove the vulnerabilities that came from the human. We will still be left with vulnerabilities introduced by our learning process, and with any inherent limits on our model's ability to represent/learn a secure policy. So we'll have to deal with those problems separately. Towards a definition. The security amplif...

The Nonlinear Library: LessWrong
LW - Meta-execution by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 7:41


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 23: Meta-execution, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This post describes meta-execution, my current proposal for capability amplification and security amplification. (Meta-execution is annotated functional programming + strong HCH + a level of indirection. It is implemented in the amplify module of my ALBA repository.) Goal. We are given an efficient agent A which competently pursues some values. We'd like to use a bunch of copies of A in order to implement a more powerful and robust agent Meta(A) with the same values. Outline. Our basic plan is to build a machine out of copies of the agent; instead of asking the agent to make a decision directly, we ask it to implement the decision-making process by answering a sequence of questions of the form “what should happen next?” The basic object in meta-execution is a message, which consists of text along with pointers to other messages or to agents. Each box is a message. A is an agent who can respond to queries like “which of X and Y is larger?” We can represent arbitrarily large objects as giant trees of messages and agents. Meta-execution first forms a tree representing the question “what should be done?” It then asks the agent A to perform a sequence of operations on the tree that eventually lead to an answer. Then it executes that answer. The initial tree might look something like this: If you can answer this question, you can implement an agent. At any given time, an agent who is operating on this tree can only “see” a few messages: it can read the text of those messages, and see pointers like [red] and [blue]. Initially the agent can see only the root of the tree. If you are an agent tasked with processing a message, there are a few basic operations you can perform. You specify the “targets” of the action by specifying pointers you want to follow: Look at another part of the tree which is not currently visible. Spawn a new agent, and see a pointer to that agent. Send a message to an agent, and see its reply. You can compose a message by writing it out with sub-messages in parentheses (); for example, “What is the smallest element in (the list with first element [green] and remaining elements [purple]) according to [blue]” would produce the message in the first image above, if [green], [purple], [blue] had appropriate values. Terminate the current computation by composing a reply. This reply gets sent to the “parent” who initiated the current computation. In the case of the very first agent, who was created in order to answer the original question “what should an agent in state [red] do after receiving input [blue]?”, the reply specifies what the overall system should do. An example execution is illustrated here. I made a quick demo of the execution process, you can find it here. And that's basically it. We spawn a new agent, and hand it the “what should we do?” message. It can then take any of the basic actions listed above and see the result. We repeat that process until the agent returns a message indicating what should be done. We parse the message as an action and new state (see the section on parsing below), we execute the action, and we update the system's state.
The details. Hopefully for most purposes that outline tells you everything you need to know. If not, the easiest way to learn exactly how this works is probably just to look at the code. Meta-execution is implemented as lambda A : Meta(HCH(A, n)) in the package amplify.__init__, where n is the computational budget and A is the meta-executor. You can experience being the meta-executor by calling examples.meta.act("test"). The available commands are described in the README. Everything is immutable. I assume that we have a digital implementation o...
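As a rough illustration of the message-and-pointer structure the outline describes, here is a minimal sketch. The class names, the single-step driver, and the toy policy are my own illustrative assumptions; they are not the interface of the amplify module in the ALBA repository.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Union

@dataclass
class Message:
    text: str                                   # template text with [0], [1], ... pointer slots
    pointers: List["Node"] = field(default_factory=list)

@dataclass
class Agent:
    policy: Callable[["Message"], "Message"]    # given the visible message, compose a reply

Node = Union[Message, Agent]

def meta_step(agent: Agent, question: Message) -> Message:
    """One step of the loop described above: hand the agent the visible root
    message and take whatever reply it composes. A full meta-executor would
    also let the agent follow pointers, spawn sub-agents and send sub-queries
    before terminating with an answer."""
    return agent.policy(question)

# Toy agent for a query like "which of [0] and [1] is larger?"
def larger_policy(msg: Message) -> Message:
    x, y = (p.text for p in msg.pointers)
    return Message("the larger element is " + max(x, y, key=float))

root = Message("which of [0] and [1] is larger?", [Message("3"), Message("7")])
print(meta_step(Agent(larger_policy), root).text)   # -> the larger element is 7
```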

The Nonlinear Library: LessWrong
LW - Thoughts on reward engineering by paulfchristiano from Iterated Amplification

The Nonlinear Library: LessWrong

Play Episode Listen Later Dec 24, 2021 18:03


Link to original article. Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Iterated Amplification, Part 19: Thoughts on reward engineering, published by paulfchristiano. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. Note: This is the first post from part five: possible approaches of the sequence on iterated amplification. The fifth section of the sequence breaks down some of these problems further and describes some possible approaches. Suppose that I would like to train an RL agent to help me get what I want. If my preferences could be represented by an easily-evaluated utility function, then I could just use my utility function as the agent's reward function. But in the real world that's not what human preferences look like. So if we actually want to turn our preferences into a reward function suitable for training an RL agent, we have to do some work. This post is about the straightforward parts of reward engineering. I'm going to deliberately ignore what seem to me to be the hardest parts of the problem. Getting the straightforward parts out of the way seems useful for talking more clearly about the hard parts (and you never know what questions may turn out to be surprisingly subtle). The setting. To simplify things even further, for now I'll focus on the special case where our agent is taking a single action a. All of the difficulties that arise in the single-shot case also arise in the sequential case, but the sequential case also has its own set of additional complications that deserve their own post. Throughout the post I will imagine myself in the position of an “overseer” who is trying to specify a reward function R(a) for an agent. You can imagine the overseer as the user themselves, or (more realistically) as a team of engineers and/or researchers who are implementing a reward function intended to express the user's preferences. I'll often talk about the overseer computing R(a) themselves. This is at odds with the usual situation in RL, where the overseer implements a very fast function for computing R(a) in general (“1 for a win, 0 for a draw, -1 for a loss”). Computing R(a) for a particular action a is strictly easier than producing a fast general implementation, so in some sense this is just another simplification. I talk about why it might not be a crazy simplification in section 6. Contents: Long time horizons. How do we train RL agents when we care about the long-term effects of their actions? Inconsistency and unreliability. How do we handle the fact that we have only imperfect access to our preferences, and different querying strategies are not guaranteed to yield consistent or unbiased answers? Normative uncertainty. How do we train an agent to behave well in light of its uncertainty about our preferences? Widely varying reward. How do we handle rewards that may vary over many orders of magnitude? Sparse reward. What do we do when our preferences are very hard to satisfy, such that they don't provide any training signal? Complex reward. What do we do when evaluating our preferences is substantially more expensive than running the agent? Conclusion. Appendix: harder problems. 1. Long time horizons. A single decision may have very long-term effects. For example, even if I only care about maximizing human happiness, I may instrumentally want my agent to help advance basic science that will one day improve cancer treatment.
In principle this could fall out of an RL task with “human happiness” as the reward, so we might think that neglecting long-term effects is just a shortcoming of the single-shot problem. But even in theory there is no way that an RL agent can learn to handle arbitrarily long-term dependencies (imagine training an RL agent to handle 40 year time horizons), and so focusing on the sequential RL problem doesn't address this issue. I think that the only real approach is t...
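The distinction between the overseer implementing a fast general reward function and computing R(a) only for particular proposed actions can be made concrete with a small sketch. The `OverseerReward` wrapper and its caching behaviour are my own illustrative assumptions, not anything specified in the post.

```python
from typing import Callable, Dict, Hashable

class OverseerReward:
    """Reward obtained by querying an expensive overseer (e.g. a human, or a
    team implementing the user's preferences) only for the particular actions
    the agent actually proposes, instead of writing down a fast general R(a)
    up front. Illustrative sketch only."""

    def __init__(self, overseer: Callable[[Hashable], float]):
        self.overseer = overseer
        self.cache: Dict[Hashable, float] = {}

    def __call__(self, action: Hashable) -> float:
        if action not in self.cache:       # consult the overseer lazily, once per action
            self.cache[action] = self.overseer(action)
        return self.cache[action]

# Usage with a stand-in overseer judgment:
R = OverseerReward(lambda a: 1.0 if a == "answer politely" else -0.1)
print(R("answer politely"), R("send spam"))   # 1.0 -0.1
```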

The Nonlinear Library: LessWrong Top Posts
Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems by abramdemski

The Nonlinear Library: LessWrong Top Posts

Play Episode Listen Later Dec 11, 2021 16:50


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Schelling Problems, published by abramdemski on the AI Alignment Forum. I previously claimed that most apparent Prisoner's Dilemmas are actually Stag Hunts. I now claim that they're Schelling Pub in practice. I conclude with some lessons for fighting Moloch. This post turned out especially dense with inferential leaps and unexplained terminology. If you're confused, try to ask in the comments and I'll try to clarify. Some ideas here are due to Tsvi Benson-Tilsen. The title of this post used to be Most Prisoner's Dilemmas are Stag Hunts; Most Stag Hunts are Battle of the Sexes. I'm changing it based on this comment. "Battle of the Sexes" is a game where a male and female (let's say Bob and Alice) want to hang out, but each of them would prefer to engage in gender-stereotyped behavior. For example, Bob wants to go to a football game, and Alice wants to go to a museum. The gender issues are distracting, and although it's the standard, the game isn't that well-known anyway, so sticking to the standard didn't buy me much (in terms of reader understanding). I therefore present to you, the Schelling Pub Game: Two friends would like to meet at the pub. In order to do so, they must make the same selection of pub (making this a Schelling-point game). However, they have different preferences about which pub to meet at. For example: Alice and Bob would both like to go to a pub this evening. There are two pubs: the Xavier, and the Yggdrasil. Alice likes the Xavier twice as much as the Yggdrasil. Bob likes the Yggdrasil twice as much as the Xavier. However, Alice and Bob also prefer to be with each other. Let's say they like being together ten times as much as they like being apart.
Schelling Pub Game payoff matrix (payoffs written Alice;Bob):
                 B's choice: X    B's choice: Y
A's choice: X        20;10             2;2
A's choice: Y         1;1             10;20
The important features of this game are: The Nash equilibria are all Pareto-optimal. There is no "individually rational agents work against each other" problem, like in prisoner's dilemma or even stag hunt. There are multiple equilibria, and different agents prefer different equilibria. Thus, realistically, agents may not end up in equilibrium at all -- because (in the single-shot game) they don't know which to choose, and because (in an iterated version of the game) they may make locally sub-optimal choices in order to influence the long-run behavior of other players. (Edited to add, based on comments:) Here's a summary of the central argument which, despite the lack of pictures, may be easier to understand. Most Prisoner's Dilemmas are actually iterated. Iterated games are a whole different game with a different action space (because you can react to history), a different payoff matrix (because you care about future payoffs, not just the present), and a different set of equilibria. It is characteristic of PD that players are incentivised to play away from the Pareto frontier; IE, no Pareto-optimal point is an equilibrium. This is not the case with iterated PD. It is characteristic of Stag Hunt that there is a Pareto-optimal equilibrium, but there is also another equilibrium which is far from optimal. This is also the case with iterated PD. So iterated PD resembles Stag Hunt.
However, it is furthermore true of iterated PD that there are multiple different Pareto-optimal equilibria, which benefit different players more or less. Also, if players don't successfully coordinate on one of these equilibria, they can end up in a worse overall state (such as mutual defection forever, due to playing grim-trigger strategies with mutually incompatible demands). This makes iterated PD resemble the Schelling Pub Game. In fact, the Folk Theorem suggests that most iterated games will resemble the Schelling Pub Game in this way. In a comment on The Sc...
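A quick sketch (mine, not from the post) confirms the features claimed for the payoff matrix above: checking best responses shows exactly two pure-strategy Nash equilibria, (X, X) and (Y, Y), both Pareto-optimal, with each player preferring a different one.

```python
# Pure-strategy Nash equilibria of the Schelling Pub Game above.
# Payoffs are (Alice, Bob); first coordinate is Alice's pub, second is Bob's.
payoffs = {
    ("X", "X"): (20, 10), ("X", "Y"): (2, 2),
    ("Y", "X"): (1, 1),   ("Y", "Y"): (10, 20),
}
pubs = ("X", "Y")

def is_nash(a, b):
    alice, bob = payoffs[(a, b)]
    # neither player can gain by unilaterally switching pubs
    no_alice_dev = all(payoffs[(a2, b)][0] <= alice for a2 in pubs)
    no_bob_dev = all(payoffs[(a, b2)][1] <= bob for b2 in pubs)
    return no_alice_dev and no_bob_dev

print([cell for cell in payoffs if is_nash(*cell)])   # [('X', 'X'), ('Y', 'Y')]
```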

The Nonlinear Library: Alignment Forum Top Posts
My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda by Chi Nguyen

The Nonlinear Library: Alignment Forum Top Posts

Play Episode Listen Later Dec 6, 2021 65:25


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Understanding of Paul Christiano's Iterated Amplification AI Safety Research Agenda, published by Chi Nguyen on the AI Alignment Forum. Crossposted from the EA forum. You can read this post as a google docs instead (IMO much better to read). This document aims to clarify the AI safety research agenda by Paul Christiano (IDA) and the arguments around how promising it is. Target audience: All levels of technical expertise. The less knowledge about IDA someone has, the more I expect them to benefit from the writeup. Writing policy: I aim to be as clear and concrete as possible and wrong rather than vague to identify disagreements and where I am mistaken. Things will err on the side of being too confidently expressed. Almost all footnotes are content and not references. Epistemic Status: The document is my best guess on IDA and might be wrong in important ways. I have not verified all of the content with somebody working on IDA. I spent ~4 weeks on this and have no prior background in ML, CS or AI safety. I wrote this document last summer (2019) as part of my summer research fellowship at FHI. I was planning to restructure, complete and correct it since but haven't gotten to it for a year, so decided to just publish it as it is. The document has not been updated, i.e. nothing that has been released since September 2019 is incorporated into this document. Paul Christiano generously reviewed the first third to a half of this summary. I added his comments verbatim in the document. Apologies for the loss of readability due to this. This doesn't imply he endorses any part of this document, especially the second half which he didn't get to review. Purpose of this document: Clarifying IDA. IDA is Paul Christiano's AI safety research agenda.[1] Christiano works at OpenAI, which is one of the main actors in AI safety, and IDA is considered by many the most complete[2] AI safety agenda. However, people who are not directly working on IDA are often confused about how exactly to understand the agenda. Clarifying IDA would make it more accessible for technical people to work on and easier to assess for nontechnical people who want to think about its implications. I believe that there are currently no resources on IDA that are both easy to understand and give a complete picture. Specifically, the current main resources are: the “Iterated Amplification” sequence which is a series of curated posts by Paul Christiano that can be quite difficult to understand, this post by Ajeya Cotra and this video by Robert Miles which are both easy to understand but limited in scope and don't provide many details, Alex Zhu's FAQ to IDA which clarifies important points but does not set them in context with the entire research agenda, an 80,000 Hours podcast with Paul Christiano which explains some intuitions behind IDA but is not comprehensive and is in speech form. This document aims to fill the gap and give a comprehensive and accessible overview of IDA. Summary: IDA in 7 sentences. IDA stands for Iterated Amplification and is a research agenda by Paul Christiano from OpenAI. IDA addresses the artificial intelligence (AI) safety problem, specifically the danger of creating a very powerful AI which leads to catastrophic outcomes.
IDA tries to prevent catastrophic outcomes by searching for a competitive AI that never intentionally optimises for something harmful to us and that we can still correct once it's running. IDA doesn't propose a specific implementation, but presents a rough AI design and a collection of thoughts on whether this design has the potential to create safe and powerful AI and what the details of that design could look like. The proposed AI design is to use a safe but slow way of scaling up an AI's capabilities, distill this into a faster but slightly weaker AI, which can be scal...
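The scale-up-then-distill design in the last sentence can be summarised schematically. The sketch below is mine, not the agenda's; `amplify` and `distill` are placeholders for the procedures the agenda leaves open (for example, a human consulting many copies of the current agent, and imitation learning or RL against that amplified system), and the numeric stand-ins exist only so the loop runs end to end.

```python
def ida_loop(initial_agent, amplify, distill, rounds):
    """Schematic only: repeatedly build a slower but more capable overseer out
    of many copies of the current agent (amplification), then train a fast new
    agent to approximate that overseer (distillation)."""
    agent = initial_agent
    for _ in range(rounds):
        overseer = amplify(agent)    # e.g. a human delegating to many copies of `agent`
        agent = distill(overseer)    # e.g. imitation learning or RL against the overseer
    return agent

# Toy stand-ins: capability grows each round, with some loss at distillation.
print(ida_loop(1.0, amplify=lambda a: 10 * a, distill=lambda o: 0.8 * o, rounds=3))   # 512.0
```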

The Nonlinear Library: Alignment Forum Top Posts
In Logical Time, All Games are Iterated Games by Abram Demski

The Nonlinear Library: Alignment Forum Top Posts

Play Episode Listen Later Dec 4, 2021 8:23


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: In Logical Time, All Games are Iterated Games, published by Abram Demski on the AI Alignment Forum. Logical Time. The main purpose of this post is to introduce the concept of logical time. The idea was mentioned in Scott's post, Bayesian Probability is for things that are Space-like Separated from You. It was first coined in a conference call with Daniel Demski, Alex Mennan, and perhaps Corey Staten and Evan Lloyd -- I don't remember exactly who was there, or who first used the term. Logical time is an informal concept which serves as an intuition pump for thinking about logical causality and phenomena in logical decision theory; don't take it too seriously. In particular, I am not interested in anybody trying to formally define logical time (aside from formal approaches to logical causality). Still, it seems like useful language for communicating decision-theory intuitions. Suppose you are playing chess, and you consider moving your bishop. You play out a hypothetical game which results in your loss in several moves. You decide not to move your bishop as a result of this. The hypothetical game resulting in your loss still exists within logic. You are logically later than it, in that the game you actually play depends on what happened in this hypothetical game. Suppose you're stuck in the desert in a Parfit's Hitchhiker problem. Paul Ekman is reading your face, deciding whether you're trustworthy. Paul Ekman does this based on experience, meaning that the computation which is you has a strong similarity with other computations. This similarity can be used to predict you fairly reliably, based on your facial expressions. What creates this similarity? According to the logical time picture, there is a logical fact much earlier in logical time, which governs the connection between facial expressions and behavior. To the extent that agents are trying to predict the future, they can be thought of as trying to place themselves later in logical time than the events which they're trying to predict. Two agents trying to predict each other are competing to see who can be later in logical time. This is not necessarily wise; in games like chicken, there is a sense in which you want to be earlier in logical time. Traditional game theory, especially Nash equilibria, relies on what amounts to loopy logical causality to allow each agent to be after the other in logical time. Whether this is bad depends on your view on logical time travel. Perhaps there is a sense in which logical time can be loopy, due to prediction (which is like logical time travel). Perhaps logical time can't be loopy, and this is a flaw in the models used by traditional game theory. Iterated Games. In logical time, all games are iterated games. An agent tries to forecast what happens in the decision problem it finds itself in by comparing it to similar decision problems which are small enough for it to look at. This puts it later in logical time than the small examples. "Similar games" includes the exact same game, but in which both players have had less time to think. This means it is appropriate to use iterated strategies. Agents who are aware of logical time can play tit-for-tat in single-shot Prisoner's Dilemma, and so can cooperate with each other. Iterated games are different in character than single-shot games.
The folk theorem shows that almost any outcome is possible in iterated play (in a certain sense). This makes it difficult to avoid very bad outcomes, such as nearly always defecting in the prisoner's dilemma, despite the availability of much better equilibria such as tit-for-tat. Intuitively, this is because (as Yoav Shoham et al point out in If multi-agent learning is the answer, what is the question?) it is difficult to separate "teaching behavior" from "learning behavior": as in the tit-for-tat s...
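To make the contrast between single-shot and iterated play concrete, here is a small sketch of my own (not from the post) of the iterated Prisoner's Dilemma: tit-for-tat sustains cooperation against itself, while always-defect drags average payoffs down to the bad equilibrium.

```python
# Iterated Prisoner's Dilemma sketch. 'C' = cooperate, 'D' = defect.
# Payoffs are (player 1, player 2) per round.
PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_hist, their_hist):
    return "C" if not their_hist else their_hist[-1]

def always_defect(my_hist, their_hist):
    return "D"

def play(s1, s2, rounds=100):
    h1, h2, score1, score2 = [], [], 0, 0
    for _ in range(rounds):
        a1, a2 = s1(h1, h2), s2(h2, h1)
        p1, p2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2)
        score1 += p1; score2 += p2
    return score1 / rounds, score2 / rounds

print(play(tit_for_tat, tit_for_tat))      # (3.0, 3.0): mutual cooperation every round
print(play(tit_for_tat, always_defect))    # (0.99, 1.04): exploited once, then mutual defection
print(play(always_defect, always_defect))  # (1.0, 1.0): the bad equilibrium
```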

AXRP - the AI X-risk Research Podcast
12 - AI Existential Risk with Paul Christiano

AXRP - the AI X-risk Research Podcast

Play Episode Listen Later Dec 2, 2021 169:36


Why would advanced AI systems pose an existential risk, and what would it look like to develop safer systems? In this episode, I interview Paul Christiano about his views of how AI could be so dangerous, what bad AI scenarios could look like, and what he thinks about various techniques to reduce this risk. Topics we discuss, and timestamps (due to mp3 compression, the timestamps may be tens of seconds off): 00:00:38 - How AI may pose an existential threat 00:13:36 - AI timelines 00:24:49 - Why we might build risky AI 00:33:58 - Takeoff speeds 00:51:33 - Why AI could have bad motivations 00:56:33 - Lessons from our current world 01:08:23 - "Superintelligence" 01:15:21 - Technical causes of AI x-risk 01:19:32 - Intent alignment 01:33:52 - Outer and inner alignment 01:43:45 - Thoughts on agent foundations 01:49:35 - Possible technical solutions to AI x-risk 01:49:35 - Imitation learning, inverse reinforcement learning, and ease of evaluation 02:00:34 - Paul's favorite outer alignment solutions 02:01:20 - Solutions researched by others 02:06:13 - Decoupling planning from knowledge 02:17:18 - Factored cognition 02:25:34 - Possible solutions to inner alignment 02:31:56 - About Paul 02:31:56 - Paul's research style 02:36:36 - Disagreements and uncertainties 02:46:08 - Some favorite organizations 02:48:21 - Following Paul's work
The transcript
Paul's blog posts on AI alignment
Material that we mention:
Cold Takes - The Most Important Century
Open Philanthropy reports on: Modeling the human trajectory; The computational power of the human brain; AI timelines (draft); Whether AI could drive explosive economic growth
Takeoff speeds
Superintelligence: Paths, Dangers, Strategies
Wei Dai on metaphilosophical competence: Two neglected problems in human-AI safety; The argument from philosophical difficulty; Some thoughts on metaphilosophy
AI safety via debate
Iterated distillation and amplification
Scalable agent alignment via reward modeling: a research direction
Learning the prior
Imitative generalisation (AKA 'learning the prior')
When is unaligned AI morally valuable?

Breaking Math Podcast
RR38: The Great Stratagem Heist (Game Theory: Iterated Elimination of Dominated Strategies)

Breaking Math Podcast

Play Episode Listen Later May 23, 2021 33:07 Very Popular


This is a rerun of one of our favorite episodes while we change our studio around. Game theory is all about decision-making and how it is impacted by choice of strategy, and a strategy is a decision that is influenced not only by the choice of the decision-maker, but also by the choices of one or more similar decision-makers. This episode will give an idea of the type of problem-solving that is used in game theory. So what is strict dominance? How can it help us solve some games? And why are The Obnoxious Seven wanted by the police? Distributed under a Creative Commons Attribution-ShareAlike 4.0 International License. For more information, visit CreativeCommons.org [Featuring: Sofía Baca; Diane Baca] --- This episode is sponsored by · Anchor: The easiest way to make a podcast. https://anchor.fm/app Support this podcast: https://anchor.fm/breakingmathpodcast/support
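For readers who want to see strict dominance in action, here is a short sketch (not from the episode) of iterated elimination of strictly dominated strategies; the 2x3 example game is a standard textbook illustration, not one discussed on the show.

```python
# Iterated elimination of strictly dominated strategies (IESDS) for a
# two-player normal-form game. Row player chooses U/D, column player L/M/R.
U_ROW = {("U", "L"): 1, ("U", "M"): 1, ("U", "R"): 0,
         ("D", "L"): 0, ("D", "M"): 0, ("D", "R"): 2}
U_COL = {("U", "L"): 0, ("U", "M"): 2, ("U", "R"): 1,
         ("D", "L"): 3, ("D", "M"): 1, ("D", "R"): 0}

def iesds(rows, cols, u_row, u_col):
    rows, cols = set(rows), set(cols)
    changed = True
    while changed:
        changed = False
        for r in list(rows):   # row player: is r strictly worse than some other surviving row?
            if any(all(u_row[(t, c)] > u_row[(r, c)] for c in cols) for t in rows - {r}):
                rows.discard(r); changed = True
        for c in list(cols):   # column player: is c strictly worse than some other surviving column?
            if any(all(u_col[(r, t)] > u_col[(r, c)] for r in rows) for t in cols - {c}):
                cols.discard(c); changed = True
    return rows, cols

print(iesds({"U", "D"}, {"L", "M", "R"}, U_ROW, U_COL))   # ({'U'}, {'M'})
```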

ASC Workshops
Topological partition functions and (iterated) integrals of modular forms

ASC Workshops

Play Episode Listen Later Apr 6, 2021 68:43


As a consequence of electric-magnetic duality, partition functions of four-dimensional gauge theories can be expressed in terms of modular forms in many cases. I will discuss new results for the modularity of topologically twisted partition functions of N=2 and N=4 supersymmetric theories, and in particular how these partition functions may involve (iterated) integrals of modular forms.

Two for Tea with Iona Italia and Helen Pluckrose
57 - Buster Benson - The Value of Arguing

Two for Tea with Iona Italia and Helen Pluckrose

Play Episode Listen Later Jun 8, 2020 74:28


You can find Buster’s book Why Are We Yelling: The Art of Productive Disagreement here: https://www.amazon.com/Why-Are-We-Yelling-hardcover/dp/0525540105. For my Letter exchange with Buster (one of my two favourite conversations I’ve ever had on the site) see: https://letter.wiki/conversation/225. For Buster’s Letter exchange with B. J. Campbell on guns, see: https://letter.wiki/conversation/129. Belief tracking: https://busterbenson.com/beliefs/ Write to Buster at Letter: https://letter.wiki/BusterBenson/conversations Write to me at Letter: https://letter.wiki/IonaItalia/conversations Further References: James Lindsay and Peter Boghossian, Impossible Conversations: A Very Practical Guide (2019). My review of Impossible Conversations for Areo magazine: https://areomagazine.com/2019/09/03/impossible-conversations/ Oliver Traldi’s review of Impossible Conversations for Arc Digital: https://arcdigital.media/the-political-pick-up-artists-6bcece72bb92 Timestamps: 2:49 Persuasion vs. changing your own mind 5:02 The value of honest, open conflict 9:11 The voice of power, the voice of avoidance, the voice of truth 13:12 How to make disagreements feel less threatening and more exploratory 17:42 Techniques for helping a group to have more productive disagreements 25:12 Conflicts of head, heart and hands 35:06 Disagreements with people you’re close to 44:22 Opinion tracking 51:40 Tribalism 52:45 Iterated prisoner’s dilemma games 55:10 Lessons in cooperation from Covid-19 1:01:46 How to improve good will.

Breaking Math Podcast
38: The Great Stratagem Heist (Game Theory: Iterated Elimination of Dominated Strategies)

Breaking Math Podcast

Play Episode Listen Later Apr 22, 2019 34:16 Very Popular


Game theory is all about decision-making and how it is impacted by choice of strategy, and a strategy is a decision that is influenced not only by the choice of the decision-maker, but also by the choices of one or more similar decision-makers. This episode will give an idea of the type of problem-solving that is used in game theory. So what is strict dominance? How can it help us solve some games? And why are The Obnoxious Seven wanted by the police? --- This episode is sponsored by · Anchor: The easiest way to make a podcast. https://anchor.fm/app Support this podcast: https://anchor.fm/breakingmathpodcast/support

MCMP – Mathematical Philosophy (Archive 2011/12)
Belief Dynamics under Iterated Revision: Cycles, Fixed Points and Truth-tracking

MCMP – Mathematical Philosophy (Archive 2011/12)

Play Episode Listen Later Apr 20, 2019 79:47


Sonja Smets (University of Groningen) gives a talk at the MCMP Colloquium titled "Belief Dynamics under Iterated Revision: Cycles, Fixed Points and Truth-tracking". Abstract: We investigate the long-term behavior of processes of learning by iterated belief-revision with new truthful information. In the case of higher-order doxastic sentences, the iterated revision can even be induced by repeated learning of the same sentence (which conveys new truths at each stage by referring to the agent's own current beliefs at that stage). For a number of belief-revision methods (conditioning, lexicographic revision and minimal revision), we investigate the conditions in which iterated belief revision with truthful information stabilizes: while the process of model-changing by iterated conditioning always leads eventually to a fixed point (and hence all doxastic attitudes, including conditional beliefs, strong beliefs, and any form of "knowledge", eventually stabilize), this is not the case for other belief-revision methods. We show that infinite revision cycles exist (even when the initial model is finite and even in the case of repeated revision with one single true sentence), but we also give syntactic and semantic conditions ensuring that beliefs stabilize in the limit. Finally, we look at the issue of convergence to truth, giving both sufficient conditions ensuring that revision stabilizes on true beliefs, and (stronger) conditions ensuring that the process stabilizes on "full truth" (i.e. beliefs that are both true and complete). This talk is based on joint work with A. Baltag.
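The Bayesian analogue mentioned in the abstract (repeated updating on truthful information converging to a fixed point, as in the Gaifman-Snir theorem) can be illustrated with a toy simulation of my own; it covers only probabilistic conditioning, not the qualitative revision methods the talk analyses.

```python
# Toy illustration: iterated Bayesian conditioning on truthful observations
# drives the posterior to a fixed point concentrated on the true hypothesis.
import random

random.seed(0)
hypotheses = {"fair": 0.5, "biased": 0.8}     # P(heads | hypothesis)
prior = {"fair": 0.5, "biased": 0.5}
true_bias = hypotheses["biased"]              # the world is actually 'biased'

for _ in range(200):
    heads = random.random() < true_bias       # a truthful observation
    posterior = {h: prior[h] * (p if heads else 1 - p) for h, p in hypotheses.items()}
    total = sum(posterior.values())
    prior = {h: v / total for h, v in posterior.items()}

print(prior)   # nearly all mass on 'biased' -- the beliefs have stabilized
```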

LMU Analytical Methods for Lawyers - Lehrstuhl für Bürgerliches Recht, Deutsches, Europ. und Int. Unternehmensrecht

Game Theory - Decision theory vs. game theory; A typical legal application of game theory; Some applications of game theory in legal settings; Representation of games; Normal-form games; The prisoner's dilemma; Prisoner's dilemma in litigation; Solution concepts for the prisoner's dilemma; Discoordination game: Matching pennies; A clever takeover bid; Dominant strategy equilibrium; Iterated games; The iterated prisoner's dilemma; Social benefits or social losses?; Achieving the cooperative outcome; N-person prisoner's dilemmata and common goods; Coordination games; Extensive-form games; The ultimatum game.

Learning Machines 101
LM101-066: How to Solve Constraint Satisfaction Problems using MCMC Methods (Rerun)

Learning Machines 101

Play Episode Listen Later Jul 17, 2017 34:00


In this episode of Learning Machines 101 (www.learningmachines101.com) we discuss how to solve constraint satisfaction inference problems where knowledge is represented as a large unordered collection of complicated probabilistic constraints among a collection of variables. The goal of the inference process is to infer the most probable values of the unobservable variables given the observable variables. Specifically, Monte Carlo Markov Chain (MCMC) methods are discussed.
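As a rough illustration of the idea (mine, not from the episode), the sketch below runs a Metropolis-style chain over a tiny constraint problem: variables are node colors, the constraints say adjacent nodes must differ, and moves that violate fewer constraints are favored, so the chain tends to settle on a satisfying assignment.

```python
import math
import random

random.seed(1)
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]    # small graph to 3-color
colors = [0, 1, 2]
assign = [random.choice(colors) for _ in range(4)]

def violations(a):
    # number of constraints broken: edges whose endpoints share a color
    return sum(a[u] == a[v] for u, v in edges)

beta = 2.0                                           # inverse temperature
best = list(assign)
for _ in range(5000):
    node, new = random.randrange(4), random.choice(colors)
    proposal = list(assign)
    proposal[node] = new
    delta = violations(proposal) - violations(assign)
    if delta <= 0 or random.random() < math.exp(-beta * delta):
        assign = proposal                            # accept the Metropolis move
    if violations(assign) < violations(best):
        best = list(assign)

print(best, "violations:", violations(best))         # typically 0 violated constraints
```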

The Bayesian Conspiracy
13 – Game Theory

The Bayesian Conspiracy

Play Episode Listen Later Jul 20, 2016 97:15


It’s about modeling agents making decisions, not game design.

Probabilistic Systems Analysis and Applied Probability (2013)

In this lecture, the professor discussed conditional expectation and sum of a random number of random variables.

MCMP – Philosophy of Science
Convergence of Iterated Belief Updates

MCMP – Philosophy of Science

Play Episode Listen Later Jun 29, 2015 55:00


Berna Kilinç (Boğaziçi University) gives a talk at the MCMP Colloquium (3 June, 2015) titled "Convergence of Iterated Belief Updates". Abstract: One desideratum on belief upgrade operations is that their iteration is truth-tropic, either on finite or infinite streams of reliable information. Under special circumstances repeated Bayesian updating satisfies this desideratum as shown for instance by the Gaifman and Snir theorem. There are a few analogous results in recent research within dynamic epistemic logic: Baltag et al establish the decidability of propositions for some but not all upgrade operations on finite epistemic spaces. In this talk further convergence results will be established for qualitative stable belief.

Probabilistic Systems Analysis and Applied Probability
Lecture 12: Iterated Expectations

Probabilistic Systems Analysis and Applied Probability

Play Episode Listen Later Jun 29, 2015 47:53


In this lecture, the professor discussed conditional expectation and sum of a random number of random variables.
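The headline fact about the sum of a random number of random variables, E[S] = E[N] E[X] for S = X_1 + ... + X_N with N independent of the X_i, follows from iterated expectations and is easy to sanity-check numerically. The little Monte Carlo below is my own check, not material from the lecture; the particular distributions are arbitrary choices.

```python
import random

random.seed(42)
trials = 200_000
total = 0.0
for _ in range(trials):
    n = sum(random.random() < 0.4 for _ in range(10))        # N ~ Binomial(10, 0.4), so E[N] = 4
    total += sum(random.expovariate(2.0) for _ in range(n))  # X_i ~ Exponential(rate 2), so E[X] = 0.5

print(round(total / trials, 3), "vs", 4 * 0.5)               # E[S] ~= E[N] * E[X] = 2.0
```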

SAGE Life & Biomedical Sciences
TIA March 2013: Perception of Pure Tones and Iterated Rippled Noise for Normal Hearing and Cochlear Implant Users

SAGE Life & Biomedical Sciences

Play Episode Listen Later Feb 12, 2015 19:32


A conversation with Richard Penninger and Dr. Charles Limb about their study on perception of pure tones and iterated rippled noise for normal hearing and Cochlear Implant users. Read the full article here.

Learning Machines 101
LM101-021: How to Solve Large Complex Constraint Satisfaction Problems (Monte Carlo Markov Chain)

Learning Machines 101

Play Episode Listen Later Jan 26, 2015 35:11


We discuss how to solve constraint satisfaction inference problems where knowledge is represented as a large unordered collection of complicated probabilistic constraints among a collection of variables. The goal of the inference process is to infer the most probable values of the unobservable variables given the observable variables. Please visit: www.learningmachines101.com to obtain transcripts of this podcast and download free machine learning software!

Grothendieck-Teichmüller Groups, Deformation and Operads
A shuffle product formula for generalized iterated integrals

Grothendieck-Teichmüller Groups, Deformation and Operads

Play Episode Listen Later Apr 10, 2013 53:48


Joyner, S (Brandeis University) Tuesday 09 April 2013, 15:00-16:00

Grothendieck-Teichmüller Groups, Deformation and Operads
Iterated integrals and de Rham fundamental group of P^1 minus 3 points

Grothendieck-Teichmüller Groups, Deformation and Operads

Play Episode Listen Later Mar 13, 2013 94:00


Brown, F (IHES) Wednesday 06 March 2013, 10:30-12:00

SAGE Podcast
Perception of Pure Tones and Iterated Rippled Noise for Normal Hearing and Cochlear Implant Users

SAGE Podcast

Play Episode Listen Later Mar 1, 2013 19:32


MCMP – Mathematical Philosophy (Archive 2011/12)

Jon Erling Litland (Oslo) gives a talk at the Workshop on Groundedness (26-27 October, 2012) titled "Pure Logic of Iterated Ground". Abstract: The presently existing logics of ground have not had anything to say about iterated grounding claims, that is, claims of the form: "A grounds that (B grounds C)". I develop a pure logic of iterated ground providing a systematic account of such iterated grounding claims. The logic is developed as a Prawitz style natural deduction system; the grounding operators are provided with both introduction and elimination rules, and normalization can be proved. The resulting logic is a conservative extension of Kit Fine's Pure Logic of Ground.

Limit theorems and applications (SAMSOS, 2008)
06 - Weighted power variations of fractional and iterated Brownian motions - Ivan NOURDIN

Limit theorems and applications (SAMSOS, 2008)

Play Episode Listen Later Jan 8, 2008 46:48


Ivan NOURDIN. Université Paris 6. Listen to the talk: audio available in mp3 format. Duration: 47 min.