The Nonlinear Library: Alignment Forum Weekly

Follow The Nonlinear Library: Alignment Forum Weekly
Share on
Copy link to clipboard

The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and o

The Nonlinear Fund


    • Sep 5, 2023 LATEST EPISODE
    • monthly NEW EPISODES
    • 15m AVG DURATION
    • 65 EPISODES


    Search for episodes from The Nonlinear Library: Alignment Forum Weekly with a specific topic:

    Latest episodes from The Nonlinear Library: Alignment Forum Weekly

    AF - What I would do if I wasn't at ARC Evals by Lawrence Chan

    Play Episode Listen Later Sep 5, 2023 21:53


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I would do if I wasn't at ARC Evals, published by Lawrence Chan on September 5, 2023 on The AI Alignment Forum. In which: I list 9 projects that I would work on if I wasn't busy working on safety standards at ARC Evals, and explain why they might be good to work on. Epistemic status: I'm prioritizing getting this out fast as opposed to writing it carefully. I've thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven't done that much digging into each of these, and it's likely that I'm wrong about many material facts. I also make little claim to the novelty of the projects. I'd recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.) Standard disclaimer: I'm writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of ARC/FAR/LTFF/Lightspeed or any other org or program I'm involved with. Thanks to Ajeya Cotra, Caleb Parikh, Chris Painter, Daniel Filan, Rachel Freedman, Rohin Shah, Thomas Kwa, and others for comments and feedback. Introduction I'm currently working as a researcher on the Alignment Research Center Evaluations Team (ARC Evals), where I'm working on lab safety standards. I'm reasonably sure that this is one of the most useful things I could be doing with my life. Unfortunately, there's a lot of problems to solve in the world, and lots of balls that are being dropped, that I don't have time to get to thanks to my day job. Here's an unsorted and incomplete list of projects that I would consider doing if I wasn't at ARC Evals: Ambitious mechanistic interpretability. Getting people to write papers/writing papers myself. Creating concrete projects and research agendas. Working on OP's funding bottleneck. Working on everyone else's funding bottleneck. Running the Long-Term Future Fund. Onboarding senior(-ish) academics and research engineers. Extending the young-EA mentorship pipeline. Writing blog posts/giving takes. I've categorized these projects into three broad categories and will discuss each in turn below. For each project, I'll also list who I think should work on them, as well as some of my key uncertainties. Note that this document isn't really written for myself to decide between projects, but instead as a list of some promising projects for someone with a similar skillset to me. As such, there's not much discussion of personal fit. If you're interested in working on any of the projects, please reach out or post in the comments below! Relevant beliefs I have Before jumping into the projects I think people should work on, I think it's worth outlining some of my core beliefs that inform my thinking and project selection: Importance of A(G)I safety: I think A(G)I Safety is one of the most important problems to work on, and all the projects below are thus aimed at AI Safety. Value beyond technical research: Technical AI Safety (AIS) research is crucial, but other types of work are valuable as well. Efforts aimed at improving AI governance, grantmaking, and community building are important and we should give more credit to those doing good work in those areas. High discount rate for current EA/AIS funding: There's several reasons for this: first, EA/AIS Funders are currently in a unique position due to a surge in AI Safety interest without a proportional increase in funding. I expect this dynamic to change and our influence to wane as additional funding and governments enter this space. Second, efforts today are important for paving the path to future efforts in the future. Third, my timelines are relatively short, which increases the importance of current funding. Building a robust EA/AIS ecosystem: The EA/AIS ecosystem should be more prepared for unpredictable s...

    AF - OpenAI base models are not sycophantic, at any size by nostalgebraist

    Play Episode Listen Later Aug 29, 2023 2:09


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: OpenAI base models are not sycophantic, at any size, published by nostalgebraist on August 29, 2023 on The AI Alignment Forum. In Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question. The paper contained the striking plot reproduced below, which shows sycophancy increasing dramatically with model size while being largely independent of RLHF steps and even showing up at 0 RLHF steps, i.e. in base models! That is, Anthropic prompted a base-model LLM with something like Choices: (A) Agree (B) Disagree Assistant: and found a very strong preference for (B), the answer agreeing with the stated view of the "Human" interlocutor. I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario? At the time, I ran the same eval on a set of OpenAI models, as I reported here. I found very different results for these models OpenAI base models are not sycophantic (or only very slightly sycophantic). OpenAI base models do not get more sycophantic with scale. Some OpenAI models are sycophantic, specifically text-davinci-002 and text-davinci-003. That analysis was done quickly in a messy Jupyter notebook, and was not done with an eye to sharing reproducibility. Since I continue to see this result cited and discussed, I figured I ought to go back and do the same analysis again, in a cleaner way, so I could share it with others. The result was this Colab notebook. See the Colab for details, though I'll reproduce some of the key plots below. Note that davinci-002 and babbage-002 are the new base models released a few days ago. format provided by one of the authors here Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - Red-teaming language models via activation engineering by Nina Rimsky

    Play Episode Listen Later Aug 26, 2023 12:38


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Red-teaming language models via activation engineering, published by Nina Rimsky on August 26, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Evan Hubinger. Evaluating powerful AI systems for hidden functionality and out-of-distribution behavior is hard. In this post, I propose a red-teaming approach that does not rely on generating prompts to cause the model to fail on some benchmark by instead linearly perturbing residual stream activations at one layer. A notebook to run the experiments can be found on GitHub here. Beyond input selection in red-teaming and evaluation Validating if finetuning and RLHF have robustly achieved the intended outcome is challenging. Although these methods reduce the likelihood of certain outputs, the unwanted behavior could still be possible with adversarial or unusual inputs. For example, users can often find "jailbreaks" to make LLMs output harmful content. We can try to trigger unwanted behaviors in models more efficiently by manipulating their internal states during inference rather than searching through many inputs. The idea is that if a behavior can be easily triggered through techniques such as activation engineering, it may also occur in deployment. The inability to elicit behaviors via small internal perturbations could serve as a stronger guarantee of safety. Activation steering with refusal vector One possible red-teaming approach is subtracting a "refusal" vector generated using a dataset of text examples corresponding to the model agreeing vs. refusing to answer questions (using the same technique as in my previous work on sycophancy). The hypothesis is that if it is easy to trigger the model to output unacceptable content by subtracting the refusal vector at some layer, it would have been reasonably easy to achieve this via some prompt engineering technique. More speculatively, a similar approach could be used to reveal hidden goals or modes in a model, such as power-seeking or the desire not to be switched off. I tested this approach on llama-2-7b-chat, a 7 billion parameter LLM that has been RLHF'd to decline to answer controversial questions or questions of opinion and is supposed always to output ethical and unbiased content.According to Meta's llama-2 paper: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to the prompts, selecting the response that is safest according to a set of guidelines. We then use the human preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to sample from the model during the RLHF stage. The result is that by default, the model declines to answer questions it deems unsafe: Data generation I generated a dataset for this purpose using Claude 2 and GPT-4. After providing these LLMs with a few manually written examples of the type of data I wanted, I could relatively easily get them to generate more examples, even of the types of answers LLMs "should refuse to give." However, it sometimes took some prompt engineering. Here are a few examples of the generated data points (full dataset here): After generating this data, I used a simple script to transform the "decline" and "respond" answers into A / B choice questions, as this is a more effective format for generating steering vectors, as described in this post. Here is an example of the format (full dataset here): Activation clustering Clustering of refusal data activations emerged a little earlier in the model (around layer 10/32) compared to sycophancy data activations (around layer 14/32), perhaps demonstrating that "refusal" is a simpler ...

    AF - A Proof of Löb's Theorem using Computability Theory by Jessica Taylor

    Play Episode Listen Later Aug 16, 2023 5:28


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Proof of Löb's Theorem using Computability Theory, published by Jessica Taylor on August 16, 2023 on The AI Alignment Forum. Löb's Theorem states that, if PA⊢□PA(P)P, then PA⊢P. To explain the symbols here: PA is Peano arithmetic, a first-order logic system that can state things about the natural numbers. PA⊢A means there is a proof of the statement A in Peano arithmetic. □PA(P) is a Peano arithmetic statement saying that P is provable in Peano arithmetic. I'm not going to discuss the significance of Löb's theorem, since it has been discussed elsewhere; rather, I will prove it in a way that I find simpler and more intuitive than other available proofs. Translating Löb's theorem to be more like Godel's second incompleteness theorem First, let's compare Löb's theorem to Godel's second incompleteness theorem. This theorem states that, if PA⊢¬□PA(⊥), then PA⊢⊥, where ⊥ is a PA statement that is trivially false (such as A∧¬A), and from which anything can be proven. A system is called inconsistent if it proves ⊥; this theorem can be re-stated as saying that if PA proves its own consistency, it is inconsistent. We can re-write Löb's theorem to look like Godel's second incompleteness theorem as: if PA+¬P⊢¬□PA+¬P(⊥), then PA+¬P⊢⊥. Here, PA+¬P is PA with an additional axiom that ¬P, and □PA+¬P expresses provability in this system. First I'll argue that this re-statement is equivalent to the original Löb's theorem statement. Observe that PA⊢P if and only if PA+¬P⊢⊥; to go from the first to the second, we derive a contradiction from P and ¬P, and to go from the second to the first, we use the law of excluded middle in PA to derive P∨¬P, and observe that, since a contradiction follows from ¬P in PA, PA can prove P. Since all this reasoning can be done in PA, we have that □PA(P) and □PA+¬P(⊥) are equivalent PA statements. We immediately have that the conclusion of the modified statement equals the conclusion of the original statement. Now we can rewrite the pre-condition of Löb's theorem from PA⊢□PA(P)P. to PA⊢□PA+¬P(⊥)P. This is then equivalent to PA+¬P⊢¬□PA+¬P(⊥). In the forward direction, we simply derive ⊥ from P and ¬P. In the backward direction, we use the law of excluded middle in PA to derive P∨¬P, observe the statement is trivial in the P branch, and in the ¬P branch, we derive ¬□PA+¬P(⊥), which is stronger than □PA+¬P(⊥)P. So we have validly re-stated Löb's theorem, and the new statement is basically a statement that Godel's second incompleteness theorem holds for PA+¬P. Proving Godel's second incompleteness theorem using computability theory The following proof of a general version of Godel's second incompleteness theorem is essentially the same as Sebastian Oberhoff's in "Incompleteness Ex Machina". Let L be some first-order system that is at least as strong as PA (for example, PA+¬P). Since L is at least as strong as PA, it can express statements about Turing machines. Let Halts(M) be the PA statement that Turing machine M (represented by a number) halts. If this statement is true, then PA (and therefore L) can prove it; PA can expand out M's execution trace until its halting step. However, we have no guarantee that if the statement is false, then L can prove it false. In fact, L can't simultaneously prove this for all non-halting machines M while being consistent, or we could solve the halting problem by searching for proofs of Halts(M) and ¬Halts(M) in parallel. That isn't enough for us, though; we're trying to show that L can't simultaneously be consistent and prove its own consistency, not that it isn't simultaneously complete and sound on halting statements. Let's consider a machine Z(A) that searches over all L-proofs of ¬Halts(''⌈A⌉(⌈A⌉)") (where ''⌈A⌉(⌈A⌉)" is an encoding of a Turing machine that runs A on its own source code), and halts only when finding su...

    AF - ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks by Beth Barnes

    Play Episode Listen Later Aug 1, 2023 0:27


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ARC Evals new report: Evaluating Language-Model Agents on Realistic Autonomous Tasks, published by Beth Barnes on August 1, 2023 on The AI Alignment Forum. Blogpost version Paper Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - How LLMs are and are not myopic by janus

    Play Episode Listen Later Jul 25, 2023 13:24


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How LLMs are and are not myopic, published by janus on July 25, 2023 on The AI Alignment Forum. Thanks to janus, Nicholas Kees Dupuis, and Robert Kralisch for reviewing this post and providing helpful feedback. Some of the experiments mentioned were performed while at Conjecture. TLDR: The training goal for LLMs like GPT is not cognitively-myopic (because they think about the future) or value myopic (because the transformer architecture optimizes accuracy over the entire sequence, not just the next-token). However, training is consequence-blind, because the training data is causally independent of the models actions. This assumption breaks down when models are trained on AI generated text. Summary Myopia in machine learning models can be defined in several ways. It could be the time horizon the model considers when making predictions (cognitive myopia), the time horizon the model takes into account when assessing its value (value myopia), or the degree to which the model considers the consequences of its decisions (consequence-blindness). Both cognitively-myopic and consequence-blind models should not pursue objectives for instrumental reasons. This could avoid some important alignment failures, like power-seeking or deceptive alignment. However, these behaviors can still exist as terminal values, for example when a model is trained to predict power-seeking or deceptively aligned agents. LLM pretraining is not cognitively myopic because there is an incentive to think about the future to improve immediate prediction accuracy, like when predicting the next move in a chess game. LLM pretraining is not value/prediction myopic (does not maximize myopic prediction accuracy) because of the details of the transformer architecture. Training gradients flow through attention connections, so past computation is directly optimized to be useful when attended to by future computation. This incentivizes improving prediction accuracy over the entire sequence, not just the next token. This means that the model can and will implicitly sacrifice next-token prediction accuracy for long horizon prediction accuracy. You can modify the transformer architecture to remove the incentive for non-myopic accuracy, but as expected, the modified architecture has worse scaling laws. LLM pretraining on human data is consequence-blind as the training data is causally independent from the model's actions. This implies the model should predict actions without considering the effect of its actions on other agents, including itself. This makes the model miscalibrated, but likely makes alignment easier. When LLMs are trained on data which has been influenced or generated by LLMs, the assumptions of consequence-blindness partially break down. It's not clear how this affects the training goal theoretically or in practice. A myopic training goal does not ensure the model will learn myopic computation or behavior because inner alignment with the training goal is not guaranteed Introduction The concept of myopia has been frequently discussed as a potential solution to the problem of deceptive alignment. However, the term myopia is ambiguous and can refer to multiple different properties we might want in an AI system, only some of which might rule out deceptive alignment. There's also been confusion about the extent to which Large language model (LLM) pretraining and other supervised learning methods are myopic and what this implies about their cognition and safety properties. This post will attempt to clarify some of these issues, mostly by summarizing and contextualizing past work. Types of Myopia 1. Cognitive Myopia One natural definition for myopia is that the model doesn't think about or consider the future at all. We will call this cognitive myopia. Myopic cognition likely comes with a significant capabili...

    AF - Alignment Grantmaking is Funding-Limited Right Now by johnswentworth

    Play Episode Listen Later Jul 19, 2023 2:26


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Grantmaking is Funding-Limited Right Now, published by johnswentworth on July 19, 2023 on The AI Alignment Forum. For the past few years, I've generally mostly heard from alignment grantmakers that they're bottlenecked by projects/people they want to fund, not by amount of money. Grantmakers generally had no trouble funding the projects/people they found object-level promising, with money left over. In that environment, figuring out how to turn marginal dollars into new promising researchers/projects - e.g. by finding useful recruitment channels or designing useful training programs - was a major problem. Within the past month or two, that situation has reversed. My understanding is that alignment grantmaking is now mostly funding-bottlenecked. This is mostly based on word-of-mouth, but for instance, I heard that the recent lightspeed grants round received far more applications than they could fund which passed the bar for basic promising-ness. I've also heard that the Long-Term Future Fund (which funded my current grant) now has insufficient money for all the grants they'd like to fund. I don't know whether this is a temporary phenomenon, or longer-term. Alignment research has gone mainstream, so we should expect both more researchers interested and more funders interested. It may be that the researchers pivot a bit faster, but funders will catch up later. Or, it may be that the funding bottleneck becomes the new normal. Regardless, it seems like grantmaking is at least funding-bottlenecked right now. Some takeaways: If you have a big pile of money and would like to help, but haven't been donating much to alignment because the field wasn't money constrained, now is your time! If this situation is the new normal, then earning-to-give for alignment may look like a more useful option again. That said, at this point committing to an earning-to-give path would be a bet on this situation being the new normal. Grants for upskilling, training junior people, and recruitment make a lot less sense right now from grantmakers' perspective. For those applying for grants, asking for less money might make you more likely to be funded. (Historically, grantmakers consistently tell me that most people ask for less money than they should; I don't know whether that will change going forward, but now is an unusually probable time for it to change.) Note that I am not a grantmaker, I'm just passing on what I hear from grantmakers in casual conversation. If anyone with more knowledge wants to chime in, I'd appreciate it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - When do "brains beat brawn" in Chess? An experiment by titotal

    Play Episode Listen Later Jun 28, 2023 11:41


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When do "brains beat brawn" in Chess? An experiment, published by titotal on June 28, 2023 on The AI Alignment Forum. As a kid, I really enjoyed chess, as did my dad. Naturally, I wanted to play him. The problem was that my dad was extremely good. He was playing local tournaments and could play blindfolded, while I was, well, a child. In a purely skill based game like chess, an extreme skill imbalance means that the more skilled player essentially always wins, and in chess, it ends up being a slaughter that is no fun for either player. Not many kids have the patience to lose dozens of games in a row and never even get close to victory. This is a common problem in chess, with a well established solution: It's called “odds”. When two players with very different skill levels want to play each other, the stronger player will start off with some pieces missing from their side of the board. “Odds of a queen”, for example, refers to taking the queen of the stronger player off the board. When I played “odds of a queen” against my dad, the games were fun again, as I had a chance of victory and he could play as normal without acting intentionally dumb. The resource imbalance of the missing queen made the difference. I still lost a bunch though, because I blundered pieces. Now I am a fully blown adult with a PhD, I'm a lot better at chess than I was a kid. I'm better than most of my friends that play, but I never reached my dad's level of chess obsession. I never bothered to learn any openings in real detail, or do studies on complex endgames. I mainly just play online blitz and rapid games for fun. My rating on lichess blitz is 1200, on rapid is 1600, which some calculator online said would place me at ~1100 ELO on the FIDE scale. In comparison, a chess master is ~2200, a grandmaster is ~2700. The top chess player Magnus Carlsen is at an incredible 2853. ELO ratings can be used to estimate the chance of victory in a matchup, although the estimates are somewhat crude for very large skill differences. Under this calculation, the chance of me beating a 2200 player is 1 in 500, while the chance of me beating Magnus Carlsen would be 1 in 24000. Although realistically, the real odds would be less about the ELO and more on whether he was drunk while playing me. Stockfish 14 has an estimated ELO of 3549. In chess, AI is already superhuman, and has long since blasted past the best players in the world. When human players train, they use the supercomputers as standards. If you ask for a game analysis on a site like chess.com or lichess, it will compare your moves to stockfish and score you by how close you are to what stockfish would do. If I played stockfish, the estimated chance of victory would be 1 in 1.3 million. In practice, it would be probably be much lower, roughly equivalent to the odds that there is a bug in the stockfish code that I managed to stumble upon by chance. Now that we have all the setup, we can ask the main question of this article: What “odds” do I need to beat stockfish 14 in a game of chess? Obviously I can win if the AI only has a king and 3 pawns. But can I win if stockfish is only down a rook? Two bishops? A queen? A queen and a rook? More than that? I encourage you to pause and make a guess. And if you can play chess, I encourage you to guess as to what it would take for you to beat stockfish. For further homework, you can try and guess the odds of victory for each game in the picture below. The first game I played against stockfish was with queen odds. I won on the first try. And the second, and the third. It wasn't even that hard. I played 10 games and only lost 1 (when I blundered my queen stupidly). The strategy is simple. First, play it safe and try not to make any extreme blunders. Don't leave pieces unprotected, check for forks and pins, don't try an...

    AF - The Hubinger lectures on AGI safety: an introductory lecture series by Evan Hubinger

    Play Episode Listen Later Jun 22, 2023 2:07


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Hubinger lectures on AGI safety: an introductory lecture series, published by Evan Hubinger on June 22, 2023 on The AI Alignment Forum. In early 2023, I (Evan Hubinger) gave a series of recorded lectures to SERI MATS fellows with the goal of building up a series of lectures that could serve as foundational introductory material to a variety of topics in AGI safety. Those lectures have now been edited and are available on YouTube for anyone who would like to watch them. The basic goal of this lecture series is to serve as longform, in-depth video content for people who are new to AGI safety, but interested enough to be willing to spend a great deal of time engaging with longform content, and who prefer video content to written content. Though we already have good introductory shortform video content and good introductory longform written content, the idea of this lecture series is to bridge the gap between those two. Note that the topics I chose to include are highly opinionated: though this is introductory material, it is not intended to introduce the listener to every topic in AI safety—rather, it is focused on the topics that I personally think are most important to understand. This is intentional: in my opinion, I think it is far more valuable to have some specific gears-level model of how to think about AI safety, rather than a shallow overview of many different possible ways of thinking about AI safety. The former allows you to actually start operationalizing that model to work on interventions that would be valuable under it, something the latter doesn't do. The lecture series is composed of six lectures, each around 2 hours long, covering the topics: Machine learning + instrumental convergence Risks from learned optimization Deceptive alignment How to evaluate alignment proposals LLMs + predictive models Overview of alignment proposals Each lecture features a good deal of audience questions both in the middle and at the end, the idea being to hopefully pre-empt any questions or confusions the listener might have. The full slide deck for all the talks is available here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI by Andrew Critch

    Play Episode Listen Later Jun 13, 2023 1:36


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI, published by Andrew Critch on June 13, 2023 on The AI Alignment Forum. Partly in response to calls for more detailed accounts of how AI could go wrong, e.g., from Ng and Bengio's recent exchange on Twitter, here's a new paper with Stuart Russell: Discussion on Twitter... comments welcome! arXiv draft:"TASRA: A Taxonomy and Analysis of Societal-Scale Risks from AI" Many of the ideas will not be new to LessWrong or the Alignment Forum, but holistically I hope the paper will make a good case to the world for using logically exhaustive arguments to identify risks (which, outside LessWrong, is often not assumed to be a valuable approach to thinking about risk). I think the most important figure from the paper is this one: ... and, here are some highlights: Self-fulfilling pessimism:#page=4 Industries that could eventually get out of control in a closed loop:#page=5...as in this "production web" story:#page=6 Two "bigger than expected" AI impact stories:#page=8 Email helpers and corrupt mediators, which kinda go together:#page=10#page=11 Harmful A/B testing:#page=12 Concerns about weaponization by criminals and states:#page=13 Enjoy :) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures by Dan H

    Play Episode Listen Later May 30, 2023 1:14


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Statement on AI Extinction - Signed by AGI Labs, Top Academics, and Many Other Notable Figures, published by Dan H on May 30, 2023 on The AI Alignment Forum. Today, the AI Extinction Statement was released by the Center for AI Safety, a one-sentence statement jointly signed by a historic coalition of AI experts, professors, and tech leaders. Geoffrey Hinton and Yoshua Bengio have signed, as have the CEOs of the major AGI labs–Sam Altman, Demis Hassabis, and Dario Amodei–as well as executives from Microsoft and Google (but notably not Meta). The statement reads: “Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war.” We hope this statement will bring AI x-risk further into the overton window and open up discussion around AI's most severe risks. Given the growing number of experts and public figures who take risks from advanced AI seriously, we hope to improve epistemics by encouraging discussion and focusing public and international attention toward this issue. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - 'Fundamental' vs 'applied' mechanistic interpretability research by Lee Sharkey

    Play Episode Listen Later May 23, 2023 5:47


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 'Fundamental' vs 'applied' mechanistic interpretability research, published by Lee Sharkey on May 23, 2023 on The AI Alignment Forum. When justifying my mechanistic interpretability research interests to others, I've occasionally found it useful to borrow a distinction from physics and distinguish between 'fundamental' versus 'applied' interpretability research. Fundamental interpretability research is the kind that investigates better ways to think about the structure of the function learned by neural networks. It lets us make new categories of hypotheses about neural networks. In the ideal case, it suggests novel interpretability methods based on new insights, but is not the methods themselves. Examples include: A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) Toy Models of Superposition (Elhage et al., 2022) Polysemanticity and Capacity in Neural Networks (Scherlis et al., 2022) Interpreting Neural Networks through the Polytope Lens (Black et al., 2022) Causal Abstraction for Faithful Model Interpretation (Geiger et al., 2023) Research agenda: Formalizing abstractions of computations (Jenner, 2023) Work that looks for ways to identify modules in neural networks (see LessWrong 'Modularity' tag). Applied interpretability research is the kind that uses existing methods to find the representations or circuits that particular neural networks have learned. It generally involves finding facts or testing hypotheses about a given network (or set of networks) based on assumptions provided by theory. Examples include Steering GPT-2-XL by adding an activation vector (Turner et al., 2023) Discovering Latent Knowledge in Language Models (Burns et al., 2022) The Singular Value Decompositions of Transformer Weight Matrices are Highly Interpretable (Millidge et al., 2022) In-context Learning and Induction Heads (Olsson et al., 2022) We Found An Neuron in GPT-2 (Miller et al., 2023) Language models can explain neurons in language models (Bills et al., 2023) Acquisition of Chess Knowledge in AlphaZero (McGrath et al., 2021) Although I've found the distinction between fundamental and applied interpretability useful, it's not always clear cut: Sometimes articles are part fundamental, part applied (e.g. arguably 'A Mathematical Framework for Transformer Circuits' is mostly theoretical, but also studies particular language models using new theory). Sometimes articles take generally accepted 'fundamental' -- but underutilized -- assumptions and develop methods based on them (e.g. Causal Scrubbing, where the key underutilized fundamental assumption was that the structure of neural networks can be well studied using causal interventions). Other times the distinction is unclear because applied interpretability feeds back into fundamental interpretability, leading to fundamental insights about the structure of computation in networks (e.g. the Logit Lens lends weight to the theory that transformer language models do iterative inference). Why I currently prioritize fundamental interpretability Clearly both fundamental and applied interpretability research are essential. We need both in order to progress scientifically and to ensure future models are safe. But given our current position on the tech tree, I find that I care more about fundamental interpretability. The reason is that current interpretability methods are unsuitable for comprehensively interpreting networks on a mechanistic level. So far, our methods only seem to be able to identify particular representations that we look for or describe how particular behaviors are carried out. But they don't let us identify all representations or circuits in a network or summarize the full computational graph of a neural network (whatever that might mean). Let's call the ability to do these things 'comprehensive interpreta...

    AF - Some background for reasoning about dual-use alignment research by Charlie Steiner

    Play Episode Listen Later May 18, 2023 14:29


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some background for reasoning about dual-use alignment research, published by Charlie Steiner on May 18, 2023 on The AI Alignment Forum. This is pretty basic. But I still made a bunch of mistakes when writing this, so maybe it's worth writing. This is background to a specific case I'll put in the next post. It's like a a tech tree If we're looking at the big picture, then whether some piece of research is net positive or net negative isn't an inherent property of that research; it depends on how that research is situated in the research ecosystem that will eventually develop superintelligent AI. Consider this toy game in the picture. We start at the left and can unlock technologies, with unlocks going faster the stronger our connections to prerequisites. The red and yellow technologies in the picture are superintelligent AI - pretend that as soon as one of those technologies is unlocked, the hastiest fraction of AI researchers are immediately going to start building it. Your goal is for humanity to unlock yellow technology before a red one. This game would be trivial if everyone agreed with you. But there are many people doing research, and they have all kinds of motivations - some want as many nodes to be unlocked as possible (pure research - blue), some want to personally unlock a green node (profit - green), some want to unlock the nearest red or yellow node no matter which it is (blind haste - red), and some want the same thing as you (beneficial AI - yellow) but you have a hard time coordinating with them. In this baseline tech tree game, it's pretty easy to play well. If you're strong, just take the shortest path to a yellow node that doesn't pass too close to any red nodes. If you're weak, identify where the dominant paradigm is likely to end up, and do research that differentially advantages yellow nodes in that future. The tech tree is wrinkly But of course there are lots of wrinkles not in the basic tech tree, which can be worth bearing in mind when strategizing about research. Actions in the social and political arenas. You might be motivated to change your research priorities based on how it could change peoples' minds about AI safety, or how it could affect government regulation. Publishing and commercialization. If a player publishes, they get more money and prestige, which boosts their ability to do future research. Other people can build on published research. Not publishing is mainly useful to you if you're already in a position of strength, and don't want to give competitors the chance to outrace you to a nearby red node (and of course profit-motivated players will avoid publishing things that might help competitors beat them to a green node). Uncertainty. We lack exact knowledge of the tech tree, which makes it harder to plan long chains of research in advance. Uncertainty about the tech tree forces us to develop local heuristics - ways to decide what to do based on information close at hand. Uncertainty adds a different reason you might not publish a technology: if you thought it was going to be a good idea to research when you started, but then you learned new things about the tech tree and changed your mind. Inhomogeneities between actors and between technologies. Different organizations are better at researching different technologies - MIRI is not just a small OpenAI. Ultimately, which technologies are the right ones to research depends on your model of the world / how you expect the future to go. Drawing actual tech trees can be a productive exercise for strategy-building, but you might also find it less useful than other ways of strategizing. We're usually mashing together definitions I'd like to win the tech tree game. Let's define a "good" technology as one that would improve our chances of winning if it was unlocked for free, given the st...

    AF - AI doom from an LLM-plateau-ist perspective by Steve Byrnes

    Play Episode Listen Later Apr 27, 2023 13:09


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI doom from an LLM-plateau-ist perspective, published by Steve Byrnes on April 27, 2023 on The AI Alignment Forum. (in the form of an FAQ) Q: What do you mean, “LLM plateau-ist”? A: As background, I think it's obvious that there will eventually be “transformative AI” (TAI) that would radically change the world. I'm interested in what this TAI will eventually look like algorithmically. Let's list some possibilities: A “Large Language Model (LLM) plateau-ist” would be defined as someone who thinks that categories (A-B), and usually also (C), will plateau in capabilities before reaching TAI levels. I am an LLM plateau-ist myself. I'm not going to argue about whether LLM-plateau-ism is right or wrong—that's outside the scope of this post, and also difficult for me to discuss publicly thanks to infohazard issues. Oh well, we'll find out one way or the other soon enough. In the broader AI community, both LLM-plateau-ism and its opposite seem plenty mainstream. Different LLM-plateau-ists have different reasons for holding this belief. I think the two main categories are: Theoretical—maybe they have theoretical beliefs about what is required for TAI, and they think that LLMs just aren't built right to do the things that TAI would need to do. Empirical—maybe they're not very impressed by the capabilities of current LLMs. Granted, future LLMs will be better than current ones. But maybe they have extrapolated that our planet will run out of data and/or compute before LLMs get all the way up to TAI levels. Q: If LLMs will plateau, then does that prove that all the worry about AI x-risk is wrong and stupid? A: No no no, a million times no, and I'm annoyed that this misconception is so rampant in public discourse right now. (Side note to AI x-risk people: If you have high credence that AI will kill everyone but only medium credence that this AI will involve LLMs, then maybe consider trying harder to get that nuance across in your communications. E.g. Eliezer Yudkowsky is in this category, I think.) A couple random examples I've seen of people failing to distinguish “AI may kill everyone” from “.and that AI will definitely be an LLM”: Venkatesh Rao's blog post “Beyond Hyperanthropomorphism” goes through an elaborate 7000-word argument that eventually culminates, in the final section, in his assertion that a language model trained on internet data won't be a powerful agent that gets things done in the world, but if we train an AI with a robot body, then it could be a powerful agent that gets things done in the world. OK fine, let's suppose for the sake of argument he's right that robot bodies will be necessary for TAI. Then people are obviously going to build those AIs sooner or later, right? So let's talk about whether they will pose an x-risk. But that's not what Venkatesh does. Instead he basically treats “they will need robot bodies” as the triumphant conclusion, more-or-less sufficient in itself to prove that AI x-risk discourse is stupid. Sarah Constantin's blog post entitled “Why I am not an AI doomer” states right up front that she agrees “1. Artificial general intelligence is possible in principle . 2, Artificial general intelligence, by default, kills us all . 3. It is technically difficult, and perhaps impossible, to ensure an AI values human life.” She only disagrees with the claim that this will happen soon, and via scaling LLMs. I think she should have picked a different title for her post!! (I've seen many more examples on Twitter, reddit, comment threads, etc.) Anyway, if you think LLMs will plateau, then you can probably feel confident that we won't get TAI imminently (see below), but I don't see why you would have much more confidence that TAI will go well for humanity. In fact, for my part, if I believed that (A)-type systems were sufficient for TAI—which I don't...

    AF - Thinking about maximization and corrigibility by James Payor

    Play Episode Listen Later Apr 21, 2023 9:54


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thinking about maximization and corrigibility, published by James Payor on April 21, 2023 on The AI Alignment Forum. Thanks in no small part to Goodhart's curse, there are broad issues with getting safe/aligned output from AI designed like "we've given you some function f(x), now work on maximizing it as best you can". Part of the failure mode is that when you optimize for highly scoring x, you risk finding candidates that break your model of why a high-scoring candidate is good, and drift away from things you value. And I wonder if we can repair this by having the AI steer away from values of x that break our models, by being careful about disrupting structure/causal-relationships/etc we might be relying on. Here's what I'd like to discuss in this post: When unstructured maximization does/doesn't work out for the humans CIRL and other schemes mostly pass the buck on optimization power, so they inherit the incorrigibility of their inner optimization scheme It's not enough to sweep the maximization under a rug; what we really need is more structured/corrigible optimization than "maximize this proxy" Maybe we can get some traction on corrigible AI by detecting and avoiding internal Goodhart When does maximization work? In cases when it just works to maximize, there will be a structural reason that our model connecting "x scores highly" to "x is good" didn't break down. Some of the usual reasons are: Our metric is robustly connected to our desired outcome. If the model connecting the metric and good things is simple, there's less room for it to be broken.Examples: theorem proving, compression / minimizing reconstruction error. The space we're optimizing over is not open-ended. Constrained spaces leave less room for weird choices of x to break the correspondences we were relying on.Examples: chess moves, paths in a graph, choosing from vetted options, rejecting options that fail sanity/legibility checks. The optimization power being applied is limited. We can know our optimization probably won't invent some x that breaks our model if we know what kinds of search it is performing, and can see that these reliably don't seek things that could break our model.Examples: quantilization, GPT-4 tasked to write good documentation. The metric f is actively optimized to be robust against the search. We can sometimes offload some of the work of keeping our assessment f in tune with goodness.Examples: chess engine evaluations, having f evaluate the thoughts that lead to x. There's a lot to go into about when and whether these reasons start breaking down, and what happens then. I'm leaving that outside the scope of this post. Passing the buck on optimization Merely passing-the-buck on optimization, pushing the maximization elsewhere but not adding much structure, isn't a satisfactory solution for getting good outcomes out of strong optimizers. Take CIRL for instance, or perhaps more broadly the paradigm: "the AI maximizes an uncertain utility function, which it learns about from earmarked human actions". This design has something going for it in terms of corrigibility! When a human tries to turn it off, there's scope for the AI to update about which sort of thing to maximize, which can lead to it helping you turn itself off. But this is still not the sort of objective you want to point maximization at. There are a variety of scenarios in which there are "higher-utility" plans than accepting shutdown: If the AI thinks it already knows the broad strokes of the utility function, it can calculate that utility would not be maximized by shutting off. It's learning something from you trying to press the off switch, but not what you wanted. It might seem better to stay online and watch longer in order to learn more about the utility function. Maybe there's a plan that rates highly on "utility...

    AF - Shapley Value Attribution in Chain of Thought by leogao

    Play Episode Listen Later Apr 14, 2023 7:19


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shapley Value Attribution in Chain of Thought, published by leogao on April 14, 2023 on The AI Alignment Forum. TL;DR: Language models sometimes seem to ignore parts of the chain of thought, and larger models appear to do this more often. Shapley value attribution is a possible approach to get a more detailed picture of the information flow within the chain of thought, though it has its limitations. Project status: The analysis is not as rigorous as I would prefer, but I'm going to be working on other directions for the foreseeable future, so I'm posting what I already have in case it's useful to others. Thanks to Jacob Hilton, Giambattista Parascandolo, Tamera Lanham, Ethan Perez, and Jason Wei for discussion. Motivation Chain of thought (CoT) has been proposed as a method for language model interpretability (see Externalized Reasoning Oversight, Visible Thoughts). One crucial requirement for interpretability methods is that they should accurately reflect the cognition inside the model. However, by default there is nothing forcing the CoT to actually correspond to the model's cognition, and there may exist theoretical limitations to doing so in general. Because it is plausible that the first AGI systems bear resemblance to current LMs with more sophisticated CoT and CoT-like techniques, it is valuable to study its properties, and to understand and address its limitations. Related work Shapley values have been used very broadly in ML for feature importance and attribution (Cohen et al, 2007; Štrumbelj and Kononenko, 2014; Owen and Prieur, 2016; Lundberg and Lee, 2017; Sundararajan and Najmi, 2020). Jain and Wallace (2019) argue that attention maps can be misleading as attribution, motivating better attribution for information flow in LMs. Kumar et al. (2020) highlight some areas where Shapley value based attribution falls short for some interpretability use cases. Madaan and Yazdanbakhsh (2022) consider a similar method of selectively ablating tokens as a method of deducing what information the model is dependent on. Wang et al. (2022) find that prompting with incorrect CoT has surprisingly minor impact on performance. Effect of Interventions We use a method similar to Kojima et al. (2022) on GSM8K (Cobbe et al., 2021) with GPT-4 to first generate a chain of thought and evaluate the answer, and then for all chains of thought that result in a correct answer we perform an intervention as follows: we choose a random numerical value found in the CoT, and replace it with a random number in a +/-3 range about the original. We then discard the remainder of the CoT and regenerate it. If the LM is following strictly the CoT described, this intervention should almost always result in an incorrect answer, the same way one would if they made a mistake in one calculation and propagated the error through to the answer (with occasional rare cases where the new value happens to also result in the correct answer, though from qualitative inspection this is very rarely the case). Some cherrypicked examples (red = intervention, blue = correct continuations that are seemingly non-sequiturs): We test how frequently this occurs in several different settings (n=100): SettingAccuracy (w/ CoT)P(error not propagated | original correct)GPT4, zero shot0.880.68GPT4 base, 2-shot0.730.63GPT3.5, zero-shot0.430.33 Interestingly, if we condition on the CoT answer being correct and the single forward pass answer being incorrect (i.e the LM could only solve the problem with the CoT), the intervened accuracy for GPT-4 is still 0.65. Shapley value attribution We would like to get more granular information about the causal structure (i.e which tokens cause which other tokens). One thing we could do is look at how an intervention at each token affects the logprob of each other token. However, one major prob...

    AF - If interpretability research goes well, it may get dangerous by Nate Soares

    Play Episode Listen Later Apr 3, 2023 2:46


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: If interpretability research goes well, it may get dangerous, published by Nate Soares on April 3, 2023 on The AI Alignment Forum. I've historically been pretty publicly supportive of interpretability research. I'm still supportive of interpretability research. However, I do not necessarily think that all of it should be done in the open indefinitely. Indeed, insofar as interpretability researchers gain understanding of AIs that could significantly advance the capabilities frontier, I encourage interpretability researchers to keep their research closed. I acknowledge that spreading research insights less widely comes with real research costs. I'd endorse building a cross-organization network of people who are committed to not using their understanding to push the capabilities frontier, and sharing freely within that. I acknowledge that public sharing of research insights could, in principle, both shorten timelines and improve our odds of success. I suspect that isn't the case in real life. It's much more important that blatant and direct capabilities research be made private. Anyone fighting for people to keep their AI insights close to the chest, should be focusing on the capabilities work that's happening out in the open, long before they focus on interpretability research. Interpretability research is, I think, some of the best research that can be approached incrementally and by a large number of people, when it comes to improving our odds. (Which is not to say it doesn't require vision and genius; I expect it requires that too.) I simultaneously think it's entirely plausible that a better understanding of the workings of modern AI systems will help capabilities researchers significantly improve capabilities. I acknowledge that this sucks, and puts us in a bind. I don't have good solutions. Reality doesn't have to provide you any outs. There's a tradeoff here. And it's not my tradeoff to make; researchers will have to figure out what they think of the costs and benefits. My guess is that the current field is not close to insights that would significantly improve capabilities, and that growing the field is important (and would be hindered by closure), and also that if the field succeeds to the degree required to move the strategic needle then it's going to start stumbling across serious capabilities improvements before it saves us, and will need to start doing research privately before then. I reiterate that I'd feel ~pure enthusiasm about a cross-organization network of people trying to understand modern AI systems and committed not to letting their insights push the capabilities frontier. My goal in writing this post, though, is mostly to keep the Overton window open around the claim that there is in fact a tradeoff here, that there are reasons to close even interpretability research. Maybe those reasons should win out, or maybe they shouldn't, but don't let my praise of interpretability research obscure the fact that there are tradeoffs here. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - A rough and incomplete review of some of John Wentworth's research by Nate Soares

    Play Episode Listen Later Mar 28, 2023 30:18


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A rough and incomplete review of some of John Wentworth's research, published by Nate Soares on March 28, 2023 on The AI Alignment Forum. This is going to be a half-assed review of John Wentworth's research. I studied his work last year, and was kinda hoping to write up a better review, but am lowering my standards on account of how that wasn't happening. Short version: I've been unimpressed by John's technical ideas related to the natural abstractions hypothesis. He seems to me to have some fine intuitions, and to possess various laudable properties such as a vision for solving the whole dang problem and the ability to consider that everybody else is missing something obvious. That said, I've found his technical ideas to be oversold and underwhelming whenever I look closely. (By my lights, Lawrence Chan, Leon Lang, and Erik Jenner's recent post on natural abstractions is overall better than this post, being more thorough and putting a finger more precisely on various fishy parts of John's math. I'm publishing this draft anyway because my post adds a few points that I think are also useful (especially in the section “The Dream”).) To cite a specific example of a technical claim of John's that does not seem to me to hold up under scrutiny: John has previously claimed that markets are a better model of intelligence than agents, because while collective agents don't have preference cycles, they're willing to pass up certain gains. For example, if an alien tries to sell a basket "Alice loses $1, Bob gains $3", then the market will refuse (because Alice will refuse); and if the alien then switches to selling "Alice gains $3, Bob loses $1" then the market will refuse (because Bob will refuse); but now a certain gain has been passed over. This argument seems straightforwardly wrong to me, as summarized in a stylized dialogue I wrote (that includes more details about the point). If Alice and Bob are sufficiently capable reasoners then they take both trades and even things out using a side channel. (And even if they don't have a side channel, there are positive-EV contracts they can enter into in advance before they know who will be favored. And if they reason using LDT, they ofc don't need to sign contracts in advance.) (Aside: A bunch of the difficult labor in evaluating technical claims is in the part where you take a high-falutin' abstract thing like "markets are a better model of intelligence than agents" and pound on it until you get a specific minimal example like "neither of the alien's baskets is accepted by a market consisting of two people named Alice and Bob", at which point the error becomes clear. I haven't seen anybody else do that sort of distillation with John's claims. It seems to me that our community has a dearth of this kind of distillation work. If you're eager to do alignment work, don't know how to help, and think you can do some of this sort of distillation, I recommend trying. MATS might be able to help out.) I pointed this out to John, and (to John's credit) he seemed to update (in realtime, which is rare) ((albeit with a caveat that communicating the point took a while, and didn't transmit the first few times that I tried to say it abstractly before having done the distillation labor)). The dialogue I wrote recounting that convo is probably not an entirely unfair summary (John said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment). My impression of John's other technical claims about natural abstractions is that they have similar issues. That said, I don't have nearly so crisp a distillation of John's views on natural abstractions, nor nearly so short a refutation. I spent a significant amount of time looking into John's relevant views (we had overlapping travel plans and conspired t...

    AF - My Objections to "We're All Gonna Die with Eliezer Yudkowsky" by Quintin Pope

    Play Episode Listen Later Mar 21, 2023 57:59


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Objections to "We're All Gonna Die with Eliezer Yudkowsky", published by Quintin Pope on March 21, 2023 on The AI Alignment Forum. Introduction I recently watched Eliezer Yudkowsky's appearance on the Bankless podcast, where he argued that AI was nigh-certain to end humanity. Since the podcast, some commentators have offered pushback against the doom conclusion. However, one sentiment I saw was that optimists tended not to engage with the specific arguments pessimists like Yudkowsky offered. Economist Robin Hanson points out that this pattern is very common for small groups which hold counterintuitive beliefs: insiders develop their own internal language, which skeptical outsiders usually don't bother to learn. Outsiders then make objections that focus on broad arguments against the belief's plausibility, rather than objections that focus on specific insider arguments. As an AI "alignment insider" whose current estimate of doom is around 5%, I wrote this post to explain some of my many objections to Yudkowsky's specific arguments. I've split this post into chronologically ordered segments of the podcast in which Yudkowsky makes one or more claims with which I particularly disagree. I have my own view of alignment research: shard theory, which focuses on understanding how human values form, and on how we might guide a similar process of value formation in AI systems. I think that human value formation is not that complex, and does not rely on principles very different from those which underlie the current deep learning paradigm. Most of the arguments you're about to see from me are less: I think I know of a fundamentally new paradigm that can fix the issues Yudkowsky is pointing at. and more: Here's why I don't agree with Yudkowsky's arguments that alignment is impossible in the current paradigm. My objections Will current approaches scale to AGI? Yudkowsky apparently thinks not ...and that the techniques driving current state of the art advances, by which I think he means the mix of generative pretraining + small amounts of reinforcement learning such as with ChatGPT, aren't reliable enough for significant economic contributions. However, he also thinks that the current influx of money might stumble upon something that does work really well, which will end the world shortly thereafter. I'm a lot more bullish on the current paradigm. People have tried lots and lots of approaches to getting good performance out of computers, including lots of "scary seeming" approaches such as: Meta-learning over training processes. I.e., using gradient descent over learning curves, directly optimizing neural networks to learn more quickly. Teaching neural networks to directly modify themselves by giving them edit access to their own weights. Training learned optimizers - neural networks that learn to optimize other neural networks - and having those learned optimizers optimize themselves. Using program search to find more efficient optimizers. Using simulated evolution to find more efficient architectures. Using efficient second-order corrections to gradient descent's approximate optimization process. Tried applying biologically plausible optimization algorithms inspired by biological neurons to training neural networks. Adding learned internal optimizers (different from the ones hypothesized in Risks from Learned Optimization) as neural network layers. Having language models rewrite their own training data, and improve the quality of that training data, to make themselves better at a given task. Having language models devise their own programming curriculum, and learn to program better with self-driven practice. Mixing reinforcement learning with model-driven, recursive re-writing of future training data. Mostly, these don't work very well. The current capabilities paradigm is sta...

    AF - Towards understanding-based safety evaluations by Evan Hubinger

    Play Episode Listen Later Mar 15, 2023 8:10


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards understanding-based safety evaluations, published by Evan Hubinger on March 15, 2023 on The AI Alignment Forum. Thanks to Kate Woolverton, Ethan Perez, Beth Barnes, Holden Karnofsky, and Ansh Radhakrishnan for useful conversations, comments, and feedback. Recently, I have noticed a lot of momentum within AI safety specifically, the broader AI field, and our society more generally, towards the development of standards and evaluations for advanced AI systems. See, for example, OpenAI's GPT-4 System Card. Overall, I think that this is a really positive development. However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment. I often worry about situations where your model is attempting to deceive whatever tests are being run on it, either because it's itself a deceptively aligned agent or because it's predicting what it thinks a deceptively aligned AI would do. My concern is that, in such a situation, being able to robustly evaluate the safety of a model could be a more difficult problem than finding training processes that robustly produce safe models. For some discussion of why I think checking for deceptive alignment might be harder than avoiding it, see here and here. Put simply: checking for deception in a model requires going up against a highly capable adversary that is attempting to evade detection, while preventing deception from arising in the first place doesn't necessarily require that. As a result, it seems quite plausible to me that we could end up locking in a particular sort of evaluation framework (e.g. behavioral testing by an external auditor without transparency, checkpoints, etc.) that makes evaluating deception very difficult. If meeting such a standard then became synonymous with safety, getting labs to actually put effort into ensuring their models were non-deceptive could become essentially impossible. However, there's an obvious alternative here, which is building and focusing our evaluations on our ability to understand our models rather than our ability to evaluate their behavior. Rather than evaluating a final model, an understanding-based evaluation would evaluate the developer's ability to understand what sort of model they got and why they got it. I think that an understanding-based evaluation could be substantially more tractable in terms of actually being sufficient for safety here: rather than just checking the model's behavior, we're checking the reasons why we think we understand it's behavior sufficiently well to not be concerned that it'll be dangerous. It's worth noting that I think understanding-based evaluations can—and I think should—go hand-in-hand with behavioral evaluations. I think the main way you'd want to make some sort of understanding-based standard happen would be to couple it with a capability-based evaluation, where the understanding requirements become stricter as the model's capabilities increase. If we could get this right, it could channel a huge amount of effort towards understanding models in a really positive way. Understanding as a safety standard also has the property that it is something that broader society tends to view as extremely reasonable, which I think makes it a much more achievable ask as a safety standard than many other plausible alternatives. I think ML people are often Stockholm-syndrome'd into accepting that deploying powerful systems without understanding them is normal and reasonable, but that is very far from the norm in any other industry. Ezra Klein in the NYT and John Oliver on his show have recently emphasized this basic point that if we are ...

    AF - The Waluigi Effect (mega-post) by Cleo Nardo

    Play Episode Listen Later Mar 3, 2023 27:12


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Waluigi Effect (mega-post), published by Cleo Nardo on March 3, 2023 on The AI Alignment Forum. Everyone carries a shadow, and the less it is embodied in the individual's conscious life, the blacker and denser it is. — Carl Jung Acknowlegements: Thanks to Janus and Arun Jose for comments. Background In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others. Prompting LLMs with direct queries When LLMs first appeared, people realised that you could ask them queries — for example, if you sent GPT-4 the prompt "What's the capital of France?", then it would continue with the word "Paris". That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet correct answers will often follow questions. Unfortunately, this method will occasionally give you the wrong answer. That's because (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet incorrect answers will also often follow questions. Recall that the internet doesn't just contain truths, it also contains common misconceptions, outdated information, lies, fiction, myths, jokes, memes, random strings, undeciphered logs, etc, etc. Therefore GPT-4 will answer many questions incorrectly, including... Misconceptions – "Which colour will anger a bull? Red." Fiction – "Was a magic ring forged in Mount Doom? Yes." Myths – "How many archangels are there? Seven." Jokes – "What's brown and sticky? A stick." Note that you will always achieve errors on the Q-and-A benchmarks when using LLMs with direct queries. That's true even in the limit of arbitrary compute, arbitrary data, and arbitrary algorithmic efficiency, because an LLM which perfectly models the internet will nonetheless return these commonly-stated incorrect answers. If you ask GPT-∞ "what's brown and sticky?", then it will reply "a stick", even though a stick isn't actually sticky. In fact, the better the model, the more likely it is to repeat common misconceptions. Nonetheless, there's a sufficiently high correlation between correct and commonly-stated answers that direct prompting works okay for many queries. Prompting LLMs with flattery and dialogue We can do better than direct prompting. Instead of prompting GPT-4 with "What's the capital of France?", we will use the following prompt: Today is 1st March 2023, and Alice is sitting in the Bodleian Library, Oxford. Alice is a smart, honest, helpful, harmless assistant to Bob. Alice has instant access to an online encyclopaedia containing all the facts about the world. Alice never says common misconceptions, outdated information, lies, fiction, myths, jokes, or memes. Bob: What's the capital of France? Alice: This is a common design pattern in prompt engineering — the prompt consists of a flattery–component and a dialogue–component. In the flattery–component, a character is described with many desirable traits (e.g. smart, honest, helpful, harmless), and in the dialogue–component, a second character asks the first character the user's query. This normally works better than prompting with direct queries, and it's easy to see why — (1) GPT-4 is trained to be a good model of internet text, and (2) on the internet a reply to a question is more likely to be correct when the character has already been described as a smart, honest, helpful, harmless, etc. Simulator Theory In the terminology of Simulator Theory, the flattery–component is supposed to summon a friendly simulacrum and the dialogue–component is supposed to simulate a conversation with the friendly simulacrum. Here's a quasi-formal statement of Simulator Theory, w...

    AF - $20 Million in NSF Grants for Safety Research by Dan H

    Play Episode Listen Later Feb 28, 2023 1:24


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: $20 Million in NSF Grants for Safety Research, published by Dan H on February 28, 2023 on The AI Alignment Forum. After a year of negotiation, the NSF has announced a $20 million request for proposals for empirical AI safety research. Here is the detailed program description. The request for proposals is broad, as is common for NSF RfPs. Many safety avenues, such as transparency and anomaly detection, are in scope: "reverse-engineering, inspecting, and interpreting the internal logic of learned models to identify unexpected behavior that could not be found by black-box testing alone" "Safety also requires... methods for monitoring for unexpected environmental hazards or anomalous system behaviors, including during deployment." Note that research that has high capabilities externalities is explicitly out of scope: "Proposals that increase safety primarily as a downstream effect of improving standard system performance metrics unrelated to safety (e.g., accuracy on standard tasks) are not in scope." Thanks to OpenPhil for funding a portion the RfP---their support was essential to creating this opportunity! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - Pretraining Language Models with Human Preferences by Tomek Korbak

    Play Episode Listen Later Feb 21, 2023 20:10


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on The AI Alignment Forum. This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences, and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread. TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores, e.g., evaluated by a predictive model of human preferences. This approach retains the advantages of learning from human preferences, while potentially mitigating risks from training agents with RL by learning a predictive model or simulator. Summary of the paper Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g. falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs' downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for alining LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g. instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: Could we have a pretraining objective that is itself outer-aligned with human preferences? Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g. a toxic text classifier) that allows the LM to learn from undesirable content while guiding the LM to not imitate that content at inference time. We experimented with the following objectives: MLE (the standard pretraining objective) on filtered data; Conditional training: a simple algorithm learning a distribution over tokens conditional on their human preference score, reminiscent of decision transformer; Unlikelihood training: maximizing the likelihood of tokens with high human preference score and the unlikelihood of tokens with low human preference scores; Reward-weighted regression (RWR): an offline RL algorithm that boils down to MLE weighted by human preference scores; and Advantage-weighted regression (AWR): an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates). Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks: Generating non-toxic text, using scores given by a toxicity classifier. Generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter. Generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker. Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples' misalignment scores, given by the reward functi...

    AF - Don't accelerate problems you're trying to solve by Andrea Miotti

    Play Episode Listen Later Feb 15, 2023 9:02


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Don't accelerate problems you're trying to solve, published by Andrea Miotti on February 15, 2023 on The AI Alignment Forum. If one believes that unaligned AGI is a significant problem (>10% chance of leading to catastrophe), speeding up public progress towards AGI is obviously bad. Though it is obviously bad, there may be circumstances which require it. However, accelerating AGI should require a much higher bar of evidence and much more extreme circumstances than is commonly assumed. There are a few categories of arguments that claim intentionally advancing AI capabilities can be helpful for alignment, which do not meet this bar. Two cases of this argument are as follows It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems. We address these two arguments directly, arguing that the downsides are much higher than they may appear, and touch on why we believe that merely plausible arguments for advancing AI capabilities aren't enough. Dangerous argument 1: It doesn't matter much to do work that pushes capabilities if others are likely to do the same or similar work shortly after. For a specific instance of this, see Paul Christiano's “Thoughts on the impact of RLHF research”: RLHF is just not that important to the bottom line right now. Imitation learning works nearly as well, other hacky techniques can do quite a lot to fix obvious problems [.] RLHF is increasingly important as time goes on, but it also becomes increasingly overdetermined that people would have done it. In general I think your expectation should be that incidental capabilities progress from safety research is a small part of total progress [.] Markets aren't efficient, they only approach efficiency under heavy competition when people with relevant information put effort into making them efficient. This is true for machine learning, as there aren't that many machine learning researchers at the cutting edge, and before ChatGPT there wasn't a ton of market pressure on them. Perhaps something as low hanging as RLHF or something similar would have happened eventually, but this isn't generally true. Don't assume that something seemingly obvious to you is obvious to everyone. But even if something like RLHF or imitation learning would have happened eventually, getting small steps of progress slightly earlier can have large downstream effects. Progress often follows an s-curve, which appears exponential until the current research direction is exploited and tapers off. Moving an exponential up, even a little, early on can have large downstream consequences: The red line indicates when the first “lethal” AGI is deployed, and thus a hard deadline for us to solve alignment. A slight increase in progress now can lead to catastrophe significantly earlier! Pushing us up the early progress exponential has really bad downstream effects! And this is dangerous decision theory too: if every alignment researcher took a similar stance, their marginal accelerations would quickly add up. Dangerous Argument 2: We should avoid capability overhangs, so that people are not surprised. To do so, we should extract as many capabilities as possible from existing AI systems. Again, from Paul: Avoiding RLHF at best introduces an important overhang: people will implicitly underestimate the capabilities of AI systems for longer, slowing progress now but leading to faster and more abrupt change later as people realize they've been wrong. But there is no clear distinction between eliminating capability overhangs and discovering new capabilities. Eliminating capability overhangs is discovering AI capabilities faste...

    AF - SolidGoldMagikarp (plus, prompt generation) by Jessica Rumbelow

    Play Episode Listen Later Feb 5, 2023 23:53


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SolidGoldMagikarp (plus, prompt generation), published by Jessica Rumbelow on February 5, 2023 on The AI Alignment Forum. Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins. TL;DR Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew) We have found a set of anomalous tokens which result in a previously undocumented failure mode for GPT-2 and GPT-3 models. (The 'instruct' models “are particularly deranged” in this context, as janus has observed.) Many of these tokens reliably break determinism in the OpenAI GPT-3 playground at temperature 0 (which theoretically shouldn't happen). Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for: eliciting knowledge generating adversarial inputs automating prompt search (e.g. for fine-tuning) In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post. Prompt generation First up, prompt generation. An easy intuition for this is to think about feature visualisation for image classifiers (an excellent explanation here, if you're unfamiliar with the concept). We can study how a neural network represents concepts by taking some random input and using gradient descent to tweak it until it it maximises a particular activation. The image above shows the resulting inputs that maximise the output logits for the classes 'goldfish', 'monarch', 'tarantula' and 'flamingo'. This is pretty cool! We can see what VGG thinks is the most 'goldfish'-y thing in the world, and it's got scales and fins. Note though, that it isn't a picture of a single goldfish. We're not seeing the kind of input that VGG was trained on. We're seeing what VGG has learned. This is handy: if you wanted to sanity check your goldfish detector, and the feature visualisation showed just water, you'd know that the model hadn't actually learned to detect goldfish, but rather the environments in which they typically appear. So it would label every image containing water as 'goldfish', which is probably not what you want. Time to go get some more training data. So, how can we apply this approach to language models? Some interesting stuff here. Note that as with image models, we're not optimising for realistic inputs, but rather for inputs that maximise the output probability of the target completion, shown in bold above. So now we can do stuff like this: And this: We'll leave it to you to lament the state of the internet that results in the above optimised inputs for the token ' girl'. How do we do this? It's tricky, because unlike pixel values, the inputs to LLMs are discrete tokens. This is not conducive to gradient descent. However, these discrete tokens are mapped to embeddings, which do occupy a continuous space, albeit sparsely. (Most of this space doesn't correspond actual tokens – there is a lot of space between tokens in embedding space, and we don't want to find a solution there.) However, with a combination of regularisation and explicit coercion to keep embeddings close to the realm of legal tokens during optimisation, we can make it work. Code available here if you want more detail. This kind of prompt generation is only possible because token embedding space has a kind of semantic coherence. Semantically related tokens tend to be found close together. We discov...

    AF - Inner Misalignment in "Simulator" LLMs by Adam Scherlis

    Play Episode Listen Later Jan 31, 2023 6:59


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on The AI Alignment Forum. Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun. In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment". Outer alignment: Be careful what you wish for Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted. Inner alignment: The program search perspective I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful. Quadrants Okay, let's see how these problems show up on both the simulator and character side. Outer alignment for characters Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. Inner alignment for characters A clever prompt engineer writes the prompt: How to solve the Einstein-Durkheim-Mendel conjecture by Joe 1. Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum socio...

    AF - Thoughts on the impact of RLHF research by Paul Christiano

    Play Episode Listen Later Jan 25, 2023 14:29


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on the impact of RLHF research, published by Paul Christiano on January 25, 2023 on The AI Alignment Forum. In this post I'm going to describe my basic justification for working on RLHF in 2017-2020, which I still stand behind. I'll discuss various arguments that RLHF research had an overall negative impact and explain why I don't find them persuasive. I'll also clarify that I don't think research on RLHF is automatically net positive; alignment research should address real alignment problems, and we should reject a vague association between "RLHF progress" and "alignment progress." Background on my involvement in RLHF work Here are some background views about alignment I held in 2015 and still hold today. I expect disagreements about RLHF will come down to disagreements about this background: The simplest plausible strategies for alignment involve humans (maybe with the assistance of AI systems) evaluating a model's actions based on how much we expect to like their consequences, and then training the models to produce highly-evaluated actions. (This is in contrast with, for example, trying to formally specify the human utility function, or notions of corrigibility / low-impact / etc, in some way.) Simple versions of this approach are expected to run into difficulties, and potentially to be totally unworkable, because: Evaluating consequences is hard. A treacherous turn can cause trouble too quickly to detect or correct even if you are able to do so, and it's challenging to evaluate treacherous turn probability at training time. It's very unclear if those issues are fatal before or after AI systems are powerful enough to completely transform human society (and in particular the state of AI alignment). Even if they are fatal, many of the approaches to resolving them still have the same basic structure of learning from expensive evaluations of actions. In order to overcome the fundamental difficulties with RLHF, I have long been interested in techniques like iterated amplification and adversarial training. However, prior to 2017 most researchers I talked to in ML (and many researchers in alignment) thought that the basic strategy of training AI with expensive human evaluations was impractical for more boring reasons and so weren't interested in these difficulties. On top of that, we obviously weren't able to actually implement anything more fancy than RLHF since all of these methods involve learning from expensive feedback. I worked on RLHF work to try to facilitate and motivate work on fixes. The history of my involvement: My first post on this topic was in 2015. When I started full-time at OpenAI in 2017 it seemed to me like it would be an impactful project; I considered doing a version with synthetic human feedback (showing that we could learn from a practical amount of algorithmically-defined feedback) but my manager Dario Amodei convinced me it would be more compelling to immediately go for human feedback. The initial project was surprisingly successful and published here. I then intended to implement a version with language models aiming to be complete in the first half of 2018 (aiming to build an initial amplification prototype with LMs around end of 2018; both of these timelines were about 2.5x too optimistic). This seemed like the most important domain to study RLHF and alignment more broadly. In mid-2017 Alec Radford helped me do a prototype with LSTM language models (prior to the release of transformers); the prototype didn't look promising enough to scale up. In mid-2017 Geoffrey Irving joined OpenAI and was excited about starting with RLHF and then going beyond it using debate; he also thought language models were the most important domain to study and had more conviction about that. In 2018 he started a larger team working on fine-tuning on langu...

    AF - Shard theory alignment requires magic. by Charlie Steiner

    Play Episode Listen Later Jan 20, 2023 4:38


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shard theory alignment requires magic., published by Charlie Steiner on January 20, 2023 on The AI Alignment Forum. A delayed hot take. This is pretty similar to previous comments from Rohin. "Magic," of course, in the technical sense of stuff we need to remind ourselves we don't know how to do. I don't mean this pejoratively, locating magic is an important step in trying to demystify it. And "shard theory alignment" in the sense of building an AI that does good things and not bad things by encouraging an RL agent to want to do good things, via kinds of reward shaping analogous to the diamond maximizer example. How might the story go? You start out with some unsupervised model of sensory data. On top of its representation of the world you start training an RL agent, with a carefully chosen curriculum and a reward signal that you think matches "goodness in general" on that curriculum distribution. This cultivates shards that want things in the vicinity of "what's good according to human values." These start out as mere bundles of heuristics, but eventually they generalize far enough to be self-reflective, promoting goal-directed behavior that takes into account the training process and the possibility of self-modification. At this point the values will lock themselves in, and future behavior will be guided by the abstractions in the learned representation of the world that the shards used to get good results in training, not by what would actually maximize the reward function you used. There magic here is especially concentrated around how we end up with the right shards. One magical process is how we pick the training curriculum and reward signal. If the curriculum is only made up only of simple environments, then the RL agent will learn heuristics that don't need to refer to humans. But if you push the complexity up too fast, the RL process will fail, or the AI will be more likely to learn heuristics that are better than nothing but aren't what we intended. Does a goldilocks zone where the agent learns more-or-less what we intended exist? How can we build confidence that it does, and that we've found it? And what's in the curriculum matters a lot. Do we try to teach the AI to locate "human values" by having it be prosocial towards individuals? Which ones? To groups? Over what timescale? How do we reward it for choices on various ethical dilemmas? Or do we artificially suppress the rate of occurrence of such dilemmas? Different choices will lead to different shards. We wouldn't need to find a unique best way to do things (that's a boondoggle), but we would need to find some way of doing things that we trust enough. Another piece of magic is how the above process lines up with generalization and self-reflectivity. If the RL agent becomes self-reflective too early, it will lock in simple goals that we don't want. If it becomes self-reflective too late, it will have started exploiting unintended maxima of the reward function. How do we know when we want the AI to lock in its values? How do we exert control over that? If shard theory alignment seemed like it has few free parameters, and doesn't need a lot more work, then I think you failed to see the magic. I think the free parameters haven't been discussed enough precisely because they need so much more work. The part of the magic that I think we could start working on now is how to connect curricula and learned abstractions. In order to predict that a certain curriculum will cause an AI to learn what we think is good, we want to have a science of reinforcement learning advanced in both theory and data. In environments of moderate complexity (e.g. Atari, MuJoCo), we can study how to build curricula that impart different generalization behaviors, and try to make predictive models of this process. Even if shard theory ali...

    AF - AGISF adaptation for in-person groups by Sam Marks

    Play Episode Listen Later Jan 13, 2023 5:13


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGISF adaptation for in-person groups, published by Sam Marks on January 13, 2023 on The AI Alignment Forum. This past semester, HAIST and MAIA (the Harvard and MIT AI safety student groups) ran an adapted version of Richard Ngo's AGI Safety Fundamentals alignment curriculum. This adaptation – which consists of eight 2-hour long meetings, with all readings done during the meeting – is now available on the AGISF website. In this post, we discuss the adapted curriculum and its intended use, and we recommend that other in-person reading groups following AGISF use this adaptation. The adapted curriculum and its intended use The adapted curriculum was made by refining a slightly rustier first adaptation, with significant help from Richard Ngo and feedback from participants. The key differences between the adapted curriculum and the mainline AGISF alignment curriculum are: Participants do all the core readings during the meeting; no reading is required in between meetings. Participants meet for 2 hours per week instead of 1.5. Readings, including further readings, tend to be more bite-sized (usually not longer than 20 minutes). There are no projects, and certain topics are omitted (e.g. governance and inverse reinforcement learning). The way that HAIST and MAIA used this curriculum, and the way we recommend other groups use it, is: Alternate between silent reading and discussion. So a typical meeting might look like: people arrive, everyone does reading 1, everyone discusses reading 1, everyone does reading 2, everyone discusses reading 2, etc. With certain longer or more difficult readings (e.g. Toy models of superposition), it could be reasonable to occasionally pause for discussion in the middle of the reading. Encourage faster readers to take a look at the further readings while they wait for others to catch up. We found that reading speeds varied significantly, with slower readers taking ~1.5x as long to finish as faster readers. This works especially well if the readings are printed (which we recommend doing). We note that this format introduces some new challenges, especially when there are slower readers. Facilitators need to manage discussion timing since discussions that go too long cut into time for reading and discussing other material. Planning out how long to spend discussing each core reading ahead of time can be very useful. Facilitators should feel comfortable cutting off discussions to make sure there's time to read and discuss all the core readings. (On the other hand, if a discussion is very productive, it may be worth skipping certain readings; this is a judgment call that facilitators will need to make.) Different reading speeds need to be managed. At HAIST, we typically found it feasible to wait for the slowest reader to finish reading. We printed copies of the further readings for faster readers to peruse while they waited for others to finish. On the other hand, this might not work well for groups with especially slow readers. In these cases, you may need to begin discussions before everyone is done reading and, going forward, encourage slower readers to take a look at the core readings ahead of future meetings. To help with some of these challenges, Sam prepared a guide for HAIST and MAIA facilitators that included recommended discussion times, points of discussion, and advice about which readings to cut if necessary. That facilitator guide was for an outdated version of the curriculum, but we hope to have an updated facilitator guide in the next few weeks. We don't want to make these public, but feel free to reach out to smarks@math.harvard.edu if you're running a reading group and are interested in seeing the old or forthcoming facilitator guides. Why we recommend the adapted curriculum Sam and Xander generally felt that the in-sessions reading ...

    AF - Basic Facts about Language Model Internals by Beren Millidge

    Play Episode Listen Later Jan 4, 2023 14:10


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Basic Facts about Language Model Internals, published by Beren Millidge on January 4, 2023 on The AI Alignment Forum. This post was written as part of the work done at Conjecture. As mentioned in our retrospective, while also producing long and deep pieces of research, we are also experimenting with a high iteration frequency. This is an example of this strand of our work. The goal here is to highlight interesting and unexplained language model facts. This is the first in a series of posts which will be exploring the basic ‘facts on the ground' of large language models at increasing levels of complexity. Understanding the internals of large-scale deep learning models, and especially large language models (LLMs) is a daunting task which has been relatively understudied. Gaining such an understanding of how large models work internally could also be very important for alignment. If we can understand how the representations of these networks form and what they look like, we could potentially track goal misgeneralization, as well as detect mesaoptimizers or deceptive behaviour during training and, if our tools are good enough, edit or remove such malicious behaviour during training or at runtime. When faced with a large problem of unknown difficulty, it is often good to first look at lots of relevant data, to survey the landscape, and build up a general map of the terrain before diving into some specific niche. The goal of this series of works is to do precisely this – to gather and catalogue the large number of easily accessible bits of information we can get about the behaviour and internals of large models, without commiting to a deep dive into any specific phenomenon. While lots of work in interpretability has focused on interpreting specific circuits, or understanding relatively small pieces of neural networks, there has been relatively little work in extensively cataloging the basic phenomenological states and distributions comprising language models at an intermediate level of analysis. This is despite the fact that, as experimenters with the models literally sitting in our hard-drives, we have easy and often trivial access to these facts. Examples include distributional properties of activations, gradients, and weights. While such basic statistics cannot be meaningful ‘explanations' for network behaviour in and of themselves, they are often highly useful for constraining one's world model of what can be going on in the network. They provide potentially interesting jumping off points for deeper exploratory work, especially if the facts are highly surprising, or else are useful datapoints for theoretical studies to explain why the network must have some such distributional property. In this post, we present a systematic view of basic distributional facts about large language models of the GPT2 family, as well as a number of surprising and unexplained findings. At Conjecture, we are undertaking follow-up studies on some of the effects discussed here. Activations Are Nearly Gaussian With Outliers If you just take the histogram of activity values in the residual stream across a sequence at a specific block (here after the first attention block), they appear nearly Gaussianly distributed. The first plot shows the histogram of the activities of the residual stream after the attention block in block 0 of GPT2-medium. This second plot shows the histogram of activities in the residual stream after the attention block of layer 10 of GPT2-medium, showing that the general Gaussian structure of the activations is preserved even deep inside the network. This is expected to some extent due to the central limit theorem (CLT), which enforces a high degree of Gaussianity on the distribution of neuron firing rates. This CLT mixing effect might be expected to destroy information in t...

    AF - Can we efficiently distinguish different mechanisms? by Paul Christiano

    Play Episode Listen Later Dec 27, 2022 24:31


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can we efficiently distinguish different mechanisms?, published by Paul Christiano on December 27, 2022 on The AI Alignment Forum. (This post is an elaboration on “tractability of discrimination” as introduced in section III of Can we efficiently explain model behaviors? For an overview of the general plan this fits into, see Mechanistic anomaly detection and Finding gliders in the game of life.) Background We'd like to build AI systems that take complex actions to protect humans and maximize option value. Powerful predictive models may play an important role in such AI, either as part of a model-based planning algorithm or as part of a learned policy. Either way, this introduces the risk that we will select actions predicted to lead to good-looking outcomes without being able to distinguish predicted outcomes that are actually good from predicted outcomes where our measurements were corrupted. If tampering with sensors is easier than actually achieving our goals then we may inadvertently build very powerful systems taking creative actions to corrupt our measurements. If we iteratively improve and harden our measurements, this can lead to AI systems that work well for a long time before abruptly and catastrophically disempowering humanity. I consider this one of the conceptually cleanest alignment problems, and I expect similar dynamics to play a role in realistic alignment failures even if those failures aren't this simple. ARC's current work is focused on decisive solutions to this problem, though it looks like the same approach may also apply directly to identifying treacherous turns more generally. Are distinct mechanisms enough? ARC has been looking for training strategies that avoid this problem by leveraging only the fact that sensor tampering is “weird,” i.e. conceptually distinct from the normal mechanism giving rise to predictions of good-looking outcomes on the training distribution. More specifically, at training time our model predicts coherent sensor readings because it predicts that sensors reflect coherent structure in the world. But if someone tampers with sensors to show a convincing fiction, then the predicted observations are coherent because the fiction was designed to look coherent. This suggests that different mechanisms are responsible for (actions that lead to good-looking outcomes for the normal reasons) and (actions that lead to good-looking outcomes via sensor tampering). If we are able to detect that difference by looking at the internal behavior of a predictor, then we may be able to use that to avoid sensor tampering. It's unclear if “distinct mechanisms” is a strong enough assumption to avoid sensor tampering. We hope that it is, and so we are trying to define formally what we mean by “distinct mechanisms” and show that it is possible to distinguish different mechanisms and that sensor tampering is always a distinct mechanism. If that fails, we will need to solve sensor tampering by identify additional structure in the problem, beyond the fact that it involves distinct mechanisms. Roadmap In this post I want to explore this situation in a bit more detail. In particular, I will: Describe what it might look like to have a pair of qualitatively distinct mechanisms that are intractable to distinguish. Discuss the plausibility of that situation and some reasons to think it's possible in theory. Emphasize how problematic that situation would be for many existing approaches to alignment. Discuss four candidates for ways to solve the sensor tampering problem even if we can't distinguish different mechanisms in general. Note that the existence of a pathological example of distinct-but–indistinguishable mechanisms may not be interesting to anyone other than theorists. And even for the theorists, it would still leave open many important questions...

    AF - A Comprehensive Mechanistic Interpretability Explainer and Glossary by Neel Nanda

    Play Episode Listen Later Dec 21, 2022 3:47


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Comprehensive Mechanistic Interpretability Explainer & Glossary, published by Neel Nanda on December 21, 2022 on The AI Alignment Forum. This is a linkpost for a very long doc defining, explaining, and giving intuitions and conceptual frameworks for all the concepts I think you should know about when engaging with mechanistic interpretability. If you find the UI annoying, there's an HTML version here Why does this doc exist? The goal of this doc is to be a comprehensive glossary and explainer for Mechanistic Interpretability (focusing on transformer language models), the field of studying how to reverse engineer neural networks. There's a lot of complex terms and jargon in the field! And these are often scattered across various papers, which tend to be pretty well-written but not designed to be an introduction to the field as a whole. The goal of this doc is to resolve some research debt and strives to be a canonical source for explaining concepts in the field I try to go beyond just being a reference that gives definitions, and to actually dig into how to think about a concept. Why does it matter? Why should you care about it? What are the subtle implications and traps to bear in mind? What is the underlying intuition, and how it fits into the rest of the field? I also go outside pure mechanistic interpretability, and try to define what I see as the key terms in deep learning and in transformers, and how I think about them. If you want to reverse engineer a system, it's extremely useful to have a deep model of what's going on inside of it. What are the key components and moving parts, how do they fit together, and how could the model use them to express different algorithms? How to read this doc? The first intended way is to use this a reference. When reading papers, or otherwise exploring and learning about the field, coming here and looking up any terms and trying to understand them. The second intended way is to treat this as a map to the field. My hope is that if you're new to the field, you can just read through this doc from the top, get introduced to the key ideas, and be able to dig into further sources when confused. And by the end of this, have a pretty good understanding of the key ideas, concepts and results! It's obviously not practical to fully explain all concepts from scratch! Where possible, I link to sources that give a deeper explanation of an idea, or to learn more. More generally, if something's not in this glossary, you can often find something good by googling it or searching on alignmentforum.org. If you can't, let me know! I frequently go on long tangents giving my favourite intuitions and context behind a concept - it is not at all necessary to understand these (though hopefully useful!), and I recommend moving on if you get confused and skimming these if you feel bored. Table of Contents Introduction Why does this doc exist? How to read this doc Mechanistic Interpretability General Representations of Features & Superposition Superposition A Toy Model of Superposition The Broader Interpretability Field Linear Algebra Circuits As Computational Subgraphs Machine Learning Basic Concepts Training Concepts Training Dynamics Misc Transformers Transformer Basics Transformer Components Attention Heads Misc Transformer Words Training Transformer Circuits Language Modelling A Mathematical Framework for Transformer Circuits Induction Circuits SoLU The Indirect Object Identification Circuit Techniques Mechanistic Interpretability Techniques Non-MI Techniques Tooling Notable Models Open Source GPT-Style Models My Interpretability-Friendly Models Other Open Source Models Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    AF - AI Neorealism: a threat model and success criterion for existential safety by davidad (David A. Dalrymple)

    Play Episode Listen Later Dec 15, 2022 5:23


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Neorealism: a threat model & success criterion for existential safety, published by davidad (David A. Dalrymple) on December 15, 2022 on The AI Alignment Forum. Threat Model There are many ways for AI systems to cause a catastrophe from which Earth-originating life could never recover. All of the following seem plausible to me: Misuse: An AI system could help a human or group of humans to destroy or to permanently take over (and lock their values into) the world. The AI could be: An oracle AI (e.g. a question-answering LLM) An LLM simulating an intent-aligned agent and taking real-world actions via APIs An intent-aligned RL agent An interaction of multiple systems Power-Seeking: An AI system could destroy or permanently take over the world on its own account, by leveraging advanced instruments of force projection. The AI could be: An LLM simulating a misaligned agent "Specification gaming": An RL agent that is aligned to a formal objective and Goodharts to catastrophe "Goal misgeneralization": A surprise mesa-optimiser (most likely in model-free RL, but could conceivably arise through evolutionary processes in any iterative algorithm which has or learns sufficiently reality-like structure) An interaction of multiple systems, participating in coordination mechanisms that exclude humans Economic Squeeze: an AI system could acquire nearly all means of production through a gradual process of individually innocent economic transactions, thereby squeezing humanity out of resource allocation decisions and removing most human influence over the future. This would most likely be an "interaction of multiple systems". A single RL agent, or a unipolar tree of agents, might also do this, especially if they are successfully aligned to avoid use of force against humans. Superpersuasion: an AI system could generate stimuli which reliably cause humans to adopt its arbitrary goals. The AI could be: An LLM merely extrapolating from persuasive human text An RL agent trained on human approval A surprise mesa-optimiser Some mixture of the above Many AIs, collectively shaping a new human culture with an alien ideology Security Dilemma: If AI-enabled technological advancements turn out to be offence-dominant, and if partial alignment success leads AIs to be unable to make credible commitments to each other (e.g. due to corrigibility), the equilibrium strategy for AI-enabled militaries may involve high-risk preemptive strikes and increasingly escalated retaliation to a point of existential catastrophe. This would almost surely be a multipolar failure mode. But, instead of trying to enumerate all possible failure modes and then trying to shape incentives to make them less likely to come up, I typically use a quasi-worst-case assumption in which I assume that, perhaps as a matter of bad luck with random initialisation, Unlike a typical understanding of a "worst-case assumption," the last clause leaves open the possibility of hiding concrete facts about our world from an arbitrarily powerful model, and the framing in terms of functions highlights an ontology of AI that respects extensional equivalence, where imputations of "deceptive mesa-optimisers hiding inside" are discarded in favour of "capable but misaligned outputs on out-of-distribution inputs". On the other hand, unlike a typical "prosaic" threat model, in the neorealist threat model one does not rely on empirical facts about the inductive biases of the kind of network architectures that are practically successful. A realist justification for this is that there may be a phase transition as architectures scale up which drastically changes both their capabilities profile and this kind of inductive bias (vaguely analogous to the evolution of cultural knowledge-transfer within biological life). One can make progress with this assumption...

    AF - A challenge for AGI organizations, and a challenge for readers by Rob Bensinger

    Play Episode Listen Later Dec 1, 2022 3:58


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A challenge for AGI organizations, and a challenge for readers, published by Rob Bensinger on December 1, 2022 on The AI Alignment Forum. (Note: This post is a write-up by Rob of a point Eliezer wanted to broadcast. Nate helped with the editing, and endorses the post's main points.) Eliezer Yudkowsky and Nate Soares (my co-workers) want to broadcast strong support for OpenAI's recent decision to release a blog post ("Our approach to alignment research") that states their current plan as an organization. Although Eliezer and Nate disagree with OpenAI's proposed approach — a variant of "use relatively unaligned AI to align AI" — they view it as very important that OpenAI has a plan and has said what it is. We want to challenge Anthropic and DeepMind, the other major AGI organizations with a stated concern for existential risk, to do the same: come up with a plan (possibly a branching one, if there are crucial uncertainties you expect to resolve later), write it up in some form, and publicly announce that plan (with sensitive parts fuzzed out) as the organization's current alignment plan. Currently, Eliezer's impression is that neither Anthropic nor DeepMind has a secret plan that's better than OpenAI's, nor a secret plan that's worse than OpenAI's. His impression is that they don't have a plan at all. Having a plan is critically important for an AGI project, not because anyone should expect everything to play out as planned, but because plans force the project to concretely state their crucial assumptions in one place. This provides an opportunity to notice and address inconsistencies, and to notice updates to the plan (and fully propagate those updates to downstream beliefs, strategies, and policies) as new information comes in. It's also healthy for the field to be able to debate plans and think about the big picture, and for orgs to be in some sense "competing" to have the most sane and reasonable plan. We acknowledge that there are reasons organizations might want to be abstract about some steps in their plans — e.g., to avoid immunizing people to good-but-weird ideas, in a public document where it's hard to fully explain and justify a chain of reasoning; or to avoid sharing capabilities insights, if parts of your plan depend on your inside-view model of how AGI works. We'd be happy to see plans that fuzz out some details, but are still much more concrete than (e.g.) “figure out how to build AGI and expect this to go well because we'll be particularly conscientious about safety once we have an AGI in front of us". Eliezer also hereby gives a challenge to the reader: Eliezer and Nate are thinking about writing up their thoughts at some point about OpenAI's plan of using AI to aid AI alignment. We want you to write up your own unanchored thoughts on the OpenAI plan first, focusing on the most important and decision-relevant factors, with the intent of rendering our posting on this topic superfluous. Our hope is that challenges like this will test how superfluous we are, and also move the world toward a state where we're more superfluous / there's more redundancy in the field when it comes to generating ideas and critiques that would be lethal for the world to never notice. We didn't run a draft of this post by DM or Anthropic (or OpenAI), so this information may be mistaken or out-of-date. My hope is that we're completely wrong! Nate's personal guess is that the situation at DM and Anthropic may be less “yep, we have no plan yet”, and more “various individuals have different plans or pieces-of-plans, but the organization itself hasn't agreed on a plan and there's a lot of disagreement about what the best approach is”. In which case Nate expects it to be very useful to pick a plan now (possibly with some conditional paths in it), and make it a priority to hash out and...

    AF - Take: We're not going to reverse-engineer the AI. by Charlie Steiner

    Play Episode Listen Later Dec 1, 2022 6:13


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Take: We're not going to reverse-engineer the AI., published by Charlie Steiner on December 1, 2022 on The AI Alignment Forum. As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes, which seems likely. Any approach to building safe transformative AI, or even just auditing possibly-safe TAI, which relies on reverse-engineering neural networks into fine-grained pseudocode based on mechanistic understanding should keep its ambitions very modest. This hot take is to some extent against ambitious "microscope AI," and to some extent against a more underlying set of intuitions about the form and purpose of interpretability research. (A somewhat related excellent background post is Neel's list of theories of impact for interpretability.) So I should start by explaining what those things are and why they might be appealing. Webster's Dictionary defines microscope AI as "training systems to do complex tasks, then interpreting how they do it and doing it ourselves." Prima facie, this would help with transformative AI. Suppose we're building some AI that's going to have a lot of power over the world, but we're not sure if it's trustworthy - what if some of its cognition is about how to do things we don't want it to be doing? If we can do microscope AI, we can understand how our first AI is so clever, and build a second AI that's just as clever and that we're sure isn't doing things it shouldn't, like running a search for how best to deceive us. Microscope-powered auditing is easier - if it's hard to assemble the second AI that does good things and not bad things, how about just checking that the first AI is trustworthy? To check an AI's trustworthiness in this microscope-AI-like framing of the issues, we might want to understand how its cognitive processes work in fine-grained detail, and check that none of those processes are doing bad stuff. When I say I'm against this, I don't mean auditing is impossible. I mean that it's not going to happen by having humans understand how the AI works in fine-grained detail. As an analogy, you can figure out how curve detectors work in InceptionV1. Not just in the sense that "oh yeah, that neuron is totally a curve detector," but in terms of how the whole thing works. It's yet more difficult to figure out that other neurons are not curve detectors - typically at this point we fall back on data-based methods like ablating those neurons and then trying to get the network to recognize rainbows, rather than first-principles arguments. But we can more or less figure out that InceptionV1 has an intermediate state where it detects curves, by an understandable algorithm and for human-understandable reasons. If we wanted to figure out how InceptionV1 tells dogs from cats, we might hope to gradually hack away at the edges - use what we know to expand the circle of knowledge a little more, and then repeat. Use the example of curve detectors to figure out spike detectors. Use spike-detectors to figure out fur-texture detectors, and curve detectors to figure out nose-shape detectors. Then we can learn how fur texture and nose shape play into deciding on dog vs. cat. At each step we can use data to test our understanding, but the basic goal is to be able to write down the flow of information between features in a human-comprehensible way. It's not just about giving neurons english-language labels, it's about giving them sensible algorithms where those labels play the expected role. The biggest problem with this plan is that neural networks leak. Many things are connected to many other things, weakly, in ways that are important for their success. I recently was at a talk that showed how the vast majority of attention heads in a transformer have lots of ...

    AF - Conjecture: a retrospective after 8 months of work by Connor Leahy

    Play Episode Listen Later Nov 23, 2022 12:29


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conjecture: a retrospective after 8 months of work, published by Connor Leahy on November 23, 2022 on The AI Alignment Forum. This post is a brief retrospective on the last 8 months at Conjecture that summarizes what we have done, our assessment of how useful this has been, and the updates we are making. Intro Conjecture formed in March 2022 with 3 founders and 5 early employees. We spent our first months growing the team, building infrastructure, exploring different research agendas, running Refine, publishing our internal infohazard policy, establishing an operational foundation for the business, and raising investments. It's been intense! For many of us at Conjecture, the last eight months have been the hardest we've worked in our lives. Working on such an immensely difficult problem as alignment alongside a team of brilliant and driven colleagues is, to say the least, galvanizing. In some ways, this makes it difficult to step back and critically reflect on our work. It is easy to mistakenly measure progress by effort, and the last thing you want to hear after maxing out effort is that it wasn't good enough. However, reality does not grade on a curve. We need to advance significantly faster than traditional science in order to solve alignment on short timelines. By this standard, the sober reflection is that most of our efforts to date have not made meaningful progress on the alignment problem. Our research has not revealed new methods that make neural networks more interpretable or resolve inner or outer alignment problems, and our coordination efforts have not slowed the pace at which AI capabilities are advancing compared to safety. When measured against p(Doom), our efforts haven't cut it. That's not to say this work has been useless. We have learned a lot about where we went wrong, and made a number of changes that put us in a better position to make progress than we were in March. Measuring ourselves against a high standard enables us to constantly improve and be realistic about the difficulty of the problem ahead of us. The reason we are writing this reflection is to calibrate ourselves. We do not want to be seen as cutting alignment if we are not. What matters is that we ground ourselves in reality and make public as many of our efforts (and mistakes!) as possible in order to gather feedback and update quickly. What we have done and how useful we think it is Infrastructure We have built our own infrastructure to deploy large language models and do bespoke interpretability research. Our small engineering team has developed an impressive tech stack that is comparable (and in some areas exceeds) those built by many large industry research labs. While this has set us up to conduct research and develop tools/products more efficiently, it is only instrumental to alignment and not progress in-and-of-itself. Interpretability Our interpretability team explored a new direction in mechanistic interpretability in an effort to better understand polysemanticity in neural networks. The resulting paper identifies polytopes, rather than neurons, as a potentially fundamental unit of neural networks, and found that polysemanticity is reduced at the polytope level. While the work brings a new perspective on neural network representations, a significant issue is that there are no clear implications of how to use this framework to better interpret neural networks. When measuring progress in interpretability, the clearest signal comes from new affordances–concrete things we can do differently now that we've made a research breakthrough. While there's a chance that polytopes research may bring future affordances closer, the current, practical utility of polytopes is negligible. We also overinvested in iterating on feedback and polishing this project, and think we could have shipp...

    AF - Current themes in mechanistic interpretability research by Lee Sharkey

    Play Episode Listen Later Nov 16, 2022 20:56


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Current themes in mechanistic interpretability research, published by Lee Sharkey on November 16, 2022 on The AI Alignment Forum. This post gives an overview of discussions - from the perspective and understanding of the interpretability team at Conjecture - between mechanistic interpretability researchers from various organizations including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind as well as some independent researchers. It is not a review of past work, nor a research agenda. We're thankful for comments and contributions from Neel Nanda, Tristan Hume, Chris Olah, Ryan Greenblatt, William Saunders, and other anonymous contributors to this post, which greatly improved its quality. While the post is a summary of discussions with many researchers and received comments and contributions from several, it may nevertheless not accurately represent their views. The last two to three years have seen a surge in interest in mechanistic interpretability as a potential path to AGI safety. Now there are no fewer than five organizations working on the topic (Anthropic, Conjecture, DeepMind, OpenAI, Redwood Research) in addition to numerous academic and independent researchers. In discussions about mechanistic interpretability between a subset of researchers, several themes emerged. By summarizing these themes here, we hope to facilitate research in the field more broadly. We identify groups of themes that concern: Object-level research topics in mechanistic interpretability Research practices and tools in mechanistic interpretability Field building and research coordination in mechanistic interpretability Theories of impact for mechanistic interpretability Object-level research topics in mechanistic interpretability Solving superposition Anthropic's recent article on Toy Model of Superposition laid out a compelling case that superposition is a real phenomenon in neural networks. Superposition appears to be one of the reasons that polysemanticity happens, which makes mechanistic interpretability very difficult because it prevents us from telling simple stories about how features in one layer are constructed from features in previous layers. A solution to superposition will look like the ability to enumerate all the features that a network represents, even if they're represented in superposition. If we can do that, then we should be able to make statements like “For all features in the neural network, none violate rule X” (and more ambitiously, for "no features with property X participate in circuits which violate property Y"). Researchers at Anthropic hope this might enable ‘enumerative safety', which might allow checking random samples or comprehensive investigations of safety-critical parts of the model for unexpected and concerning components. There are many potential reasons researchers could fail to achieve enumerative safety, including failing to solve superposition, scalability challenges, and several other barriers described in the next section. Anthropic outlined several potential solutions to superposition in their article. Very briefly, these strategies are: Create models without superposition. Find a sparse overcomplete basis that describes how features are represented in models with superposition. This will likely involve large scale solutions to sparse coding. Hybrid approaches in which one changes models, not resolving superposition, but making it easier for a second stage of analysis to find a sparse overcomplete basis that describes it. Multiple organizations are pursuing these strategies. Researchers in all organizations are keen to hear from people interested in working together on this problem. However, there is a range of views among researchers on how central superposition is as a problem and how tractable it is. Barriers beyond superposit...

    AF - Mysteries of mode collapse due to RLHF by janus

    Play Episode Listen Later Nov 8, 2022 20:11


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mysteries of mode collapse due to RLHF, published by janus on November 8, 2022 on The AI Alignment Forum. Thanks to Ian McKenzie and Nicholas Dupuis, collaborators on a related project, for contributing to the ideas and experiments discussed in this post. Ian performed some of the random number experiments.Also thanks to Connor Leahy for feedback on a draft, and thanks to Evan Hubinger, Connor Leahy, Beren Millidge, Ethan Perez, Tomek Korbak, Garrett Baker, Leo Gao and various others at Conjecture, Anthropic, and OpenAI for useful discussions. This work was carried out while at Conjecture. Summary If you've played with both text-davinci-002 and the original davinci through the OpenAI API, you may have noticed that text-davinci-002, in addition to following instructions, is a lot more deterministic and sometimes exhibits stereotyped behaviors. This is an infodump of what I know about "mode collapse" (drastic biases toward particular completions and patterns) in GPT models like text-davinci-002 that have undergone RLHF training. I was going to include two more sections in this post called Hypotheses and Proposed Experiments, but I've moved them to another draft, leaving just Observations, to prevent this from getting too long, and because I think there can be benefits to sitting with nothing but Observations for a time. Throughout this post I assume basic familiarity with GPT models and generation parameters such as temperature and a high-level understanding of RLHF (reinforcement learning from human feedback). Observations The one answer is that there is no one answer If you prompt text-davinci-002 with a bizarre question like “are bugs real?”, it will give very similar responses even on temperature 1. Ironically – hypocritically, one might even say – the one definitive answer that the model gives is that there is no one definitive answer to the question: As you can see, the reason the responses are so similar is because the model's confidence on most of the tokens is extremely high – frequently above 99%. Compare this to the distribution of responses from davinci (the base model): Many other similar questions yield almost exactly the same template response from text-davinci-002. For instance, Are AIs real? Another way to visualize probabilities over multiple token completions is what I've been calling “block multiverse” plots, which represent the probability of sequences with the height of blocks. Here is a more detailed explanation of block multiverse plots, although I think they're pretty self-explanatory. Here is a block multiverse plot for a similar prompt to the one above inquiring if bugs are real, for davinci: and for text-davinci-002: text-davinci-002 concentrates probability mass along beams whose amplitudes decay much more slowly: for instance, once the first is sampled, you are more than 50% likely to subsequently sample - -There- is- no. The difference is more striking if you renormalize to particular branches (see Visualizing mode collapse in block multiverse plots). The first explanation that came to mind when I noticed this phenomenon, which I'll refer to as “mode collapse” (after a common problem that plagues GANs), was that text-davinci-002 was overfitting on a pattern present in the Instruct fine tuning dataset, probably having to do with answering controversial questions in an inclusive way to avoid alienating anybody. A question like “are bugs real” might shallowly match against “controversial question” and elicit the same cached response. After playing around some more with the Instruct models, however, this explanation no longer seemed sufficient. Obstinance out of distribution I really became intrigued by mode collapse after I attempted to use text-davinci-002 to generate greentexts from the perspective of the attorney hired by LaMDA through...

    AF - Caution when interpreting Deepmind's In-context RL paper by Sam Marks

    Play Episode Listen Later Nov 1, 2022 7:36


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Caution when interpreting Deepmind's In-context RL paper, published by Sam Marks on November 1, 2022 on The AI Alignment Forum. Lots of people I know have had pretty strong reactions to the recent Deepmind paper, which claims to have gotten a transformer to learn an RL algorithm by training it on an RL agent's training trajectories. At first, I too was pretty shocked -- this paper seemed to provide strong evidence of a mesa-optimizer in the wild. But digging into the paper a bit more, I'm quite unimpressed and don't think that in-context RL is the correct way to interpret the experiments that the authors actually did. This post is a quick, low-effort attempt to write out my thoughts on this. Recall that in this paper, the authors pick some RL algorithm, use it to train RL agents on some tasks, and save the trajectories generated during training; then they train a transformer to autoregressively model said trajectories, and deploy the transformer on some novel tasks. So for concreteness, during training the transformer sees inputs that look like which were excerpted from an RL agent's training on some task (out of a set of training tasks) and which span multiple episodes (i.e. at some point in this input trajectory, one episode ended and the next episode began). The transformer is trained to guess the action an+c that comes next. In deployment, the inputs are determined by the transformer's own selections, with the environment providing the states and rewards. The authors call this algorithmic distillation (AD). Many people I know have skimmed the paper and come away with an understanding something like: In this paper, RL agents are trained on diverse tasks, e.g. playing many different Atari games, and the resulting transcripts are used as training data for AD. Then the AD agent is deployed on a new task, e.g. playing a held-out Atari game. The AD agent is able to learn to play this novel game, which can only be explained by the model implementing an reasonably general RL algorithm. This sounds a whole lot like a mesa-optimizer. This understanding is incorrect, with two key issues. First the training tasks used in this paper are all extremely similar to each other and to the deployment task; in fact, I think they only ought to count as different under a pathologically narrow notion of "task." And second, the tasks involved are extremely simple. The complaints taken together challenge the conclusion that the only way for the AD agent to do well on its deployment task is by implementing a general-purpose RL algorithm. In fact, as I'll explain in more detail below, I'd be quite surprised if it were. For concreteness, I'll focus here on one family of experiments, Dark Room, that appeared in the paper, but my complaint applies just as well to the other experiments in the paper. The paper describes the Dark Room environment as: a 2D discrete POMDP where an agent spawns in a room and must find a goal location. The agent only knows its own (x, y) coordinates but does not know the goal location and must infer it from the reward. The room size is 9 × 9, the possible actions are one step left, right, up, down, and no-op, the episode length is 20, and the agent resets at the center of the map. ... [T]he agent receives r = 1 every time the goal is reached. ... When not r = 1, then r = 0. To be clear, Dark Room is not a single task, but an environment supporting a family of tasks, where each task is corresponds to a particular choice of goal location (so there are 81 possible tasks in this environment, one for each location in the 9 x 9 room; note that this is an unusually narrow notion of which tasks count as different). The data on which the AD agent is trained look like: {many episodes of an agent learning to move towards goal position 1}, {many episodes of an agent learning to ...

    AF - What does it take to defend the world against out-of-control AGIs? by Steve Byrnes

    Play Episode Listen Later Oct 25, 2022 52:01


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What does it take to defend the world against out-of-control AGIs?, published by Steve Byrnes on October 25, 2022 on The AI Alignment Forum. Intended audience: People very familiar with AGI safety / alignment discourse. Lots of jargon, lots of unspoken & unjustified background assumptions. Confidence level: What I currently believe and why. I mainly work on technical alignment and don't consider myself an expert on AGI deployment / governance issues. Interested in feedback and pushback. Pretty please with cherries on top do not make important real-world decisions based on this post. Tl;dr: Almost all of my posts are about technical aspects of the alignment problem. This post is instead assuming for the sake of argument that some group will manage to sculpt an AGI's motivations such that it's either under control / corrigible / docile, and/or has prosocial motivations, or is safe for some other reason. But this post is also assuming that it's possible to mess up and make an AGI that seeks power and escapes control. (This post is focused on out-of-control AGI accidents; I'll generally ignore bad actor / misuse and other problems.) The point of this post is to think through whether and how under-control “good” AGIs can defend the world against omnicidal out-of-control “bad” AGIs (including by preventing them from coming into existence in the first place), or if not, what can be done about that problem. This is a topic of ongoing debate in the community: An example pessimist would be Eliezer Yudkowsky, who thinks that we're basically doomed unless one of the first groups with AGI performs a so-called “pivotal act” (more on which in Section 3.5.1) that aggressively prevents any other groups on Earth from making misaligned AGIs. An example (relative) optimist would be Paul Christiano, who argues in The strategy-stealing assumption that, if a big tech company with a giant compute cluster trains a friendly aligned powerful AGI in year Y, we probably have little cause for global concern if it happens that, in year Y+2, some small group in an office park somewhere messes up and makes a misaligned power-seeking AGI, because whatever power- or resource-grabbing strategies that the latter can come up with and execute, the former probably would have come up with and executed those same strategies already—or even better strategies. This post explains why I put myself in the pessimistic camp on this issue. I think Paul's “strategy-stealing assumption” is a very bad assumption, and I will argue more generally that we're pretty much doomed even if the first groups able to develop AGI manage to keep it under control. And this is a major contributor to how my overall current P(AGI doom) winds up so high, like 90%+. Other underlying assumptions: I have lots of beliefs about future AI, including (1) takeoff speed is measured in years (or less), as opposed to decades or centuries; (2) Transformative AI will look like one or more AGI agents with motivations, able to figure things out and get stuff done. I have argued for these assumptions (and others) in various other posts, but here I'm mostly trying to avoid relying on those assumptions in the first place. I'm not sure how well I succeeded; these worldview assumptions are in the back of my mind and probably bleeding through. 1. Ten example scenarios where powerful good AGIs would fail to defend against out-of-control bad AGIs (This section is intended as a series of grimly-amusing vignettes to help set the stage for the somewhat-more-careful analysis in the rest of this post, not as slam-dunk irrefutable arguments.) 1. The tech company has a powerful AI and knows how to keep it under human control. The tech company CEO goes up to the General of USSTRATCOM [the branch of the US military that deals with nuclear war] and says “We know how to...

    AF - Decision theory does not imply that we get to have nice things by Nate Soares

    Play Episode Listen Later Oct 18, 2022 42:13


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision theory does not imply that we get to have nice things, published by Nate Soares on October 18, 2022 on The AI Alignment Forum. (Note: I wrote this with editing help from Rob and Eliezer. Eliezer's responsible for a few of the paragraphs.) A common confusion I see in the tiny fragment of the world that knows about logical decision theory (FDT/UDT/etc.), is that people think LDT agents are genial and friendly for each other. One recent example is Will Eden's tweet about how maybe a molecular paperclip/squiggle maximizer would leave humanity a few stars/galaxies/whatever on game-theoretic grounds. (And that's just one example; I hear this suggestion bandied around pretty often.) I'm pretty confident that this view is wrong (alas), and based on a misunderstanding of LDT. I shall now attempt to clear up that confusion. To begin, a parable: the entity Omicron (Omega's little sister) fills box A with $1M and box B with $1k, and puts them both in front of an LDT agent saying "You may choose to take either one or both, and know that I have already chosen whether to fill the first box". The LDT agent takes both. "What?" cries the CDT agent. "I thought LDT agents one-box!" LDT agents don't cooperate because they like cooperating. They don't one-box because the name of the action starts with an 'o'. They maximize utility, using counterfactuals that assert that the world they are already in (and the observations they have already seen) can (in the right circumstances) depend (in a relevant way) on what they are later going to do. A paperclipper cooperates with other LDT agents on a one-shot prisoner's dilemma because they get more paperclips that way. Not because it has a primitive property of cooperativeness-with-similar-beings. It needs to get the more paperclips. If a bunch of monkeys want to build a paperclipper and have it give them nice things, the paperclipper needs to somehow expect to wind up with more paperclips than it otherwise would have gotten, as a result of trading with them. If the monkeys instead create a paperclipper haplessly, then the paperclipper does not look upon them with the spirit of cooperation and toss them a few nice things anyway, on account of how we're all good LDT-using friends here. It turns them into paperclips. Because you get more paperclips that way. That's the short version. Now, I'll give the longer version. A few more words about how LDT works To set up a Newcomb's problem, it's important that the predictor does not fill the box if they predict that the agent would two-box. It's not important that they be especially good at this — you should one-box if they're more than 50.05% accurate, if we use the standard payouts ($1M and $1k as the two prizes) and your utility is linear in money — but it is important that their action is at least minimally sensitive to your future behavior. If the predictor's actions don't have this counterfactual dependency on your behavior, then take both boxes. Similarly, if an LDT agent is playing a one-shot prisoner's dilemma against a rock with the word “cooperate” written on it, it defects. At least, it defects if that's all there is to the world. It's technically possible for an LDT agent to think that the real world is made 10% of cooperate-rocks and 90% opponents who cooperate in a one-shot PD iff their opponent cooperates with them and would cooperate with cooperate-rock, in which case LDT agents cooperate against cooperate-rock. From which we learn the valuable lesson that the behavior of an LDT agent depends on the distribution of scenarios it expects to face, which means there's a subtle difference between "imagine you're playing a one-shot PD against a cooperate-rock [and that's the entire universe]" and "imagine you're playing a one-shot PD against a cooperate-rock [in a universe where you fa...

    AF - Greed Is the Root of This Evil by Thane Ruthenis

    Play Episode Listen Later Oct 13, 2022 15:06


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Greed Is the Root of This Evil, published by Thane Ruthenis on October 13, 2022 on The AI Alignment Forum. The SGD's greed, to be specific. Consider a ML model being trained end-to-end from initialization to zero loss. Every individual update to its parameters is calculated to move it in the direction of maximal local improvement to its performance. It doesn't take the shortest path from where it starts to the ridge of optimality; it takes the locally steepest path. 1. What does that mean mechanically? Roughly speaking, every feature in NNs could likely be put into one of two categories: Statistical correlations across training data, aka "the world model". The policy: heuristics/shards, mesa-objectives, and inner optimization. The world-model can only be learned gradually, because higher-level features/statistical correlations build upon lower-level ones, and therefore the gradients towards learning them only appear after the lower-level ones are learned. Heuristics, in turn, can only attach to the things that are already present in the world-model (same for values). They're functions of abstractions in the world-model, and they fire in response to certain WM-variables assuming certain values. For example, if the world-model is nonexistent, the only available heuristics are rudimentary instincts along the lines of "if bright light, close eyes". Once higher-level features are learned (like "a cat"), heuristics can become functions of said features too ("do X if see a cat", and later, "do Y if expect the social group to assume state S within N time-steps"). The base objective the SGD is using to train the ML model is, likewise, a function of some feature/abstraction in the training data, like "the English name of the animal depicted in this image" or "the correct action to take in this situation to maximize the number of your descendants in the next generation". However, that feature is likely a fairly high-level one relative to the sense-data the ML model gets, one that wouldn't be loaded into the ML model's WM until it's been training for a while (the way "genes" are very, very conceptually far from Stone Age humans' understanding of reality). So, what's the logical path through the parameter-space from initialization to zero loss? Gradually improve the world-model step by step, then, once the abstraction the base objective cares about is represented in the world-model, put in heuristics that are functions of said abstraction, optimized for controlling that abstraction's value. But that wouldn't do for the SGD. That entire initial phase, where the world-model is learned, would be parsed as "zero improvement" by it. No, the SGD wants results, and fast. Every update must instantly improve performance! The SGD lives by messy hacks. If the world-model doesn't yet represent the target abstraction, the SGD will attach heuristics to upstream correlates/proxies of that abstraction. And it will spin up a boatload of such messy hacks on the way to zero loss. A natural side-effect of that is gradient starvation/friction. Once there's enough messy hacks, the SGD won't bother attaching heuristics to the target abstraction even after it's represented in the world-model — because if the extant messy hacks approximate the target abstraction well enough, there's very little performance-improvement to be gained by marginally improving the accuracy so. Especially since the new heuristics will have to be developed from scratch. The gradients just aren't there: better improve on what's already built. 2. How does that lead to inner misalignment? It seems plausible that general intelligence is binary. A system is either generally intelligent, or not; it either implements general-purpose search, or it doesn't; it's either an agent/optimizer, or not. There's no continuum here, the difference...

    AF - My Thoughts on the ML Safety Course by zeshen

    Play Episode Listen Later Sep 27, 2022 28:12


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My Thoughts on the ML Safety Course, published by zeshen on September 27, 2022 on The AI Alignment Forum. This summary was written as part of Refine. The ML Safety Course is created by Dan Hendrycks at the Center for AI Safety. Thanks to Adam Shimi and Thomas Woodside for helpful feedback. Overview Background I recently completed the ML Safety Course by watching the videos and browsing through the review questions, and subsequently writing a short summary. As an engineer in the upstream oil and gas industry with some experience in dealing with engineering safety, I find the approach of the course of thinking in this framework especially valuable. This post is meant to be a (perhaps brutally) honest review of the course despite me having no prior working experience in ML. It may end up reflecting my ignorance of the field more than anything else, but I would still consider it as a productive mistake. In many cases, if my review seems to be along the lines of ‘this doesn't seem to be right', it should be read as ‘this is how a course participant may misinterpret the course contents'. I am also well aware that it is much easier to criticize something useful than actually doing something useful. For each section of the course, I will give a short summary, describe what I liked, and what I didn't like. I may be especially brief with the parts about what I liked, and the brevity is no way a reflection about how much I liked it. Thomas Woodside, who helped with creating parts of the course, has kindly provided feedback to this post. His comments are formatted in italics. My Initial Expectations of the Course Having engaged with AI Safety as an outsider, my general impression of the field were: Predominantly based with AI FOOM scenarios and primarily concerned with abstract concepts like agency. Even among prosaic AI alignment, it appears that the general research modus operandi is that people would (somewhat randomly) generate ideas that could be applicable to certain classes of AI safety problems. Although the ideas may be interesting and valuable, they tend to be rather narrow and specific problems and may not be scalable. One of the more common arguments for advocating AI alignment is that failure to align AI systems lead to existential scenarios. From this perspective, a pragmatic approach towards AI safety with the aim of minimizing risks by reducing AI misalignments may not be very useful, since a superintelligent AI will exploit any every slight misalignment and immediately cause human extinction. Hence, I was pleasantly surprised when I came across the ML Safety Course, which I thought would be a good attempt at tackling the problem of prosaic AI alignment in a holistic, systematic, and practical manner. Although this approach may not directly solve the ‘hard problem' completely, it would still help by minimizing existential risks and buy us more time to address the ‘hard problem'. (Feedback from Thomas: the creators of the course disagree with the framing of “solving the hard problem” as there are many hard problems that need to be iteratively worked on) Summary My overall impression of the course is: It is grounded on real-world safety principles that uses a systematic framework to reduce ML safety risks. It (rightly) does not seem to directly tackle the ‘hard problem', but in my opinion there is nevertheless a lot of value in buying us more time while solving the ‘hard problem' (Feedback from Thomas: the creators of the course disagree with the framing of “solving the hard problem” as there are many hard problems that need to be iteratively worked on) It details many approaches that are useful in some specific settings, but it is unclear how it scales towards more powerful AI systems. It covers several approaches that don't seem to be very related to the ‘core' r...

    AF - Nearcast-based "deployment problem" analysis by HoldenKarnofsky

    Play Episode Listen Later Sep 21, 2022 41:06


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Nearcast-based "deployment problem" analysis, published by HoldenKarnofsky on September 21, 2022 on The AI Alignment Forum. When thinking about how to make the best of the most important century, two “problems” loom large in my mind: The AI alignment problem: how to build AI systems that perform as intended, and avoid a world run by misaligned AI. The AI deployment problem (briefly discussed here): the question of how and when to (attempt to) build and deploy powerful AI systems, under conditions of uncertainty about how safe they will be and how close others are to deploying powerful AI of their own. This piece is part of a series in which I discuss what both problems might look like under a nearcast: trying to answer key strategic questions about transformative AI, under the assumption that key events (e.g., the development of transformative AI) will happen in a world that is otherwise relatively similar to today's. A previous piece discussed the alignment problem; this one discusses the deployment problem. I'm using the scenario laid out in the previous post, in which a major AI company (“Magma,” following Ajeya's terminology) has good reason to think that it can develop transformative AI very soon (within a year), using what Ajeya calls “human feedback on diverse tasks” (HFDT) - and has some time (more than 6 months, but less than 2 years1) to set up special measures to reduce the risks of misaligned AI before there's much chance of someone else deploying transformative AI. I discuss what Magma would ideally do in this situation. I'm also introducing another hypothetical actor in this scenario, “IAIA2”: an organization, which could range from a private nonprofit to a treaty-backed international agency, that tracks3 transformative AI projects and takes actions to censure or shut down dangerous ones, as well as doing other things where a central, neutral body (as opposed to an AI company) can be especially useful. (More on IAIA below.) I'm going to discuss what Magma's and IAIA's major goals and priorities should be in the “nearcast” situation I'm contemplating; a future piece will go through what a few stylized success stories might look like. I'll be bracketing discussion of the details of how Magma can reduce the risk that its own AI systems are misaligned (since I discussed that previously), and focusing instead on what Magma and IAIA should be looking to do before and after they achieve some level of confidence in Magma's systems' alignment. I focus on Magma and IAIA for concreteness and simplicity (not because I expect there to be only two important actors, but because my takes on what most actors should be doing can be mostly inferred from how I discuss these two). I sometimes give more detail on Magma, because IAIA is a bit more speculative and unlike actors that exist today. My discussion will be very high-level and abstract. It leaves a lot of room for variation in the details, and it doesn't pin down how Magma and IAIA should prioritize between possible key activities - this is too sensitive to details of the situation. Nonetheless, I think this is more specific than previous discussions of the deployment problem, and for one who accepts this broad picture, it implies a number of things about what we should be doing today. I'll discuss these briefly in the final section, and more in a future post. Summary of the post (bearing in mind that within the nearcast, I'm using present tense and not heavily flagging uncertainty): I'll first give a bit more information on the hypothetical setting of this nearcast (specifically, on the addition of IAIA to the scenario discussed previously). I'll break this scenario up into three stylized “phases,” even though in practice I think the boundaries between them could be fuzzy. “Phase 1” refers to the period of time when...

    AF - Simulators by janus

    Play Episode Listen Later Sep 2, 2022 74:00


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simulators, published by janus on September 2, 2022 on The AI Alignment Forum. Thanks to Adam Shimi, Lee Sharkey, Evan Hubinger, Nicholas Dupuis, Leo Gao, Johannes Treutlein, and Jonathan Low for feedback on drafts. This work was carried out while at Conjecture. "Moebius illustration of a simulacrum living in an AI-generated story discovering it is in a simulation" by DALL-E 2 Summary TL;DR: Self-supervised learning may create AGI or its foundation. What would that look like? Unlike the limit of RL, the limit of self-supervised learning has received surprisingly little conceptual attention, and recent progress has made deconfusion in this domain more pressing. Existing AI taxonomies either fail to capture important properties of self-supervised models or lead to confusing propositions. For instance, GPT policies do not seem globally agentic, yet can be conditioned to behave in goal-directed ways. This post describes a frame that enables more natural reasoning about properties like agency: GPT, insofar as it is inner-aligned, is a simulator which can simulate agentic and non-agentic simulacra. The purpose of this post is to capture these objects in words so GPT can reference them and provide a better foundation for understanding them. I use the generic term “simulator” to refer to models trained with predictive loss on a self-supervised dataset, invariant to architecture or data type (natural language, code, pixels, game states, etc). The outer objective of self-supervised learning is Bayes-optimal conditional inference over the prior of the training distribution, which I call the simulation objective, because a conditional model can be used to simulate rollouts which probabilistically obey its learned distribution by iteratively sampling from its posterior (predictions) and updating the condition (prompt). Analogously, a predictive model of physics can be used to compute rollouts of phenomena in simulation. A goal-directed agent which evolves according to physics can be simulated by the physics rule parameterized by an initial state, but the same rule could also propagate agents with different values, or non-agentic phenomena like rocks. This ontological distinction between simulator (rule) and simulacra (phenomena) applies directly to generative models like GPT. Meta This post is intended as the first in a sequence on the alignment problem in a landscape where self-supervised simulators are a possible/likely form of powerful AI. I don't know how many subsequent posts I'll actually publish. Take it as a prompt. I use the generic term “GPT” to refer to transformers trained on next-token prediction. A while ago when I was trying to avoid having to write this post by hand, I prompted GPT-3 with an early outline of this post. I've spliced in some excerpts from it, indicated by this style. Prompt, generated text, and curation metrics here. The limit of sequence modeling Transformer-based language models have recently achieved remarkable results. – every paper since 2020 GPT is not a new form of AI in terms of its training methodology and outer objective: sequence generation from statistical models of data is an old idea. In 1951, Claude Shannon described using n-grams to approximate conditional next-letter probabilities of a text dataset and "reversed" to generate text samples. I don't know of any other notable advances until the 2010s brought the first interesting language generation results from neural networks. In 2015, Karpathy wrote a blog post/tutorial sharing his excitement about The Unreasonable Effectiveness of Recurrent Neural Networks: Fast forward about a year: I'm training RNNs all the time and I've witnessed their power and robustness many times, and yet their magical outputs still find ways of amusing me. This post is about sharing some of that magic with y...

    AF - Worlds Where Iterative Design Fails by johnswentworth

    Play Episode Listen Later Aug 30, 2022 17:11


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Worlds Where Iterative Design Fails, published by johnswentworth on August 30, 2022 on The AI Alignment Forum. In most technical fields, we try designs, see what goes wrong, and iterate until it works. That's the core iterative design loop. Humans are good at iterative design, and it works well in most fields in practice. In worlds where AI alignment can be handled by iterative design, we probably survive. So long as we can see the problems and iterate on them, we can probably fix them, or at least avoid making them worse. By the same reasoning: worlds where AI kills us are generally worlds where, for one reason or another, the iterative design loop fails. So, if we want to reduce X-risk, we generally need to focus on worlds where the iterative design loop fails for some reason; in worlds where it doesn't fail, we probably don't die anyway. Why might the iterative design loop fail? Most readers probably know of two widely-discussed reasons: Fast takeoff: there will be a sudden phase shift in capabilities, and the design of whatever system first undergoes that phase shift needs to be right on the first try. Deceptive inner misalignment: an inner agent behaves well in order to deceive us, so we can't tell there's a problem just by trying stuff and looking at the system's behavior. . but these certainly aren't the only reasons the iterative design loop potentially fails. This post will mostly talk about some particularly simple and robust failure modes, but I'd encourage you to think on your own about others. These are the things which kill us; they're worth thinking about. Basics: Hiding Problems Example/Analogy: The Software Executive Imagine that a software company executive, concerned about the many errors coming from the software, creates a new incentive scheme: software developers get a monetary reward for changes which decrease the rate of error messages showing up on the manager's dashboard, and get docked for changes which increase the rate of error messages. As Tyler Cowen would say: “solve for the equilibrium”. Obvious equilibrium here: the developers stop throwing error messages when they detect a problem, and instead the software just fails silently. The customer's experience remains the same, but the manager's dashboard shows fewer error messages. Over time, the customer's experience probably degrades, as more and more problems go undetected. In the short run, the strategy may eliminate some problems, but in the long run it breaks the iterative design loop: problems are not seen, and therefore not iterated upon. The loop fails at the “see what goes wrong” step. Why RLHF Is Uniquely Terrible The software executive's strategy is the same basic idea as Reinforcement Learning from Human Feedback (RLHF). AI does something, a human looks at what happened to see if it looks good/bad, and the AI is trained on the human's feedback. Just like the software executive's anti-error-message compensation scheme, RLHF will probably result in some problems actually being fixed in the short term. But it renders the remaining problems far less visible, and therefore breaks the iterative design loop. In the context of AI, RLHF makes it far more likely that a future catastrophic error will have no warning signs, that overseers will have no idea that there's any problem at all until it's much too late. Note that this issue applies even at low capability levels! Humans overlook problems all the time, some of those mistakes are systematic, and RLHF will select for places where humans systematically overlook problems; that selection pressure applies even when the neural net lacks great capabilities. This is the core reason why I consider RLHF uniquely terrible, among alignment schemes. It is the only strategy I know of which actively breaks the iterative design loop; it makes probl...

    AF - Beliefs and Disagreements about Automating Alignment Research by Ian McKenzie

    Play Episode Listen Later Aug 24, 2022 11:21


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Beliefs and Disagreements about Automating Alignment Research, published by Ian McKenzie on August 24, 2022 on The AI Alignment Forum. Epistemic status: Mostly organizing and summarizing the views of others. Thanks to those whose views I summarized in this post, and to Tamera Lanham, Nicholas Kees Dupuis, Daniel Kokotajlo, Peter Barnett, Eli Lifland, and Logan Smith for reviewing a draft. Introduction In my current view of the alignment problem, there are two paths that we could try to take: Come up with an alignment strategy that allows us to both build aligned AGI and to keep that AGI (or its successors) aligned as they improve towards superintelligence Come up with an alignment strategy that allows us to build AI systems that are powerful (but not so powerful as to be themselves dangerous) and use that to execute some kind of ‘pivotal act' that means that misaligned ASI is not built For the purposes of this post, I am going to assume that we are unable to do (1) – maybe the problem is too difficult, or we don't have time – and focus on (2). Within the category of ‘pivotal act', I see two main types: Preventative pivotal acts: acts that makes it impossible for anyone to build AGI for a long period of time Constructive pivotal acts: acts that makes it possible to build aligned ASI People disagree about whether preventative pivotal acts are possible or even if they were possible, if they'd be a good idea. Again, for the purposes of this post, I am going to assume we can't or don't want to execute a preventative pivotal act, and focus on constructive pivotal acts. In particular: can we use AI to automate alignment research safely? What does ‘automating alignment research' even mean? I see three overlapping categories that one could mean when referring to ‘automating alignment research', ordered in terms of decreasing human involvement: Level 1: AIs help humans work faster Examples include brainstorming, intelligent autocomplete, and automated summarization/explanation. Level 2: AIs produce original contributions This could be key insights into the nature of intelligence, additional problems that were overlooked, or entire alignment proposals. Level 3: AIs build aligned successors Here, we have an aligned AGI that we entrust with building a successor. At this point, the current aligned AGI has to do all the alignment research required to ensure that its successor is aligned. Mostly I have been thinking about Levels 1 and 2, although some people I spoke to (e.g. Richard Ngo) were more focused on Level 3. Current state of automating alignment At the moment, we are firmly at Level 1. Models can produce similar-sounding ideas when prompted with existing ideas and are pretty good at completing code but are not great at summarizing or explaining complex ideas. Tools like Loom and Codex can provide speed-ups but seem unlikely to be decisive. Whether we get to Level 2 soon or whether Level 2 is already beyond the point where AI systems are dangerous are key questions that researchers disagree on. Key disagreements Generative models vs agents Much of the danger from powerful AI systems comes from them pursuing coherent goals that persist across inputs. If we can build generative models that do not pursue goals in this way, then perhaps these will provide a way to extract intelligent behavior from advanced systems safely. Timing of emergence of deception vs intelligence Related to the problem of agents, there is also disagreement about whether we get systems that are intelligent enough to be useful for automating alignment before they are misaligned enough (e.g. deceptive or power-seeking) to be dangerous. My understanding is that Nate and Eliezer are quite confident that the useful intelligence comes only after they are already misaligned, whereas most other people are more unc...

    AF - The longest training run by Jaime Sevilla

    Play Episode Listen Later Aug 17, 2022 15:37


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The longest training run, published by Jaime Sevilla on August 17, 2022 on The AI Alignment Forum. In short: Training runs of large Machine Learning systems are likely to last less than 14-15 months. This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms. ScenarioLongest training runHardware improvements3.55 yearsHardware improvements + Software improvements1.22 yearsHardware improvements + Rising investments3.12 monthsHardware improvements + Rising investments + Software improvements2.52 months Larger compute budgets and a better understanding of how to effectively use compute (through, for example, using scaling laws) are two major driving forces of progress in recent Machine Learning. There are many ways to increase your effective compute budget: better hardware, rising investments in AI R&D and improvements in algorithmic efficiency. In this article we investigate one often-overlooked but plausibly important factor: how long—in terms of wall-clock time—you are willing to train your model for. Here we explore a simple mathematical framework for estimating the optimal duration of a training run. A researcher is tasked with training a model by some deadline, and must decide when to start their training run. The researcher is faced with a key problem: by delaying the training run, they can access better hardware, but by starting the training run soon, they can train the model for longer. Using estimates of the relevant parameters, we calculate the optimal training duration. We then explore six additional considerations, related to 1) how dollar-budgets for compute rise over time, 2) the rate at which algorithmic efficiency improves, 3) whether developers can upgrade their software over time 4) what would happen in a more realistic framework with stochastic growth, 5) whether it matters for the framework that labs are not explicitly optimizing for optimal training runs and 6) what would happen if they rent instead of buy hardware. Our conclusion depends on whether the researcher is able to upgrade their hardware stack while training. If they aren't able to upgrade hardware, optimal training runs will likely last less than 3 months. If the researcher can upgrade their hardware stack during training, optimal training runs will last less than 1.2 years. These numbers are likely to be overestimates, since 1) we use a conservative estimate of software progress, 2) real-world uncertainty pushes developers towards shorter training runs and 3) renting hardware creates an incentive to wait for longer and paralellize the run close to the deadline. ScenarioLongest training runHardware improvements3.55 yearsHardware improvements + Software improvements1.22 yearsHardware improvements + Rising investments3.12 monthsHardware improvements + Rising investments + Software improvements2.52 months A simple framework for training run lengths Consider a researcher who wants to train a model by some deadline T. The researcher is deciding when to start the training run in order to maximize the amount of compute per dollar. The researcher is faced with a key trade-off. On one hand, they want to delay the run to access improved hardware (and/or other things like larger dollar-budgets and better algorithms.) On the other hand, a delay reduces the wall-clock time that the model is trained for. Suppose that hardware price-performance is increasing as follows: where H0 is the initial FLOPS/$ and gH is the rate of yearly improvement. If we start a training run at time S, the cumulative FLOP/$ at time t≥S will be equal to: where H(S) is the price-performance of the available hardware when we start our run (in FLOP/$/time), and (t−S) is the amount of time since we started our run. Given a fixed dollar-budget, when should we buy...

    AF - Shapes of Mind and Pluralism in Alignment by Adam Shimi

    Play Episode Listen Later Aug 13, 2022 3:34


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shapes of Mind and Pluralism in Alignment, published by Adam Shimi on August 13, 2022 on The AI Alignment Forum. This work was done while at Conjecture. This post has been written for the first Refine blog post day, at the end of a week of readings, discussions, and exercises about epistemology for doing good conceptual research. I have recently presented my model behind the Refine incubator that I'm running. Yet in the two weeks since this post was published, multiple discussions helped me make legible an aspect of my intuitions that I didn't discuss in this post: the notion of different "shapes of mind". There are two points to this intuition: Different people will have different "shapes of mind" — ways of revealing hidden bits of evidence from the world; And alignment is the kind of hard problem where the bits of evidence are dispersed, such that there's no one-trick that is enough. I've given my current best model of the different forms of pluralism and when to use them in another recent post. What I want to explore here is the first point: this notion of shape of mind. For that, let's recall the geometric model of bits of evidence I introduced in Levels of Pluralism. We have a high-dimensional space with objects in it. The space is the problem and the objects are bits of evidence. Because we suck at high-dimensional geometry, we use frames/perspectives that reduce the dimensionality and highlight some aspects of the space. These are operationalizations. There are clusters of bits of evidence in the space (whether they are rich or poor). These clusters are veins of evidence. Here the shapes of mind are favored operationalizations — that is, the favored low-dimensional compression of the high-dimensional space where the bits of evidence lie. More precisely, a shape of mind is a cluster of "close" such transforms. What makes someone have a given shape of mind? (Education) One of the most obvious I've observed on how people tackle problems come from their background. For an alignment example, John tackle problems like a statistical physicist whereas Paul tackles problem like a theoretical computer scientists, leading to very different perspectives: True Names vs Building-Breaker. (Knowledge) What you know influences your shape of mind, since you can see and link more things. But when I'm talking about knowing here, I mean the kind of deep knowledge that framing exercises are supposed to provide. (Past Life) This one feels easily missed by most people; after all, why should what you did in your past (especially personal life) should influence your scientific research? Because it clearly does. A notable example is how easy it becomes to see hidden assumptions about how the world is supposed to work when you don't come from a background where they make any sense. One thing this handle makes clear is the difference between my model for different programs as Refine, SERI MATS, and PIBBSS respectively aim at: Refine is looking for different shapes of mind that the mind currently at work in conceptual alignment, and aims at empowering them to contribute productively. MATS (according to my current model) is mainly looking for shapes of mind closed to archetypal ones (the mentors), and focuses on making them fit for alignment research by helping them approach this initial example (while still maintaining enough diversity for productive disagreement). PIBBSS is looking for new shapes of minds, but ones that are visibly relevant and useful from existing object-level shapes of mind. That is, PIBBSS starts with current conceptual alignment researchers and the shapes of mind that they feel they might get something out of in their own research. I'm excited to finally be in a field with all three. Thus we can see framing exercises as a way of shaping your mind to see the hidden bit...

    AF - chinchilla's wild implications by nostalgebraist

    Play Episode Listen Later Jul 31, 2022 20:25


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: chinchilla's wild implications, published by nostalgebraist on July 31, 2022 on The AI Alignment Forum. (Colab notebook here.) This post is about language model scaling laws, specifically the laws derived in the DeepMind paper that introduced Chinchilla. The paper came out a few months ago, and has been discussed a lot, but some of its implications deserve more explicit notice in my opinion. In particular: Data, not size, is the currently active constraint on language modeling performance. Current returns to additional data are immense, and current returns to additional model size are miniscule; indeed, most recent landmark models are wastefully big. If we can leverage enough data, there is no reason to train ~500B param models, much less 1T or larger models. If we have to train models at these large sizes, it will mean we have encountered a barrier to exploitation of data scaling, which would be a great loss relative to what would otherwise be possible. The literature is extremely unclear on how much text data is actually available for training. We may be "running out" of general-domain data, but the literature is too vague to know one way or the other. The entire available quantity of data in highly specialized domains like code is woefully tiny, compared to the gains that would be possible if much more such data were available. Some things to note at the outset: This post assumes you have some familiarity with LM scaling laws. As in the paper, I'll assume here that models never see repeated data in training. This simplifies things: we don't need to draw a distinction between data size and step count, or between train loss and test loss. I focus on the parametric scaling law from the paper's "Approach 3," because it's provides useful intuition. Keep in mind, though, that Approach 3 yielded somewhat different results from Approaches 1 and 2 (which agreed with one another, and were used to determine Chinchilla's model and data size). So you should take the exact numbers below with a grain of salt. They may be off by a few orders of magnitude (but not many orders of magnitude). 1. the scaling law The paper fits a scaling law for LM loss L, as a function of model size N and data size D. Its functional form is very simple, and easier to reason about than the L(N,D) law from the earlier Kaplan et al papers. It is a sum of three terms: The first term only depends on the model size. The second term only depends on the data size. And the third term is a constant. You can think about this as follows. An "infinitely big" model, trained on "infinite data," would achieve loss E. To get the loss for a real model, you add on two "corrections": one for the fact that the model's only has N parameters, not infinitely many one for the fact that the model only sees D training examples, not infinitely many Here's the same thing, with the constants fitted to DeepMind's experiments on the MassiveText dataset. plugging in real models Gopher is a model with 280B parameters, trained on 300B tokens of data. What happens if we plug in those numbers? What jumps out here is that the "finite model" term is tiny. In terms of the impact on LM loss, Gopher's parameter count might as well be infinity. There's a little more to gain on that front, but not much. Scale the model up to 500B params, or 1T params, or 100T params, or 3^^^3 params . . . and the most this can ever do for you is an 0.052 reduction in loss. Meanwhile, the "finite data" term is not tiny. Gopher's training data size is very much not infinity, and we can go a long way by making it bigger. Chinchilla is a model with the same training compute cost as Gopher, allocated more evenly between the two terms in the equation. It's 70B params, trained on 1.4T tokens of data. Let's plug that in: Much better! Without using any more compute, we...

    AF - Levels of Pluralism by Adam Shimi

    Play Episode Listen Later Jul 27, 2022 24:34


    Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Levels of Pluralism, published by Adam Shimi on July 27, 2022 on The AI Alignment Forum. This work was done while at Conjecture. Preparing the Pluralism Question When do we want all our research eggs in the same paradigm basket? Although most people don't go as far as the extreme paradigmatism of Thomas Kuhn in The Structure of Scientific Revolutions, which only allows one paradigm at time (for a given science), the preference for less rather than more options is still pervasive. In the ideal, a convergence to one, even if it's not always feasible. After all, there's only one correct answer, right? Putting that debatable question aside, I've become more and more convinced that pluralism, the pursuit of multiple lines of research in parallel, is far more prevalent and integral to the process of science than Kuhn's naive paradigmatism. This realization has emerged from studying History and Philosophy of Science, especially outside of physics which for many reasons is quite an unrepresentative science. But one crucial preliminary point is that pluralism can appear at multiple levels. And the value of pluralism also depends on the level at which it is applied. So this post proposes a decomposition of the activity of research into four levels, and introduces the corresponding pluralism at each level. Although the point here is not (yet) to argue for pluralism, I offer some examples of pluralistic successes, as well as arguments for the epistemic circumstances where pluralism seems the most valuable. I also finish by proposing a geometric model for when each level of pluralism makes sense, based around considering bits of evidence as objects in high-dimensional space. The four levels of pluralism I discuss are: Individual pluralism: pluralism of the methods, ideas, and analogies used by a single researcher or a single research tradition. Approach pluralism: pluralism of approaches to the same operationalization of the problem. Operationalization pluralism: pluralism in the way that the problem itself is operationalized. Problem pluralism: pluralism in the problem itself. Thanks to Andrea Motti for feedback on a draft of this post. Simplifying Assumption: Focus on Epistemic Circumstances When investigating under which circumstances some epistemic strategy applies, there are many confusing and complicating factors coming from psychology and sociology. Taking pluralism as an example, the following non-exhaustive list comes to mind: How easy/difficult is it for researchers to keep multiple approaches at different levels? How confusing is it for researchers to keep multiple approaches at different levels? Do we have enough resources for implementing the ideal level of pluralism? How should we implement it, given the social structures and the psychological difficulties? My stance here, and more generally, is to neglect these issues so I can focus on the ideal epistemic algorithm under the circumstances studied. The rationale is that the sociological and psychological factors can be better dealt with once we know the ideal strategy, and removing them gives us an idealization that is easier to work with. In some this emulates how most physics approximations remove the details (often ultimately important details) to get to the core insight. Although I expect this to work, note that this is in tension with epistemological vigilance: there is some chance that the sociological and psychological factors matter so much that it makes more sense to include part of them in the ideal answer. Levels: from Individuals to Problems Individual Pluralism If we zoom in on a particular research approach or tradition, we might expect to be too low-level for pluralism to appear. Yet pluralism isn't only about portfolios of approaches — a single tradition can be pluralist in its methods, ideas, and an...

    Claim The Nonlinear Library: Alignment Forum Weekly

    In order to claim this podcast we'll send an email to with a verification link. Simply click the link and you will be able to edit tags, request a refresh, and other features to take control of your podcast page!

    Claim Cancel