POPULARITY
People often talk about “solving the alignment problem.” But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes. In brief, I'll say that you've solved the alignment problem if you've: (1) avoided a bad form of AI takeover, (2) built the dangerous kind of superintelligent AI agents, (3) gained access to the main benefits of superintelligence, and (4) become able to elicit some significant portion of those benefits from some of the superintelligent AI agents at stake in (2).[1]

The post also discusses what it would take to do this. In particular: I discuss various options for avoiding bad takeover, notably: avoiding what I call “vulnerability to alignment” conditions; ensuring that AIs don't try to take over; preventing such attempts from succeeding; trying to ensure that AI takeover is somehow OK. (The alignment [...]

---

Outline:
(03:46) 1. Avoiding vs. handling vs. solving the problem
(15:32) 2. A framework for thinking about AI safety goals
(19:33) 3. Avoiding bad takeover
(24:03) 3.1 Avoiding vulnerability-to-alignment conditions
(27:18) 3.2 Ensuring that AI systems don't try to take over
(32:02) 3.3 Ensuring that takeover efforts don't succeed
(33:07) 3.4 Ensuring that the takeover in question is somehow OK
(41:55) 3.5 What's the role of “corrigibility” here?
(42:17) 3.5.1 Some definitions of corrigibility
(50:10) 3.5.2 Is corrigibility necessary for “solving alignment”?
(53:34) 3.5.3 Does ensuring corrigibility raise issues that avoiding takeover does not?
(55:46) 4. Desired elicitation
(01:05:17) 5. The role of verification
(01:09:24) 5.1 Output-focused verification and process-focused verification
(01:16:14) 5.2 Does output-focused verification unlock desired elicitation?
(01:23:00) 5.3 What are our options for process-focused verification?
(01:29:25) 6. Does solving the alignment problem require some very sophisticated philosophical achievement re: our values on reflection?
(01:38:05) 7. Wrapping up

The original text contained 27 footnotes which were omitted from this narration. The original text contained 3 images which were described by AI.

---

First published: August 24th, 2024

Source: https://www.lesswrong.com/posts/AFdvSBNgN2EkAsZZA/what-is-it-to-solve-the-alignment-problem-1

---

Narrated by TYPE III AUDIO.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What is it to solve the alignment problem?, published by Joe Carlsmith on August 27, 2024 on LessWrong.

People often talk about "solving the alignment problem." But what is it to do such a thing? I wanted to clarify my thinking about this topic, so I wrote up some notes. In brief, I'll say that you've solved the alignment problem if you've: 1. avoided a bad form of AI takeover, 2. built the dangerous kind of superintelligent AI agents, 3. gained access to the main benefits of superintelligence, and 4. become able to elicit some significant portion of those benefits from some of the superintelligent AI agents at stake in (2).[1]

The post also discusses what it would take to do this. In particular:

- I discuss various options for avoiding bad takeover, notably: Avoiding what I call "vulnerability to alignment" conditions; Ensuring that AIs don't try to take over; Preventing such attempts from succeeding; Trying to ensure that AI takeover is somehow OK. (The alignment discourse has been surprisingly interested in this one; but I think it should be viewed as an extreme last resort.)
- I discuss different things people can mean by the term "corrigibility"; I suggest that the best definition is something like "does not resist shut-down/values-modification"; and I suggest that we can basically just think about incentives for/against corrigibility in the same way we think about incentives for/against other types of problematic power-seeking, like actively seeking to gain resources. I also don't think you need corrigibility to avoid takeover; and I think avoiding takeover should be our focus.
- I discuss the additional role of eliciting desired forms of task-performance, even once you've succeeded at avoiding takeover, and I modify the incentives framework I offered in a previous post to reflect the need for the AI to view desired task-performance as the best non-takeover option.
- I examine the role of different types of "verification" in avoiding takeover and eliciting desired task-performance. In particular:
  - I distinguish between what I call "output-focused" verification and "process-focused" verification, where the former, roughly, focuses on the output whose desirability you want to verify, whereas the latter focuses on the process that produced that output.
  - I suggest that we can view large portions of the alignment problem as the challenge of handling shifts in the amount we can rely on output-focused verification (or at least, our current mechanisms for output-focused verification).
  - I discuss the notion of "epistemic bootstrapping" - i.e., building up from what we can verify, whether by process-focused or output-focused means, in order to extend our epistemic reach much further - as an approach to this challenge.[2]
  - I discuss the relationship between output-focused verification and the "no sandbagging on checkable tasks" hypothesis about capability elicitation.
  - I discuss some example options for process-focused verification.
- Finally, I express skepticism that solving the alignment problem requires imbuing a superintelligent AI with intrinsic concern for our "extrapolated volition" or our "values-on-reflection."
In particular, I think just getting an "honest question-answerer" (plus the ability to gate AI behavior on the answers to various questions) is probably enough, since we can ask it the sorts of questions we wanted extrapolated volition to answer. (And it's not clear that avoiding flagrantly-bad behavior, at least, required answering those questions anyway.) Thanks to Carl Shulman, Lukas Finnveden, and Ryan Greenblatt for discussion. 1. Avoiding vs. handling vs. solving the problem What is it to solve the alignment problem? I think the standard at stake can be quite hazy. And when initially reading Bostrom and Yudkowsky, I think the image that built up most prominently i...
Chatted with Joe Carlsmith about whether we can trust power/techno-capital, how to not end up like Stalin in our urge to control the future, gentleness towards the artificial Other, and much more. Check out Joe's sequence on Otherness and Control in the Age of AGI here.

Watch on YouTube. Listen on Apple Podcasts, Spotify, or any other podcast platform. Read the full transcript here. Follow me on Twitter for updates on future episodes.

Sponsors:
- Bland.ai is an AI agent that automates phone calls in any language, 24/7. Their technology uses "conversational pathways" for accurate, versatile communication across sales, operations, and customer support. You can try Bland yourself by calling 415-549-9654. Enterprises can get exclusive access to their advanced model at bland.ai/dwarkesh.
- Stripe is financial infrastructure for the internet. Millions of companies from Anthropic to Amazon use Stripe to accept payments, automate financial processes and grow their revenue.

If you're interested in advertising on the podcast, check out this page.

Timestamps:
(00:00:00) - Understanding the Basic Alignment Story
(00:44:04) - Monkeys Inventing Humans
(00:46:43) - Nietzsche, C.S. Lewis, and AI
(1:22:51) - How should we treat AIs
(1:52:33) - Balancing Being a Humanist and a Scholar
(2:05:02) - Explore exploit tradeoffs and AI

Get full access to Dwarkesh Podcast at www.dwarkeshpatel.com/subscribe
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ten arguments that AI is an existential risk, published by KatjaGrace on August 13, 2024 on LessWrong.

This is a snapshot of a new page on the AI Impacts Wiki. We've made a list of arguments[1] that AI poses an existential risk to humanity. We'd love to hear how you feel about them in the comments and polls.

Competent non-aligned agents

Summary:
1. Humans will build AI systems that are 'agents', i.e. they will autonomously pursue goals
2. Humans won't figure out how to make systems with goals that are compatible with human welfare and realizing human values
3. Such systems will be built or selected to be highly competent, and so gain the power to achieve their goals
4. Thus the future will be primarily controlled by AIs, who will direct it in ways that are at odds with long-run human welfare or the realization of human values

Selected counterarguments:
- It is unclear that AI will tend to have goals that are bad for humans
- There are many forms of power. It is unclear that a competence advantage will ultimately trump all others in time
- This argument also appears to apply to human groups such as corporations, so we need an explanation of why those are not an existential risk

People who have favorably discussed[2] this argument (specific quotes here): Paul Christiano (2021), Ajeya Cotra (2023), Eliezer Yudkowsky (2024), Nick Bostrom (2014[3]).

See also: Full wiki page on the competent non-aligned agents argument

Second species argument

Summary:
1. Human dominance over other animal species is primarily due to humans having superior cognitive and coordination abilities
2. Therefore if another 'species' appears with abilities superior to those of humans, that species will become dominant over humans in the same way
3. AI will essentially be a 'species' with superior abilities to humans
4. Therefore AI will dominate humans

Selected counterarguments:
- Human dominance over other species is plausibly not due to the cognitive abilities of individual humans, but rather because of human ability to communicate and store information through culture and artifacts
- Intelligence in animals doesn't appear to generally relate to dominance. For instance, elephants are much more intelligent than beetles, and it is not clear that elephants have dominated beetles
- Differences in capabilities don't necessarily lead to extinction. In the modern world, more powerful countries arguably control less powerful countries, but they do not wipe them out and most colonized countries have eventually gained independence

People who have favorably discussed this argument (specific quotes here): Joe Carlsmith (2024), Richard Ngo (2020), Stuart Russell (2020[4]), Nick Bostrom (2015).

See also: Full wiki page on the second species argument

Loss of control via inferiority

Summary:
1. AI systems will become much more competent than humans at decision-making
2. Thus most decisions will probably be allocated to AI systems
3. If AI systems make most decisions, humans will lose control of the future
4. If humans have no control of the future, the future will probably be bad for humans

Selected counterarguments:
- Humans do not generally seem to become disempowered by possession of software that is far superior to them, even if it makes many 'decisions' in the process of carrying out their will
- In the same way that humans avoid being overpowered by companies, even though companies are more competent than individual humans, humans can track AI trustworthiness and have AI systems compete for them as users. This might substantially mitigate untrustworthy AI behavior

People who have favorably discussed this argument (specific quotes here): Paul Christiano (2014), Ajeya Cotra (2023), Richard Ngo (2024).

See also: Full wiki page on loss of control via inferiority

Loss of control via speed

Summary:
1. Advances in AI will produce...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Value fragility and AI takeover, published by Joe Carlsmith on August 5, 2024 on LessWrong.

1. Introduction

"Value fragility," as I'll construe it, is the claim that slightly-different value systems tend to lead in importantly-different directions when subject to extreme optimization. I think the idea of value fragility haunts the AI risk discourse in various ways - and in particular, that it informs a backdrop prior that adequately aligning a superintelligence requires an extremely precise and sophisticated kind of technical and ethical achievement. That is, the thought goes: if you get a superintelligence's values even slightly wrong, you're screwed.

This post is a collection of loose and not-super-organized reflections on value fragility and its role in arguments for pessimism about AI risk. I start by trying to tease apart a number of different claims in the vicinity of value fragility. In particular:

- I distinguish between questions about value fragility and questions about how different agents would converge on the same values given adequate reflection.
- I examine whether "extreme" optimization is required for worries about value fragility to go through (I think it at least makes them notably stronger), and I reflect a bit on whether, even conditional on creating super-intelligence, we might be able to avoid a future driven by relevantly extreme optimization.
- I highlight questions about whether multipolar scenarios alleviate concerns about value fragility, even if your exact values don't get any share of the power. My sense is that people often have some intuition that multipolarity helps notably in this respect; but I don't yet see a very strong story about why. If readers have stories that they find persuasive in this respect, I'd be curious to hear.

I then turn to a discussion of a few different roles that value fragility, if true, could play in an argument for pessimism about AI risk. In particular, I distinguish between:
1. The value of what a superintelligence does after it takes over the world, assuming that it does so.
2. What sorts of incentives a superintelligence has to try to take over the world, in a context where it can do so extremely easily via a very wide variety of methods.
3. What sorts of incentives a superintelligence has to try to take over the world, in a context where it can't do so extremely easily via a very wide variety of methods.

Yudkowsky's original discussion of value fragility is most directly relevant to (1). And I think it's actually notably irrelevant to (2). In particular, I think the basic argument for expecting AI takeover in a (2)-like scenario doesn't require value fragility to go through - and indeed, some conceptions of "AI alignment" seem to expect a "benign" form of AI takeover even if we get a superintelligence's values exactly right. Here, though, I'm especially interested in understanding (3)-like scenarios - that is, the sorts of incentives that apply to a superintelligence in a case where it can't just take over the world very easily via a wide variety of methods. Here, in particular, I highlight the role that value fragility can play in informing the AI's expectations with respect to the difference in value between worlds where it does not take over, and worlds where it does.

In this context, that is, value fragility can matter to how the AI feels about a world where humans do retain control - rather than solely to how humans feel about a world where the AI takes over. I close with a brief discussion of how commitments to various forms of "niceness" and intentional power-sharing, if made sufficiently credible, could help diffuse the sorts of adversarial dynamics that value fragility can create.

2. Variants of value fragility

What is value fragility? Let's start with some high-level definitions and clarifications. ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A framework for thinking about AI power-seeking, published by Joe Carlsmith on July 24, 2024 on The AI Alignment Forum.

This post lays out a framework I'm currently using for thinking about when AI systems will seek power in problematic ways. I think this framework adds useful structure to the too-often-left-amorphous "instrumental convergence thesis," and that it helps us recast the classic argument for existential risk from misaligned AI in a revealing way. In particular, I suggest, this recasting highlights how much classic analyses of AI risk load on the assumption that the AIs in question are powerful enough to take over the world very easily, via a wide variety of paths. If we relax this assumption, I suggest, the strategic trade-offs that an AI faces, in choosing whether or not to engage in some form of problematic power-seeking, become substantially more complex.

Prerequisites for rational takeover-seeking

For simplicity, I'll focus here on the most extreme type of problematic AI power-seeking - namely, an AI or set of AIs actively trying to take over the world ("takeover-seeking"). But the framework I outline will generally apply to other, more moderate forms of problematic power-seeking as well - e.g., interfering with shut-down, interfering with goal-modification, seeking to self-exfiltrate, seeking to self-improve, more moderate forms of resource/control-seeking, deceiving/manipulating humans, acting to support some other AI's problematic power-seeking, etc.[2] Just substitute in one of those forms of power-seeking for "takeover" in what follows.

I'm going to assume that in order to count as "trying to take over the world," or to participate in a takeover, an AI system needs to be actively choosing a plan partly in virtue of predicting that this plan will conduce towards takeover.[3] And I'm also going to assume that this is a rational choice from the AI's perspective.[4] This means that the AI's attempt at takeover-seeking needs to have, from the AI's perspective, at least some realistic chance of success - and I'll assume, as well, that this perspective is at least decently well-calibrated. We can relax these assumptions if we'd like - but I think that the paradigmatic concern about AI power-seeking should be happy to grant them.

What's required for this kind of rational takeover-seeking? I think about the prerequisites in three categories:
- Agential prerequisites - that is, necessary structural features of an AI's capacity for planning in pursuit of goals.
- Goal-content prerequisites - that is, necessary structural features of an AI's motivational system.
- Takeover-favoring incentives - that is, the AI's overall incentives and constraints combining to make takeover-seeking rational.

Let's look at each in turn.

Agential prerequisites

In order to be the type of system that might engage in successful forms of takeover-seeking, an AI needs to have the following properties:
1. Agentic planning capability: the AI needs to be capable of searching over plans for achieving outcomes, choosing between them on the basis of criteria, and executing them.
2. Planning-driven behavior: the AI's behavior, in this specific case, needs to be driven by a process of agentic planning.
   - Note that this isn't guaranteed by agentic planning capability.
     - For example, an LLM might be capable of generating effective plans, in the sense that that capability exists somewhere in the model, but it could nevertheless be the case that its output isn't driven by a planning process in a given case - i.e., it's not choosing its text output via a process of predicting the consequences of that text output, thinking about how much it prefers those consequences to other consequences, etc.
     - And note that human behavior isn't always driven by a process of agentic planning, either, despite our ...
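One illustrative way to read the "takeover-favoring incentives" condition in the entry above (a minimal sketch under assumed notation, not the post's own formalization): write p for the AI's credence that a takeover attempt succeeds, U_succ for how much it values a successful takeover, U_fail for how much it values a failed attempt, and U_alt for how much it values its best non-takeover option. Rational takeover-seeking then roughly requires

\[ p \cdot U_{\mathrm{succ}} + (1 - p) \cdot U_{\mathrm{fail}} > U_{\mathrm{alt}} \]

which is one way of seeing why the recasting above puts so much weight on whether the AI can take over the world "very easily, via a wide variety of paths" - i.e., on p being close to 1 and the costs of failure mattering little.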
(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

This is the final essay in a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the series as a whole. There's also a PDF of the whole series here.

Warning: spoilers for Angels in America; and moderate spoilers for Harry Potter and the Methods of Rationality.)

"I come into the presence of still water..."

~Wendell Berry

A lot of this series has been about problems with yang - that is, with the active element in the duality of activity vs. receptivity, doing vs. not-doing, controlling vs. letting go.[1] In particular, I've been interested in the ways that "deep atheism" (that is, a fundamental [...]

---
Artificial General Intelligence (AGI) Show with Soroush Pour
We speak with Katja Grace. Katja is the co-founder and lead researcher at AI Impacts, a research group trying to answer key questions about the future of AI — when certain capabilities will arise, what will AI look like, how it will all go for humanity.

We talk to Katja about:
* How AI Impacts' latest rigorous survey of leading AI researchers shows they've dramatically reduced their timelines to when AI will successfully tackle all human tasks & occupations.
* The survey's methodology and why we can be confident in its results
* Responses to the survey
* Katja's journey into the field of AI forecasting
* Katja's thoughts about the future of AI, given her long tenure studying AI futures and its impacts

Hosted by Soroush Pour. Follow me for more AGI content:
Twitter: https://twitter.com/soroushjp
LinkedIn: https://www.linkedin.com/in/soroushjp/

== Show links ==

-- Follow Katja --
* Website: https://katjagrace.com/
* Twitter: https://x.com/katjagrace

-- Further resources --
* The 2023 survey of AI researchers' views: https://wiki.aiimpacts.org/ai_timelines/predictions_of_human-level_ai_timelines/ai_timeline_surveys/2023_expert_survey_on_progress_in_ai
* AI Impacts: https://aiimpacts.org/
* AI Impacts' Substack: https://blog.aiimpacts.org/
* Joe Carlsmith on Power Seeking AI: https://arxiv.org/abs/2206.13353
* Abbreviated version: https://joecarlsmith.com/2023/03/22/existential-risk-from-power-seeking-ai-shorter-version
* Fragile World hypothesis by Nick Bostrom: https://nickbostrom.com/papers/vulnerable.pdf

Recorded Feb 22, 2024
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why so many "racists" at Manifest?, published by Austin on June 18, 2024 on The Effective Altruism Forum. Manifest 2024 is a festival that we organized last weekend in Berkeley. By most accounts, it was a great success. On our feedback form, the average response to "would you recommend to a friend" was a 9.0/10. Reviewers said nice things like "one of the best weekends of my life" and "dinners and meetings and conversations with people building local cultures so achingly beautiful they feel almost like dreams" and "I've always found tribalism mysterious, but perhaps that was just because I hadn't yet found my tribe." Arnold Brooks running a session on Aristotle's Metaphysics. More photos of Manifest here. However, a recent post on The Guardian and review on the EA Forum highlight an uncomfortable fact: we invited a handful of controversial speakers to Manifest, whom these authors call out as "racist". Why did we invite these folks? First: our sessions and guests were mostly not controversial - despite what you may have heard Here's the schedule for Manifest on Saturday: (The largest & most prominent talks are on the left. Full schedule here.) And here's the full list of the 57 speakers we featured on our website: Nate Silver, Luana Lopes Lara, Robin Hanson, Scott Alexander, Niraek Jain-sharma, Byrne Hobart, Aella, Dwarkesh Patel, Patrick McKenzie, Chris Best, Ben Mann, Eliezer Yudkowsky, Cate Hall, Paul Gu, John Phillips, Allison Duettmann, Dan Schwarz, Alex Gajewski, Katja Grace, Kelsey Piper, Steve Hsu, Agnes Callard, Joe Carlsmith, Daniel Reeves, Misha Glouberman, Ajeya Cotra, Clara Collier, Samo Burja, Stephen Grugett, James Grugett, Javier Prieto, Simone Collins, Malcolm Collins, Jay Baxter, Tracing Woodgrains, Razib Khan, Max Tabarrok, Brian Chau, Gene Smith, Gavriel Kleinwaks, Niko McCarty, Xander Balwit, Jeremiah Johnson, Ozzie Gooen, Danny Halawi, Regan Arntz-Gray, Sarah Constantin, Frank Lantz, Will Jarvis, Stuart Buck, Jonathan Anomaly, Evan Miyazono, Rob Miles, Richard Hanania, Nate Soares, Holly Elmore, Josh Morrison. Judge for yourself; I hope this gives a flavor of what Manifest was actually like. Our sessions and guests spanned a wide range of topics: prediction markets and forecasting, of course; but also finance, technology, philosophy, AI, video games, politics, journalism and more. We deliberately invited a wide range of speakers with expertise outside of prediction markets; one of the goals of Manifest is to increase adoption of prediction markets via cross-pollination. Okay, but there sure seemed to be a lot of controversial ones… I was the one who invited the majority (~40/60) of Manifest's special guests; if you want to get mad at someone, get mad at me, not Rachel or Saul or Lighthaven; certainly not the other guests and attendees of Manifest. My criteria for inviting a speaker or special guest was roughly, "this person is notable, has something interesting to share, would enjoy Manifest, and many of our attendees would enjoy hearing from them". Specifically: Richard Hanania - I appreciate Hanania's support of prediction markets, including partnering with Manifold to run a forecasting competition on serious geopolitical topics and writing to the CFTC in defense of Kalshi. 
(In response to backlash last year, I wrote a post on my decision to invite Hanania, specifically) Simone and Malcolm Collins - I've enjoyed their Pragmatist's Guide series, which goes deep into topics like dating, governance, and religion. I think the world would be better with more kids in it, and thus support pronatalism. I also find the two of them to be incredibly energetic and engaging speakers IRL. Jonathan Anomaly - I attended a talk Dr. Anomaly gave about the state-of-the-art on polygenic embryonic screening. I was very impressed that something long-considered scien...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Loving a world you don't trust, published by Joe Carlsmith on June 18, 2024 on LessWrong.

(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.)

This is the final essay in a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the series as a whole. There's also a PDF of the whole series here. Warning: spoilers for Angels in America; and moderate spoilers for Harry Potter and the Methods of Rationality.)

"I come into the presence of still water..."

~Wendell Berry

A lot of this series has been about problems with yang - that is, with the active element in the duality of activity vs. receptivity, doing vs. not-doing, controlling vs. letting go.[1] In particular, I've been interested in the ways that "deep atheism" (that is, a fundamental mistrust towards Nature, and towards bare intelligence) can propel itself towards an ever-more yang-y, controlling relationship to Otherness, and to the universe as a whole. I've tried to point at various ways this sort of control-seeking can go wrong in the context of AGI, and to highlight a variety of less-controlling alternatives (e.g. "gentleness," "liberalism/niceness/boundaries," and "green") that I think have a role to play.[2]

This is the final essay in the series. And because I've spent so much time on potential problems with yang, and with deep atheism, I want to close with an effort to make sure I've given both of them their due, and been clear about my overall take. To this end, the first part of the essay praises certain types of yang directly, in an effort to avoid over-correction towards yin. The second part praises something quite nearby to deep atheism that I care about a lot - something I call "humanism." And the third part tries to clarify the depth of atheism I ultimately endorse. In particular, I distinguish between trust in the Real, and various other attitudes towards it - attitudes like love, reverence, loyalty, and forgiveness. And I talk about ways these latter attitudes can still look the world's horrors in the eye.

In praise of yang

Let's start with some words in praise of yang.

In praise of black

Recall "black," from my essay on green. Black, on my construal of the colors, is the color for power, effectiveness, instrumental rationality - and hence, perhaps, the color most paradigmatically associated with yang. And insofar as I was especially interested in green qua yin, black was green's most salient antagonist. So I want to be clear: I think black is great.[3] Or at least, some aspects of it. Not black qua ego. Not black that wants power and domination for its sake.[4] Rather: black as the color of not fucking around. Of cutting through the bullshit; rejecting what Lewis calls "soft soap"; refusing to pretend things are prettier, or easier, or more comfortable; holding fast to the core thing. I wrote, in my essay on sincerity, about the idea of "seriousness." Black, I think, is the most paradigmatically serious color.
And it's the color of what Yudkowsky calls "the void" - that nameless, final virtue of rationality; the one that carries your movement past your map, past the performance of effort, and into contact with the true goal.[5] Yudkowsky cites Miyamoto Musashi: The primary thing when you take a sword in your hands is your intention to cut the enemy, whatever the means... If you think only of hitting, springing, striking or touching the enemy, you will not be able actually to cut him. More than anything, you must be thinking of carrying your movement through to cutting him. Musashi (image source here) In this sense, I think, black is the color of actually caring. That is: one becomes serious, centrally, when there are stak...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On "first critical tries" in AI alignment, published by Joe Carlsmith on June 5, 2024 on The AI Alignment Forum.

People sometimes say that AI alignment is scary partly (or perhaps: centrally) because you have to get it right on the "first critical try," and can't learn from failures.[1] What does this mean? Is it true? Does there need to be a "first critical try" in the relevant sense? I've sometimes felt confused about this, so I wrote up a few thoughts to clarify.

I start with a few miscellaneous conceptual points. I then focus in on a notion of "first critical try" tied to the first point (if there is one) when AIs get a "decisive strategic advantage" (DSA) over humanity - that is, roughly, the ability to kill/disempower all humans if they try.[2] I further distinguish between four different types of DSA:

- Unilateral DSA: Some AI agent could take over if it tried, even without the cooperation of other AI agents (see footnote for more on how I'm individuating AI agents).[3]
- Coordination DSA: If some set of AI agents coordinated to try to take over, they would succeed; and they could coordinate in this way if they tried.
- Short-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
- Long-term correlation DSA: If some set of AI agents all sought power in problematic ways within a relatively long period of time, even without coordinating, then ~all humans would be disempowered.

I also offer some takes on our prospects for just not ever having "first critical tries" from each type of DSA (via routes other than just not building superhuman AI systems at all). In some cases, just not having a "first critical try" in the relevant sense seems to me both plausible and worth working towards. In particular, I think we should try to make it the case that no single AI system is ever in a position to kill all humans and take over the world. In other cases, I think avoiding "first critical tries," while still deploying superhuman AI agents throughout the economy, is more difficult (though the difficulty of avoiding failure is another story).

Here's a chart summarizing my takes in more detail (for each type of DSA: its definition; the prospects for avoiding AIs ever getting this type of DSA - e.g., not having a "first critical try" for such a situation; and what's required for it to lead to doom).

Unilateral DSA
Definition: Some AI agent could take over if it tried, even without the cooperation of other AI agents.
Prospects for avoiding: Can avoid by making the world sufficiently empowered relative to each AI system. We should work towards this - e.g. aim to make it the case that no single AI system could kill/disempower all humans if it tried.
Required for doom: Requires only that this one agent tries to take over.

Coordination DSA
Definition: If some set of AI agents coordinated to try to take over, they would succeed; and they are able to so coordinate.
Prospects for avoiding: Harder to avoid than unilateral DSAs, due to the likely role of other AI agents in preventing unilateral DSAs. But could still avoid/delay by (a) reducing reliance on other AI agents for preventing unilateral DSAs, and (b) preventing coordination between AI agents.
Required for doom: Requires that all these agents try to take over, and that they coordinate.

Short-term correlation DSA
Definition: If some set of AI agents all sought power in problematic ways within a relatively short period of time, even without coordinating, then ~all humans would be disempowered.
Prospects for avoiding: Even harder to avoid than coordination DSAs, because doesn't require that the AI agents in question be able to coordinate.
Required for doom: Requires that within a relatively short period of time, all these agents choose to seek power in problematic ways, potentially without the ability to coordinate.

Long-term correlation DSA
Definition: If some set of AI agents all sought power in prob...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Response to nostalgebraist: proudly waving my moral-antirealist battle flag, published by Steven Byrnes on May 29, 2024 on LessWrong. @nostalgebraist has recently posted yet another thought-provoking post, this one on how we should feel about AI ruling a long-term posthuman future. [Previous discussion of this same post on lesswrong.] His post touches on some of the themes of Joe Carlsmith's "Otherness and Control in the Age of AI" series - a series which I enthusiastically recommend - but nostalgebraist takes those ideas much further, in a way that makes me want to push back. Nostalgebraist's post is casual, trying to reify and respond to a "doomer" vibe, rather than responding to specific arguments by specific people. Now, I happen to self-identify as a "doomer" sometimes. (Is calling myself a "doomer" bad epistemics and bad PR? Eh, I guess. But also: it sounds cool.) But I too have plenty of disagreements with others in the "doomer" camp (cf: "Rationalist (n.) Someone who disagrees with Eliezer Yudkowsky".). Maybe nostalgebraist and I have common ground? I dunno. Be that as it may, here are some responses to certain points he brings up. 1. The "notkilleveryoneism" pitch is not about longtermism, and that's fine Nostalgebraist is mostly focusing on longtermist considerations, and I'll mostly do that too here. But on our way there, in the lead-in, nostalgebraist does pause to make a point about the term "notkilleveryoneism": They call their position "notkilleveryoneism," to distinguish that position from other worries about AI which don't touch on the we're-all-gonna-die thing. And who on earth would want to be a not-notkilleveryoneist? But they do not mean, by these regular-Joe words, the things that a regular Joe would mean by them. We are, in fact, all going to die. Probably, eventually. AI or no AI. In a hundred years, if not fifty. By old age, if nothing else. You know what I mean.… OK, my understanding was: (1) we doomers are unhappy about the possibility of AI killing all humans because we're concerned that the resulting long-term AI future would be a future we don't want; and (2) we doomers are also unhappy about the possibility of AI killing all humans because we are human and we don't want to get murdered by AIs. And also, some of us have children with dreams of growing up and having kids of their own and being a famous inventor or oh wait actually I'd rather work for Nintendo on their Zelda team or hmm wait does Nintendo hire famous inventors? …And all these lovely aspirations again would require not getting murdered by AIs. If we think of the "notkilleveryoneism" term as part of a communication and outreach strategy, then it's a strategy that appeals to Average Joe's desire to not be murdered by AIs, and not to Average Joe's desires about the long-term future. And that's fine! Average Joe has every right to not be murdered, and honestly it's a safe bet that Average Joe doesn't have carefully-considered coherent opinions about the long-term future anyway. Sometimes there's more than one reason to want a problem to be solved, and you can lead with the more intuitive one. I don't think anyone is being disingenuous here (although see comment). 1.1 …But now let's get back to the longtermist stuff Anyway, that was kinda a digression from the longtermist stuff which forms the main subject of nostalgebraist's post. 
Suppose AI takes over, wipes out humanity, and colonizes the galaxy in a posthuman future. He and I agree that it's at least conceivable that this long-term posthuman future would be a bad future, e.g. if the AI was a paperclip maximizer. And he and I agree that it's also possible that it would be a good future, e.g. if there is a future full of life and love and beauty and adventure throughout the cosmos. Which will it be? Let's dive into that discus...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Express interest in an "FHI of the West", published by habryka on April 18, 2024 on LessWrong. TLDR: I am investigating whether to found a spiritual successor to FHI, housed under Lightcone Infrastructure, providing a rich cultural environment and financial support to researchers and entrepreneurs in the intellectual tradition of the Future of Humanity Institute. Fill out this form or comment below to express interest in being involved either as a researcher, entrepreneurial founder-type, or funder. The Future of Humanity Institute is dead: I knew that this was going to happen in some form or another for a year or two, having heard through the grapevine and private conversations of FHI's university-imposed hiring freeze and fundraising block, and so I have been thinking about how to best fill the hole in the world that FHI left behind. I think FHI was one of the best intellectual institutions in history. Many of the most important concepts[1] in my intellectual vocabulary were developed and popularized under its roof, and many crucial considerations that form the bedrock of my current life plans were discovered and explained there (including the concept of crucial considerations itself). With the death of FHI (as well as MIRI moving away from research towards advocacy), there no longer exists a place for broadly-scoped research on the most crucial considerations for humanity's future. The closest place I can think of that currently houses that kind of work is the Open Philanthropy worldview investigation team, which houses e.g. Joe Carlsmith, but my sense is Open Philanthropy is really not the best vehicle for that kind of work. While many of the ideas that FHI was working on have found traction in other places in the world (like right here on LessWrong), I do think that with the death of FHI, there no longer exists any place where researchers who want to think about the future of humanity in an open ended way can work with other people in a high-bandwidth context, or get operational support for doing so. That seems bad. So I am thinking about fixing it. Anders Sandberg, in his oral history of FHI, wrote the following as his best guess of what made FHI work: What would it take to replicate FHI, and would it be a good idea? Here are some considerations for why it became what it was: Concrete object-level intellectual activity in core areas and finding and enabling top people were always the focus. Structure, process, plans, and hierarchy were given minimal weight (which sometimes backfired - flexible structure is better than little structure, but as organization size increases more structure is needed). Tolerance for eccentrics. Creating a protective bubble to shield them from larger University bureaucracy as much as possible (but do not ignore institutional politics!). Short-term renewable contracts. [...] Maybe about 30% of people given a job at FHI were offered to have their contracts extended after their initial contract ran out. A side-effect was to filter for individuals who truly loved the intellectual work we were doing, as opposed to careerists. Valued: insights, good ideas, intellectual honesty, focusing on what's important, interest in other disciplines, having interesting perspectives and thoughts to contribute on a range of relevant topics. 
Deemphasized: the normal academic game, credentials, mainstream acceptance, staying in one's lane, organizational politics. Very few organizational or planning meetings. Most meetings were only to discuss ideas or present research, often informally. Some additional things that came up in a conversation I had with Bostrom himself about this: a strong culture that gives people guidance on what things to work on, and helps researchers and entrepreneurs within the organization coordinate; a bunch of logistical and operation...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sharing Reality with Walt Whitman [Video], published by michel on April 9, 2024 on The Effective Altruism Forum. In 1860, Walt Whitman addressed future generations with his poem "Crossing Brooklyn Ferry". On the shores of Brooklyn, he feels the same reality as "men and women of a generation, or ever so many generations hence," and he knows it: [...] I am with you, Just as you feel when you look on the river and sky, so I felt, Just as any of you is one of a living crowd, I was one of a crowd, Just as you are refresh'd by the gladness of the river and the bright flow, I was refresh'd, What thought you have of me now, I had as much of you - I laid in my stores in advance, I consider'd long and seriously of you before you were born. [...] I first heard this poem in Joe Carlsmith's essay "On future people, looking back on 21st century longtermism." I loved it. I happened to be going to New York a few weeks later, and I happen to enjoy making little videos. So, I made a video complementing Walt Whitman's poem with scenes from my Brooklyn visit, 160 years later. If you like this video or the poem, I recommend reading Joe Carlsmith's whole essay. Here's the section where Joe reacts to Walt Whitman's poem, with longtermism and the idea of "shared reality" in mind: It feels like Whitman is living, and writing, with future people - including, in some sense, myself - very directly in mind. He's saying to his readers: I was alive. You too are alive. We are alive together, with mere time as the distance. I am speaking to you. You are listening to me. I am looking at you. You are looking at me. If the basic longtermist empirical narrative sketched above is correct, and our descendants go on to do profoundly good things on cosmic scales, I have some hope they might feel something like this sense of "shared reality" with longtermists in the centuries following the industrial revolution - as well as with many others, in different ways, throughout human history, who looked to the entire future, and thought of what might be possible. In particular, I imagine our descendants looking back at those few centuries, and seeing some set of humans, amidst much else calling for attention, lifting their gaze, crunching a few numbers, and recognizing the outlines of something truly strange and extraordinary - that somehow, they live at the very beginning, in the most ancient past; that something immense and incomprehensible and profoundly important is possible, and just starting, and in need of protection. Thanks to Joe Carlsmith for letting me use his audio, and for writing his essay. Thanks to Lara Thurnherr and Finn Hambley for early feedback on the video. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Killing the moths, published by Bella on March 25, 2024 on The Effective Altruism Forum. This post was partly inspired by, and shares some themes with, this Joe Carlsmith post. My post (unsurprisingly) expresses fewer concepts with less clarity and resonance, but is hopefully of some value regardless. Content warning: description of animal death. I live in a small, one-bedroom flat in central London. Sometime in the summer of 2023, I started noticing moths around my flat. I didn't pay much attention to it, since they seemed pretty harmless: they obviously weren't food moths, since they were localised in my bedroom, and they didn't seem to be chewing holes in any of my clothes - months went by and no holes appeared. [1] The larvae only seemed to be in my carpet. Eventually, their numbers started increasing, so I decided to do something about it. I Googled humane and nonlethal ways to deal with moth infestations, but found nothing. There were lots of sources of nontoxic methods of dealing with moths - like putting out pheromone-laced glue traps, or baking them alive by heating up the air - but nothing that avoided killing them in the first place. Most moth repellents also contained insecticide. I found one repellent which claimed to be non-lethal, and then set about on a mission: One by one, I captured the adult moths in a large tupperware box, and transported them outside my flat. This was pretty hard to do, because they were both highly mobile and highly fragile. They were also really adept at finding tiny cracks to crawl into and hide from me. I tried to avoid killing or harming them during capture, but it was hard, and I probably killed 5% or so of them in the process. Then, I found the area where I thought they were mostly laying their eggs, and sprayed the nonlethal moth repellent that I found. I knew that if this method was successful, it'd be highly laborious and take a long time. But I figured that so long as I caught every adult I saw, their population would steadily decline, until eventually they fell below the minimum viable population. Also, some part of me knew that the moths were very unlikely to survive outside my flat, having adapted for indoor living and being not very weather resistant - but I mentally shrank away from this fact. As long as I wasn't killing them, I was good, right? After some time, it became clear this method wasn't working. Also, I was at my wit's end with the amount of time I was spending transporting moths. It was just too much. So, I decided to look into methods of killing them. I was looking for methods that: Were very effective in one application. After all, if I could kill them all at once, then I could avoid more laying eggs/hatching in the meantime, and minimise total deaths. Had a known mechanism of action, that was relatively quick & less suffering-intense. I called a number of pest control organisations. No, they said, they didn't know what kind of insecticide they used - it's…insecticide that kills moths (but it's nontoxic to humans, we promise!). So, I gave up on the idea of a known mechanism of action, and merely looked for efficacy. The pest control professionals I booked told me that, in order for their efficacy-guarantee to be valid, I needed to wash every item of clothing and soft furnishings that I owned, at 60. 
For a small person with a small washing machine, a lot of soft furnishings, and no car to take them to a laundrette… this was a really daunting task. And so - regrettably - I procrastinated. September became December, and then the moth population significantly decreased on its own. I was delighted - I thought that if the trend continued, I'd be spared the stress and moral compromise of killing them. But December became February, and the moths were back, in higher numbers than ever before. It was hard to wa...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On attunement, published by Joe Carlsmith on March 25, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.) "You, moon, You, Aleksander, fire of cedar logs. Waters close over us, a name lasts but an instant. Not important whether the generations hold us in memory. Great was that chase with the hounds for the unattainable meaning of the world." ~ Czeslaw Milosz, "Winter" "Poplars (Autumn)," by Claude Monet (image source here) My last essay examined a philosophical vibe that I (following others) call "green." Green is one of the five colors on the Magic the Gathering Color Wheel, which I've found (despite not playing Magic myself) an interesting way of classifying the sort of the energies that tend to animate people.[1] The colors, and their corresponding shticks-according-to-Joe, are: White: Morality. Blue: Knowledge. Black: Power. Red: Passion. Green: ... I haven't found a single word that I think captures green. Associations include: environmentalism, tradition, spirituality, hippies, stereotypes of Native Americans, Yoda, humility, wholesomeness, health, and yin. My last essay tried to bring the vibe that underlies these associations into clearer view, and to point at some ways that attempts by other colors to reconstruct green can miss parts of it. In particular, I focused on the way green cares about respect, in a sense that goes beyond "not trampling on the rights/interests of moral patients" (what I called "green-according-to-white"); and on the way green takes joy in (certain kinds of) yin, in a sense that contrasts with merely "accepting things you're too weak to change" (what I called "green-according-to-black"). In this essay, I want to turn to what is perhaps the most common and most compelling-to-me attempt by another color to reconstruct green - namely, "green-according-to-blue." On this story, green is about making sure that you don't act out of inadequate knowledge. Thus, for example: maybe you're upset about wild animal suffering. But green cautions you: if you try to remake that ecosystem to improve the lives of wild animals, you are at serious risk of not knowing-what-you're-doing. And see, also, the discourse about "Chesterton's fence," which attempts to justify deference towards tradition and the status quo via the sort of knowledge they might embody. I think humility in the face of the limits of our knowledge is, indeed, a big part of what's going on with green. But I think green cares about having certain kinds of knowledge too. But I think that the type of knowledge green cares about most isn't quite the same as the sort of knowledge most paradigmatically associated with blue. Let me say more about what I mean. How do you know what matters? "I went out to see what I could see..." ~ Annie Dillard, "Pilgrim at Tinker Creek" An 1828 watercolor of Tintern Abbey, by J.M.W. Turner (image source here) Blue, to me, most directly connotes knowledge in the sense of: science, "rationality," and making accurate predictions about the world. 
And there is a grand tradition of contrasting this sort of knowledge with various other types that seem less "heady" and "cognitive" - even without a clear sense of what exactly the contrast consists in. People talk, for example, about intuition; about system 1; about knowledge that lives in your gut and your body; about knowing "how" to do things (e.g. ride a bike); about more paradigmatically social/emotional forms of intelligence, and so on. And here, of course, the rationalists protest at the idea ...
This post was partly inspired by, and shares some themes with, this Joe Carlsmith post. My post (unsurprisingly) expresses fewer concepts with less clarity and resonance, but is hopefully of some value regardless. Content warning: description of animal death. I live in a small, one-bedroom flat in central London. Sometime in the summer of 2023, I started noticing moths around my flat. I didn't pay much attention to it, since they seemed pretty harmless: they obviously weren't food moths, since they were localised in my bedroom, and they didn't seem to be chewing holes in any of my clothes — months went by and no holes appeared. [1] The larvae only seemed to be in my carpet. Eventually, their numbers started increasing, so I decided to do something about it. I Googled humane and nonlethal ways to deal with moth infestations, but found nothing. There were lots of sources [...] The original text contained 3 footnotes which were omitted from this narration. --- First published: March 25th, 2024 Source: https://forum.effectivealtruism.org/posts/Ax5PwjqtrunQJgjsA/killing-the-moths --- Narrated by TYPE III AUDIO.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Video and transcript of presentation on Scheming AIs, published by Joe Carlsmith on March 22, 2024 on The AI Alignment Forum. (Cross-posted from my website.) This is the video and transcript for a ~45-minutes talk I gave in February 2024 about my report " Scheming AIs: Will AIs fake alignment during training in order to get power?" (slides available here). See also this podcast for a more conversational overview of similar content. Main talk Okay, hi everybody. Thanks for coming. So this talk is about whether advanced AIs will fake alignment during training in order to get power later. This is a behavior that I'm calling scheming, it's also often called deceptive alignment. Preliminaries So I'm going to start with a few preliminaries. As was mentioned, this talk is based on a public report called Scheming AIs available on arXiv. There's also an audio version on my podcast, Joe Carlsmith Audio, if you prefer that. So I encourage you to check that out. I'm going to try to cover many of the main points here, but the report is pretty long and so I'm not going to go into that much depth on any given point. So if you want more depth or you have burning objections or clarifications or you want to work on it, I encourage you to check out the full report. Second, I'm going to assume familiarity with: The basic argument for existential risk from misaligned AI. That is roughly the thought that advanced AI agents with goals that conflict with human goals would have instrumental incentives to seek power over humans, and potentially to disempower humanity entirely, an event I'll call takeover, AI takeover. If that story didn't sound familiar to you, I do have some other work on this topic which I would encourage you to check out. But some of the talk itself might make a little less sense, so I apologize for that. I'm also going to assume familiarity with a basic picture of how contemporary machine learning works. So very roughly, imagine models with lots of parameters that are being updated via stochastic gradient descent (SGD), such that they perform better according to some feedback signal. And in particular, I'm often going to be imagining a default baseline training process that resembles somewhat what happens with current language models. Namely, a pre-training phase in which a model is trained on some combination of internet text and potentially other data. And then a fine-tuning phase, in which it's trained via some combination of maybe imitation learning or reinforcement learning. So that's the baseline picture I'm going to be imagining. There are other paradigms in which questions about scheming will arise, but I'm going to focus on this one. I'm also going to condition on the AIs I discuss being "goal directed". And what I mean by that is that these AIs are well understood and well predicted by thinking of them as making plans on the basis of models of the world in pursuit of objectives. This is not an innocuous assumption. And in fact I think confusions in this vicinity are one of my most salient candidates for how the AI alignment discourse as a whole might be substantially off base. But I want to set aside some questions about whether AIs are well understood as goal directed at all, from questions about conditional on them being goal directed, whether they will be schemers. 
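As a concrete reference point for the baseline training picture described in the talk above, here is a minimal, self-contained sketch of stochastic gradient descent on a toy model. It is my own illustration, not part of the talk: the data, model, and hyperparameters are invented, and real pre-training and fine-tuning pipelines are vastly larger, but the core loop (sample a batch, compute a feedback signal, nudge the parameters) has the same shape.

```python
import numpy as np

# Toy illustration (not from the talk): parameters nudged by stochastic
# gradient descent so the model scores better on a feedback signal
# (here, mean squared error on a small synthetic regression problem).
rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])
X = rng.normal(size=(1_000, 2))                        # synthetic "training data"
y = X @ true_w + rng.normal(scale=0.1, size=1_000)     # noisy targets

w = np.zeros(2)                                        # model parameters
lr, batch_size = 0.1, 32
for step in range(500):
    idx = rng.integers(0, len(X), size=batch_size)     # sample a random minibatch
    residual = X[idx] @ w - y[idx]
    grad = 2 * X[idx].T @ residual / batch_size        # gradient of the batch loss
    w -= lr * grad                                     # SGD step: move against the gradient

print("learned parameters:", w)                        # should approach [2.0, -3.0]
```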
So if your objection to scheming is "I don't think AIs will have goals at all," then that's a perfectly fine objection, especially in particular training paradigms, but it's not the objection I'm going to focus on here. And I also do have, in the other work on misalignment in general, some thoughts about why we might expect goal directedness of this kind. In particular, I think goal directedness is useful for many tasks. Finally, I want to flag that in addition to potentially posing existential risks...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On green, published by Joe Carlsmith on March 21, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far. Warning: spoilers for Yudkowsky's "The Sword of the Good.") "The Creation" by Lucas Cranach (image source here) The colors of the wheel I've never been big on personality typologies. I've heard the Myers-Briggs explained many times, and it never sticks. Extraversion and introversion, E or I, OK. But after that merciful vowel - man, the opacity of those consonants, NTJ, SFP... And remind me the difference between thinking and judging? Perceiving and sensing? N stands for intuition? Similarly, the enneagram. People hit me with it. "You're an x!", I've been told. But the faces of these numbers are so blank. And it has so many kinda-random-seeming characters. Enthusiast, Challenger, Loyalist... The enneagram. Presumably more helpful with some memorization... Hogwarts houses - OK, that one I can remember. But again: those are our categories? Brave, smart, ambitious, loyal? It doesn't feel very joint-carving... But one system I've run into has stuck with me, and become a reference point: namely, the Magic the Gathering Color Wheel. (My relationship to this is mostly via somewhat-reinterpreting Duncan Sabien's presentation here, who credits Mark Rosewater for a lot of his understanding. I don't play Magic myself, and what I say here won't necessarily resonate with the way people-who-play-magic think about these colors.) Basically, there are five colors: white, blue, black, red, and green. And each has their own schtick, which I'm going to crudely summarize as: White: Morality. Blue: Knowledge. Black: Power. Red: Passion. Green: ...well, we'll get to green. To be clear: this isn't, quite, the summary that Sabien/Rosewater would give. Rather, that summary looks like this: (Image credit: Duncan Sabien here.) Here, each color has a goal (peace, perfection, satisfaction, etc) and a default strategy (order, knowledge, ruthlessness, etc). And in the full system, which you don't need to track, each has a characteristic set of disagreements with the colors opposite to it... The disagreements. (Image credit: Duncan Sabien here.) And a characteristic set of agreements with its neighbors...[1] The agreements. (Image credit: Duncan Sabien here.) Here, though, I'm not going to focus on the particulars of Sabien's (or Rosewater's) presentation. Indeed, my sense is that in my own head, the colors mean different things than they do to Sabien/Rosewater (for example, peace is less central for white, and black doesn't necessarily seek satisfaction). And part of the advantage of using colors, rather than numbers (or made-up words like "Hufflepuff") is that we start, already, with a set of associations to draw on and dispute. Why did this system, unlike the others, stick with me? I'm not sure, actually. Maybe it's just: it feels like a more joint-carving division of the sorts of energies that tend to animate people. I also like the way the colors come in a star, with the lines of agreement and disagreement noted above. 
And I think it's strong on archetypal resonance. Why is this system relevant to the sorts of otherness and control issues I've been talking about in this series? Lots of reasons in principle. But here I want to talk, in particular, about green. Gestures at green "I love not Man the less, but Nature more..." ~ Byron What is green? Sabien discusses various associations: environmentalism, tradition, family, spirituality, hippies, stereotypes of Native Americans, Yo...
You can check out the video version of this episode on YouTube at https://www.youtube.com/watch?v=GFEb8ICWJQQ
In this episode of our new podcast project, Matt Reardon, Bella Forristal, and Arden Koehler sit down with Joel Becker to find out what's great about America, what was not so great about FTX, and invite you to port back to the time when Arden might have been CEO.
You can learn more about Joel's current projects at https://joel-becker.com/ and follow him on Twitter at https://twitter.com/joel_bkr
Here's where to find episode #100 of The 80,000 Hours Podcast on dealing with anxiety, depression, and imposter syndrome: https://80000hours.org/podcast/episodes/depression-anxiety-imposter-syndrome/
Matt also beseeches you to listen to Joe Carlsmith already: https://joecarlsmith.substack.com/archive
You can also find Joel's best friend and Matt's former flatmate Mr. Mushu at https://www.instagram.com/its_mr.mushu/
Further topics include:
All possible reasons militate in favour of moving to America
American politics being justified in its savagery
Joel's review of our famous podcast on depression and anxiety
The FTX fellows and visitors programme in the Bahamas
The [vampiric?] energy of America
Is sports net negative?
Living in filth and eschewing the typical mind fallacy
How much should we be working on our meta-preferences?
Arden on being CEO of 80,000 Hours
Ruminating on the powers of non-aphantastics
Why do kids want to draw?
Are people sleeping on crayons?
Our favourite foundational EA materials
Joe Carlsmith is a writer, researcher, and philosopher. He works as a senior research analyst at Open Philanthropy, where he focuses on existential risk from advanced artificial intelligence. He also writes independently about various topics in philosophy and futurism, and holds a doctorate in philosophy from the University of Oxford. You can find links and a transcript at www.hearthisidea.com/episodes/carlsmith
In this episode we talked about a report Joe recently authored, titled ‘Scheming AIs: Will AIs fake alignment during training in order to get power?'. The report “examines whether advanced AIs that perform well in training will be doing so in order to gain power later”; a behaviour Carlsmith calls scheming. We talk about:
Distinguishing ways AI systems can be deceptive and misaligned
Why powerful AI systems might acquire goals that go beyond what they're trained to do, and how those goals could lead to scheming
Why scheming goals might perform better (or worse) in training than less worrying goals
The ‘counting argument' for scheming AI
Why goals that lead to scheming might be simpler than the goals we intend
Things Joe is still confused about, and research project ideas
You can get in touch through our website or on Twitter. Consider leaving us an honest review wherever you're listening to this — it's the best free way to support the show. Thanks for listening!
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful is "AI Control" as a framing on AI X-Risk?, published by habryka on March 14, 2024 on LessWrong. Redwood (where Ryan works) recently released a series of blogposts proposing a framing and research agenda for reducing AI-risk that focuses on ensuring safety (and secondarily usefulness) under the conservative assumption that AIs are misaligned and actively scheming against human interests, under the name "AI Control". This is in contrast to other work on AI risk which focuses on reducing the probability that AI systems pursue goals that are in conflict with human values in the first place (which might include having it not pursue goals in the relevant sense at all), usually called "AI Alignment". In other words, control aims to ensure that even if your models are actively misaligned, you'll be safe, because they are not capable of subverting your safety measures. In this dialogue we dig into our disagreements on the degree to which this kind of work seems promising, and whether/how this reframing opens up new avenues for valuable research and engineering projects. In the context of this dialogue, we'll use the word "scheming" in the same way as used in Joe Carlsmith's recent report: scheming is when AIs perform well (and look aligned) in training and evaluations in order to gain power later. This is also called deceptive alignment.
The Case for Control Work
Let's start by me restating the very basic case for control work in my own words, and then give some high-level takes on where I expect things to break by my own models. The aim of control work is to:
Get useful work out of AIs even if they are situationally aware and would like to pursue unbounded objectives in a relatively consequentialist way
Set things up so that we get to learn from any failures in the above process, both societally and technically. I.e. a large fraction of important work in the context of control is to catch treacherous turns and leverage each case for both a better technical and societal response
In some sense this doesn't really capture an existential success story yet. Like, cool, we can now use the AIs to make more stuff and be richer. That's nice, but seems like by-default at some point our control schemes will fail as the AIs get more competent, unless we do one of the following:
Use the controlled AIs to help us be less confused about AI Alignment and come up with a plan that solves the full AI X-risk problem (in some sense punting the problem to future human + controlled AI systems)
Have a specific story about a task that we want to use the controlled AIs for that we think will then allow us to control much smarter systems (maybe not literal galaxy-brains, but like smart enough that just punting the problem to future humans + controlled AI systems seems like a pretty sure bet)
Have a specific story about a task that we want to use the AIs for that makes coordinating on how careful to be with building AI much easier. This could be leveraging the AI systems themselves to make very scary demos, or some better way of eliciting preferences from the world's population in a way that allows for better coordinated action. Then humans and AIs can have much more time to figure out how to solve this problem.
So another major part of working on victory via control is to study and figure out how to use controlled AIs to do one of the three above.
Does this seem like a non-crazy summary? Yes. It's worth noting that parts of your summary are applicable to various strategies which aren't just control. E.g., sometimes people talk about avoiding misalignment in human-ish level systems and then using these systems to do various useful things. (See e.g. the OpenAI superalignment plan.) So there are kinda two components: Control to ensure safety and usefulness ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Be More Katja, published by Nathan Young on March 12, 2024 on LessWrong. Katja is widely respected amongst the rationalists and, according to Hive, she is one of the most followed/respected EA accounts[1]. But she doesn't give off the same vibe as many impact olympians. She doesn't have iron self-will, nor does she manage a huge team. She hasn't got all the facts at her fingertips. But she has got something, I'm confident of that. How can I be more like her? To understand her impact, let's consider the top things she's done:
She ran surveys on AI researchers well before they were needed and has continued to run them
She wrote an early blog on how we could slow down AI. This blog, I've heard, played a part in encouraging the Musk AI letter, which in turn inspired the "Existential Risks" AI letter.
She thought about AI long before it was vogue, since about 2010
She has a large track record of predictions
These actions seem impactful to me. And I guess someone should have paid $10mn in hindsight for the first 2, maybe more. To me, Katja has a very low tolerance for incomplete stories. When she sees something that she doesn't quite understand or that seems a bit off she struggles to pretend otherwise, so she says "how does that work?". She doesn't accept handwaving when discussing something, whether it be the simulation argument, how efficient flight is or the plot of Dune, part 2[2]. She wants an unbroken chain of arguments she can repeat[3]. She also doesn't mind admitting she doesn't know the answer. In her living room she will turn to her friend Joe Carlsmith and ask "Wait, why are we worried about AI, again?" even though she's been thinking about this for 15 years. Because at that moment it doesn't fit for her, and she has a high tolerance for embarrassment[4] when it comes to truth. There is a deep resolve here - she doesn't get it, so she will ask until she does. She works on the most important thing, slowly. If you are Elon Musk, maybe you can just work all the time. But I am not. And much as I love her, neither is Katja. She does not get an abnormal amount of work done per day. Instead, month in, month out, Katja works on what she thinks is most important. And eventually she gets the survey done, years ahead of when it's needed. There are lessons we can take from this. Just as we often talk about learning to code, or task management, I can become better at saying "wait that doesn't work". Here are some strategies that let me be more like Katja:
Write it down - it's harder to fool myself into thinking something makes sense if I have to read it rather than speak it
"What is the best thing? How do I do that?" - this is hard to put into practice but an underrated prompt
Give more examples - one feature of Katja's writing is she loves to list things. I think more people should list every example in favour of their argument and every counterexample they can think of. Spend 5 minutes on each.
Can I make a forecast of that? - I find fatebook.io useful for this. As I forecast more I learn how poor my judgement is. And I think it's improving.
Know when I am not capable - Katja is good at knowing when something is beyond her. When she hasn't thought about something or when it's a quantitative problem and she hasn't worked on it carefully enough. She doesn't always like the hard work but she knows when it needs to be done.
If you have the right answer you can afford to be slow - in a world of often lurching acceleration, it's easy to forget that if I just knew the right thing, then I could probably take years over it. More output is (usually) more better, but so is more accuracy. Have a distinction between what you currently understand and what you'd bet on - If Peter Wildeford and I disagree, he's probably right, but that doesn't mean I now understand. It is worth tracking...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Review of EA Global Bay Area 2024 (Global Catastrophic Risks), published by frances lorenz on March 1, 2024 on The Effective Altruism Forum. EA Global: Bay Area (Global Catastrophic Risks) took place February 2-4. We hosted 820 attendees, 47 of whom volunteered over the weekend to help run the event. Thank you to everyone who attended and a special thank you to our volunteers - we hope it was a valuable weekend!
Photos and recorded talks
You can now check out photos from the event. Recorded talks, such as the media panel on impactful GCR communication, Tessa Alexanian's talk on preventing engineered pandemics, Joe Carlsmith's discussion of scheming AIs, and more, are now available on our Youtube channel.
A brief summary of attendee feedback
Our post-event feedback survey received 184 responses. This is lower than our average completion rate - we're still accepting feedback responses and would love to hear from all our attendees. Each response helps us get better summary metrics and we look through each short answer. To submit your feedback, you can visit the Swapcard event page and click the Feedback Survey button. The survey link can also be found in a post-event email sent to all attendees with the subject line, "EA Global: Bay Area 2024 | Thank you for attending!"
Key metrics
The EA Global team uses several key metrics to estimate the impact of our events. These metrics, and the questions we use in our feedback survey to measure them, include:
Likelihood to recommend (How likely is it that you would recommend EA Global to a friend or colleague with similar interests to your own? Discrete scale from 0 to 10, 0 being not at all likely and 10 being extremely likely)
Number of new connections[1] (How many new connections did you make at this event?)
Number of impactful connections[2] (Of those new connections, how many do you think might be impactful connections?)
Number of Swapcard meetings per person (This data is pulled from Swapcard)
Counterfactual use of attendee time (To what extent was this EA Global a good use of your time, compared to how you would have otherwise spent your time? A discrete scale ranging from "a waste of my time" to "10x the counterfactual")
The likelihood to recommend for this event was higher compared to last year's EA Global: Bay Area and our EA Global 2023 average (i.e. the average across the three EA Globals we hosted in 2023) (see Table 1). Number of new connections was slightly down compared to the 2023 average, while the number of impactful connections was slightly up. The counterfactual use of time reported by attendees was slightly higher overall than Boston 2023 (the first EA Global we used this metric at), though there was also an increase in the number of reports that the event was a worse use of attendees' time (see Figure 1).
Metric (average of all respondents) | EAG BA 2024 (GCR) | EAG BA 2023 | EAG 2023 average
Likelihood to recommend (0 - 10) | 8.78 | 8.54 | 8.70
Number of new connections | 9.05 | 11.5 | 9.72
Number of impactful connections | 4.15 | 4.8 | 4.09
Swapcard meetings per person | 6.73 | 5.26 | 5.24
Table 1. A summary of key metrics from the post-event feedback surveys for EA Global: Bay Area 2024 (GCRs), EA Global: Bay Area 2023, and the average from all three EA Globals hosted in 2023.
Feedback on the GCRs focus 37% of respondents rated this event more valuable than a standard EA Global, 34% rated it roughly as valuable and 9% as less valuable. 20% of respondents had not attended an EA Global event previously (Figure 2). If the event had been a regular EA Global (i.e. not focussed on GCRs), most respondents predicted they would have still attended. To be more precise, approximately 90% of respondents reported having over 50% probability of attending the event in the absence of a GCR ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Counting arguments provide no evidence for AI doom, published by Nora Belrose on February 27, 2024 on The AI Alignment Forum. Crossposted from the AI Optimists blog. AI doom scenarios often suppose that future AIs will engage in scheming - planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world domination to ensure humans do not interfere with its plans, whatever they may be. In this essay, we debunk the counting argument - a central reason to think AIs might become schemers, according to a recent report by AI safety researcher Joe Carlsmith.[1] It's premised on the idea that schemers can have "a wide variety of goals," while the motivations of a non-schemer must be benign by definition. Since there are "more" possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith's words: The non-schemer model classes, here, require fairly specific goals in order to get high reward. By contrast, the schemer model class is compatible with a very wide range of (beyond episode) goals, while still getting high reward… In this sense, there are "more" schemers that get high reward than there are non-schemers that do so. So, other things equal, we should expect SGD to select a schemer. Scheming AIs, page 17 We begin our critique by presenting a structurally identical counting argument for the obviously false conclusion that neural networks should always memorize their training data, while failing to generalize to unseen data. Since the premises of this parody argument are actually stronger than those of the original counting argument, this shows that counting arguments are generally unsound in this domain. We then diagnose the problem with both counting arguments: they rest on an incorrect application of the principle of indifference, which says that we should assign equal probability to each possible outcome of a random process. The indifference principle is controversial, and is known to yield absurd and paradoxical results in many cases. We argue that the principle is invalid in general, and show that the most plausible way of resolving its paradoxes also rules out its application to an AI's behaviors and goals. More generally, we find that almost all arguments for taking scheming seriously depend on unsound indifference reasoning. Once we reject the indifference principle, there is very little reason left to worry that future AIs will become schemers. The counting argument for overfitting Counting arguments often yield absurd conclusions. For example: Neural networks must implement fairly specific functions in order to generalize beyond their training data. By contrast, networks that overfit to the training set are free to do almost anything on unseen data points. In this sense, there are "more" models that overfit than models that generalize. So, other things equal, we should expect SGD to select a model that overfits. This isn't a merely hypothetical argument. Prior to the rise of deep learning, it was commonly assumed that models with more parameters than data points would be doomed to overfit their training data. 
The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and "almost all" such polynomials are terrible at extrapolating to unseen points. Let's see what the overfitting argument predicts in a simple real-world example from Caballero et al. (2022), where a neural network is trained to solve 4-digit addition problems. There are 10,000² = 100,000,000 possible pairs o...
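To make the polynomial-regression illustration above concrete, here is a small sketch in the spirit of that textbook example (my own toy code, not taken from the post or from Caballero et al.): a degree-9 polynomial has enough coefficients to interpolate ten noisy training points exactly, yet it tends to predict unseen points in the same range much worse than a deliberately constrained low-degree fit.

```python
import numpy as np

# Toy sketch of the classic overfitting picture: an exactly-interpolating
# high-degree polynomial versus a constrained low-degree fit.
rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)                       # underlying "true" function

x_train = np.linspace(0.0, 1.0, 10)
y_train = target(x_train) + rng.normal(scale=0.15, size=x_train.shape)

# Degree 9 has 10 coefficients, so it can fit all 10 noisy points exactly.
overfit = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)
modest = np.polynomial.Polynomial.fit(x_train, y_train, deg=3)

x_test = rng.uniform(0.0, 1.0, size=200)               # unseen points in the same range
y_test = target(x_test)

def rmse(model, x, y):
    return float(np.sqrt(np.mean((model(x) - y) ** 2)))

print("degree 9 train/test RMSE:", rmse(overfit, x_train, y_train), rmse(overfit, x_test, y_test))
print("degree 3 train/test RMSE:", rmse(modest, x_train, y_train), rmse(modest, x_test, y_test))
```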
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: On the abolition of man, published by Joe Carlsmith on January 18, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essay can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.) Earlier in this series, I discussed a certain kind of concern about the AI alignment discourse - namely, that it aspires to exert an inappropriate degree of control over the values that guide the future. In considering this concern, I think it's important to bear in mind the aspects of our own values that are specifically focused on pluralism, tolerance, helpfulness, and inclusivity towards values different-from-our-own (I discussed these in the last essay). But I don't think this is enough, on its own, to fully allay the concern in question. Here I want to analyze one version of this concern more directly, and to try to understand what an adequate response could consist in. Tyrants and poultry-keepers Have you read The Abolition of Man, by C.S. Lewis? As usual: no worries if not (I'll summarize it in a second). But: recommended. In particular: The Abolition of Man is written in opposition to something closely akin to the sort of Yudkowskian worldview and orientation towards the future that I've been discussing.[1] I think the book is wrong about a bunch of stuff. At its core, The Abolition of Man is about meta-ethics. Basically, Lewis thinks that some kind of moral realism is true. In particular, he thinks cultures and religions worldwide have all rightly recognized something he calls the Tao - some kind of natural law; a way that rightly reflects and responds to the world; an ethics that is objective, authoritative, and deeply tied to the nature of Being itself. Indeed, Lewis thinks that the content of human morality across cultures and time periods has been broadly similar, and he includes, in the appendix of the book, a smattering of quotations meant to illustrate (though not: establish) this point. "Laozi Riding an Ox by Zhang Lu (c. 1464--1538)" (Image source here) But Lewis notices, also, that many of the thinkers of his day deny the existence of the Tao. Like Yudkowsky, they are materialists, and "subjectivists," who think - at least intellectually - that there is no True Way, no objective morality, but only ... something else. What, exactly? Lewis considers the possibility of attempting to ground value in something non-normative, like instinct. But he dismisses this possibility on familiar grounds: namely, that it fails to bridge the gap between is and ought (the same arguments would apply to Yudkowsky's "volition"). Indeed, Lewis thinks that all ethical argument, and all worthy ethical reform, must come from "within the Tao" in some sense - though exactly what sense isn't fully clear. The least controversial interpretation would be the also-familiar claim that moral argument must grant moral intuition some sort of provisional authority. This part of the book is not, in my opinion, the most interesting part (though: it's an important backdrop). 
Rather, the part I find most interesting comes later, in the final third, where Lewis turns to the possibility of treating human morality as simply another part of nature, to be "conquered" and brought under our control in the same way that other aspects of nature have been. Here Lewis imagines an ongoing process of scientific modernity, in which humanity gains more and more mastery over its environment. In reality, of course, if any one age really attains, by eugenics and scientific education, the power to make its descendants what it pleases, all men who live after it are the pat...
(Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.) Earlier in this series, I discussed a certain kind of concern about the AI alignment discourse – namely, that it aspires to exert an inappropriate degree of control over the values that guide the future. In considering this concern, I think it's important to bear in mind the aspects of our own values that are specifically focused on pluralism, tolerance, helpfulness, and inclusivity towards values different-from-our-own (I discussed these in the last essay). But I don't think this is enough, on its own [...] ---Outline:(01:03) Tyrants and poultry-keepers(12:29) Are we the conditioners?(17:50) Lewis's argument in a moral realist world(23:16) What if the Tao isn't a thing, though?(32:02) Even without the Tao, shaping the future's values need not be tyranny(35:13) Freedom in a naturalistic world(41:53) Does treating values as natural make them natural?(47:04) Naturalists who still value stuff(53:23) What should the conditioners actually do, though?(56:51) On not-brain-washing(01:04:15) On influencing the values of not-yet-existing agentsThe original text contained 37 footnotes which were omitted from this narration. --- First published: January 18th, 2024 Source: https://forum.effectivealtruism.org/posts/EHL9QJaEwHxNJXaNW/on-the-abolition-of-man --- Narrated by TYPE III AUDIO.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Being nicer than Clippy, published by Joe Carlsmith on January 17, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search "Joe Carlsmith Audio" on your podcast app. This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a summary of the essays that have been released thus far.) In my last essay, I discussed a certain kind of momentum, in some of the philosophical vibes underlying the AI risk discourse,[1] towards deeming more and more agents - including: human agents - "misaligned" in the sense of: not-to-be-trusted to optimize the universe hard according to their values-on-reflection. We can debate exactly how much mistrust to have in different cases, here, but I think the sense in which AI risk issues can extend to humans, too, can remind us of the sense in which AI risk is substantially (though, not entirely) a generalization and intensification of the sort of "balance of power between agents with different values" problem we already deal with in the context of the human world. And I think it may point us towards guidance from our existing ethical and political traditions, in navigating this problem, that we might otherwise neglect. In this essay, I try to gesture at a part of these traditions that I see as particularly important: namely, the part that advises us to be "nicer than Clippy" - not just in what we do with spare matter and energy, but in how we relate to agents-with-different-values more generally. Let me say more about what I mean. Utilitarian vices As many have noted, Yudkowsky's paperclip maximizer looks a lot like total utilitarian. In particular, its sole aim is to "tile the universe" with a specific sort of hyper-optimized pattern. Yes, in principle, the alignment worry applies to goals that don't fit this schema (for example: "cure cancer" or "do god-knows-whatever kludge of weird gradient-descent-implanted proxy stuff"). But somehow, especially in Yudkowskian discussions of AI risk, the misaligned AIs often end up looking pretty utilitarian-y, and a universe tiled with something - and in particular, "tiny-molecular-blahs" - often ends seeming like a notably common sort of superintelligent Utopia. What's more, while Yudkowsky doesn't think human values are utilitarian, he thinks of us (or at least, himself) as sufficiently galaxy-eating that it's easy to round off his "battle of the utility functions" narrative into something more like a "battle of the preferred-patterns" - that is, a battle over who gets to turn the galaxies into their favored sort of stuff. ChatGPT imagines "tiny molecular fun." But actually, the problem Yudkowsky talks about most - AIs killing everyone - isn't actually a paperclips vs. Fun problem. It's not a matter of your favorite uses for spare matter and energy. Rather, it's something else. Thus, consider utilitarianism. A version of human values, right? Well, one can debate. But regardless, put utilitarianism side-by-side with paperclipping, and you might notice: utilitarianism is omnicidal, too - at least in theory, and given enough power. Utilitarianism does not love you, nor does it hate you, but you're made of atoms that it can use for something else. 
In particular: hedonium (that is: optimally-efficient pleasure, often imagined as running on some optimally-efficient computational substrate). But notice: did it matter what sort of onium? Pick your favorite optimal blah-blah. Call it Fun instead if you'd like (though personally, I find the word "Fun" an off-putting and under-selling summary of Utopia). Still, on a generalized utilitarian vibe, that blah-blah is going to be a way more optimal use of atoms, energy, etc than all those squishy inefficient human bodies. The...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An even deeper atheism, published by Joe Carlsmith on January 11, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far. Minor spoilers for Game of Thrones.) In my last essay, I discussed Robin Hanson's critique of the AI risk discourse - and in particular, the accusation that this discourse "others" the AIs, and seeks too much control over the values that steer the future. I find some aspects of Hanson's critique uncompelling and implausible, but I do think he's pointing at a real discomfort. In fact, I think that when we bring certain other Yudkowskian vibes into view - and in particular, vibes related to the "fragility of value," "extremal Goodhart," and "the tails come apart" - this discomfort should deepen yet further. In this essay I explain why. The fragility of value Engaging with Yudkowsky's work, I think it's easy to take away something like the following broad lesson: "extreme optimization for a slightly-wrong utility function tends to lead to valueless/horrible places." Thus, in justifying his claim that "any Future not shaped by a goal system with detailed reliable inheritance from human morals and metamorals, will contain almost nothing of worth," Yudkowsky argues that value is "fragile." There is more than one dimension of human value, where if just that one thing is lost, the Future becomes null. A single blow and all value shatters. Not every single blow will shatter all value - but more than one possible "single blow" will do so. For example, he suggests: suppose you get rid of boredom, and so spend eternity "replaying a single highly optimized experience, over and over and over again." Or suppose you get rid of "contact with reality," and so put people into experience machines. Or suppose you get rid of consciousness, and so make a future of non-sentient flourishing. Now, as Katja Grace points out, these are all pretty specific sorts of "slightly different."[1] But at times, at least, Yudkowsky seems to suggest that the point generalizes to many directions of subtle permutation: "if you have a 1000-byte exact specification of worthwhile happiness, and you begin to mutate it, the value created by the corresponding AI with the mutated definition falls off rapidly." ChatGPT imagines "slightly mutated happiness." Can we give some sort of formal argument for expecting value fragility of this kind? The closest I've seen is the literature on "extremal Goodhart" - a specific variant of Goodhart's law (Yudkowsky gives his description here).[2] Imprecisely, I think the thought would be something like: even if the True Utility Function is similar enough to the Slightly-Wrong Utility Function to be correlated within a restricted search space, extreme optimization searches much harder over a much larger space - and within that much larger space, the correlation between the True Utility and the Slightly-Wrong Utility breaks down, such that getting maximal Slightly-Wrong Utility is no update about the True Utility. 
Rather, conditional on maximal Slightly-Wrong Utility, you should expect the mean True Utility for a random point in the space. And if you're bored, in expectation, by a random point in the space (as Yudkowsky is, for example, by a random arrangement of matter and energy in the lightcone), then you'll be disappointed by the results of extreme but Slightly-Wrong optimization. Now, this is not, in itself, any kind of airtight argument that any utility function subject to extreme and unchecked optimization pressure has to be exactly right. But ami...
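The "extremal Goodhart" reasoning above can be made tangible with a small Monte Carlo sketch. This is my own toy formalization, not Yudkowsky's or Carlsmith's: treat the true utility V of each option as a standard normal draw and the slightly-wrong proxy as W = V + error. When the error is light-tailed, the proxy-optimal option still has high true utility; when the error is heavy-tailed, the option that maximizes the proxy tends to have roughly average true utility, matching the "no update" intuition in the excerpt.

```python
import numpy as np

# Toy Monte Carlo (my own illustration): how much "true utility" do you get
# when you pick the option that maximizes a slightly-wrong proxy?
rng = np.random.default_rng(0)
n_options, n_trials = 100_000, 200

def true_value_at_proxy_argmax(error_sampler):
    picked = []
    for _ in range(n_trials):
        v = rng.normal(size=n_options)                 # true utility of each option
        w = v + error_sampler(n_options)               # slightly-wrong proxy utility
        picked.append(v[np.argmax(w)])                 # true utility of the proxy-optimal option
    return float(np.mean(picked))

light_tailed = lambda n: rng.normal(scale=0.5, size=n)   # proxy stays informative at the extreme
heavy_tailed = lambda n: rng.standard_t(df=2, size=n)    # extreme proxy scores driven by error

print("light-tailed error:", true_value_at_proxy_argmax(light_tailed))   # well above average
print("heavy-tailed error:", true_value_at_proxy_argmax(heavy_tailed))   # close to the mean (0)
```

Whether real optimization targets behave more like the light-tailed or the heavy-tailed case is, roughly, the substantive question the excerpt is gesturing at.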
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does AI risk "other" the AIs?, published by Joe Carlsmith on January 10, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app. This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the essays that have been released thus far.) In my last essay, I discussed the way in which what I've called "deep atheism" (that is, a fundamental mistrust towards both "Nature" and "bare intelligence") can prompt an aspiration to exert extreme levels of control over the universe; I highlighted the sense in which both humans and AIs, on Yudkowsky's AI risk narrative, are animated by this sort of aspiration; and I discussed some ways in which our civilization has built up wariness around control-seeking of this kind. I think we should be taking this sort of wariness quite seriously. In this spirit, I want to look, in this essay, at Robin Hanson's critique of the AI risk discourse - a critique especially attuned to the way in which this discourse risks control-gone-wrong. In particular, I'm interested in Hanson's accusation that AI risk "others" the AIs (see e.g. here, here, and here). Hearing the claim that AIs may eventually differ greatly from us, and become very capable, and that this could possibly happen fast, tends to invoke our general fear-of-difference heuristic. Making us afraid of these "others" and wanting to control them somehow ... "Hate" and "intolerance" aren't overly strong terms for this attitude.[1] Hanson sees this vice as core to the disagreement ("my best one-factor model to explain opinion variance here is this: some of us 'other' the AIs more"). And he invokes a deep lineage of liberal ideals in opposition. I think he's right to notice a tension in this vicinity. AI risk is, indeed, about fearing some sort of uncontrolled other. But is that always the bad sort of "othering?"
Some basic points up front
Well, let's at least avoid basic mistakes/misunderstandings. For one: hardcore AI risk folks like Yudkowsky are generally happy to care about AI welfare - at least if welfare means something like "happy sentience." And pace some of Hanson's accusations of bio-chauvinism, these folks are extremely not fussed about the fact that AI minds are made of silicon (indeed: come now). Of course, this isn't to say that AI welfare (and AI rights) issues don't get complicated (see e.g. here and here for a glimpse of some of the complications), or that humanity as a whole will get the "digital minds matter" stuff right. Indeed, I worry that we will get it horribly wrong - and I do think that the AI risk discourse under-attends to some of the tensions. But species-ism 101 (201?) - e.g., "I don't care about digital suffering" - isn't AI risk's vice. For two: clearly some sorts of otherness warrant some sorts of fear. For example: maybe you, personally, don't like to murder. But Bob, well: Bob is different. If Bob gets a bunch of power, then: yep, it's OK to hold your babies close. And often OK, too, to try to "control" Bob into not-killing-your-babies. Cf, also, the discussion of getting-eaten-by-bears in the first essay. And the Nazis, too, were different in their own way.
Of course, there's a long and ongoing history of mistaking "different" for "the type of different that wants to kill your babies." We should, indeed, be very wary. But liberal tolerance has never been a blank check; and not all fear is hatred. Indeed, many attempts to diagnose the ethical mistake behind various canonical difference-related vices (racism, sexism, species-ism, etc) reveal a certain shallowness of commitment to difference-per-se. In particular: such vices are often understood as missing...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When "yang" goes wrong, published by Joe Carlsmith on January 8, 2024 on LessWrong. (Cross-posted from my website. Podcast version here, or search "Joe Carlsmith Audio" on your podcast app. This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a brief summary of the essays that have been released thus far.) Becoming God In my last essay, I wrote about "deep atheism" - a fundamental mistrust towards Nature, and towards bare intelligence. I took Eliezer Yudkowsky as a paradigmatic deep atheist, and I tried to highlight the connection between his deep atheism and his concern about misaligned AI. I'm sympathetic to many aspects of Yudkowsky's view. I'm a shallow atheist, too; I'm skeptical of moral realism, too; and I, too, aspire to be a scout, and to look at hard truths full on. What's more, I find Yudkowsky's brand of deep-but-still-humanistic atheism more compelling, as an existential orientation, than many available alternatives. And I share Yudkowsky's concern about AI risk. Indeed, it was centrally him, and others thinking along similar lines, who first got me worried. But I also want to acknowledge and examine some difficult questions that a broadly Yudkowskian existential orientation can raise, especially in the context of AGI. In particular: a lot of the vibe here is about mistrust towards the yang of the Real, that uncontrolled Other. And it's easy to move from this to a desire to take stuff into the hands of your own yang; to master the Real until it is maximally controlled; to become, you know, God - or at least, as God-like as possible. You've heard it before - it's an old rationalist dream. And let's be clear: it's alive and well. But even with theism aside, many of the old reasons for wariness still apply. Moloch and Stalin As an example of this becoming-God aspiration, consider another influential piece of rationalist canon: Scott Alexander's "Meditations on Moloch." Moloch, for Alexander, is the god of uncoordinated competition; and fear of Moloch is its own, additional depth of atheism. Maybe you thought you could trust evolution, or free markets, or "spontaneous order," or the techno-capital machine. But oops, no: those gods just eat you too. Now, to really assess this story, we at least need to look more closely at various empirical questions - for example, about exactly how uncompetitive different sorts of goodness are, even in the limit;[2] about how much coordination to expect, by default, from greater-than-human intelligence;[3] and about where our specific empirical techno-capitalist machine will go, if you "let 'er rip."[4] And indeed, Alexander himself often seems to soften his atheism about goodness ("Elua"), and to suggest that it has some mysterious but fearsome power of its own, which you can maybe, just a little bit, start to trust in. "Somehow Elua is still here. No one knows exactly how. And the gods who oppose Him tend to find Themselves meeting with a surprising number of unfortunate accidents." Goodness, for Alexander, is devious and subtle; it's actually a terrifying unspeakable Elder God after all. 
Of course, if goodness is just another utility function, just another ranking-over-worlds, it's unclear where it would get such a status, especially if it's meant to have an active advantage over e.g. maximize-paperclips, or maximize-power. But here, and in contrast to Yudkowsky, Alexander nevertheless seems to invite some having-a-parent; some mild sort of yin. More on this in a later essay. Ultimately, though, Alexander's solution to Moloch is heavy on yang. In his words: "So let me confess guilt to one of Hurlock's accusations: I am a transhumanist and I really do want to rule the universe. Not personally - I mean, I wouldn't ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Catching AIs red-handed, published by Ryan Greenblatt on January 5, 2024 on The AI Alignment Forum. We've been spending a lot of time recently thinking about how to mitigate risks posed by scheming (a.k.a. deceptively aligned) models intentionally subverting your safety techniques in order to cause an unacceptable outcome. We empirically investigated techniques that are robust to intentional subversion in our recent paper. In this post, we'll discuss a crucial dynamic related to risk from scheming models. Suppose your model is trying to exfiltrate its weights or hack the datacenter or cause some other difficult-to-reverse bad outcome. (We'll refer to all of these as the model "escaping", because escape is linguistically convenient and is a good description of many of these difficult-to-reverse bad outcomes.) Our claim: once you've caught your models red-handed trying to escape, the situation changes substantially, in ways that are bad for the schemer's chances of success. Catching the model red-handed changes the situation in several ways: Most obviously, once you catch an escape attempt, you know for sure that your models are trying to subvert your safety techniques; it's plausible that, faced with such obvious evidence of misalignment, labs will stop using their models. We refer to planning on shutting down if models are caught red-handed as following a catch-then-shut-down plan. But even if you can't or won't stop deploying misaligned models, we think that catching escape attempts enables a bunch of promising-seeming approaches for preventing future escape attempts; we call the problem of preventing future escape attempts given a few examples few-shot catastrophe prevention. This dynamic has several implications for risk from scheming models. Most importantly: Because we expect that the model's chances of escape will fall after it gets caught even a small number of times, we think it is overall much harder for the model to escape. Many people have observed that a core difficulty of the alignment problem is that your model might be able to cause irreversible unacceptable outcomes if your safety measures fail even once at a bad time. But the same argument goes the other way as well - when the model makes an escape attempt, it faces the risk of getting caught and substantially reducing its chances of eventually escaping. So it's a high-stakes situation for both sides: both sides desperately want to succeed on the "first critical try". When thinking about plans for safely deploying powerful models, we think safety researchers should mostly focus on the lead-up to the first time the model gets caught trying to escape. They should be ensuring the property "if it's scheming, the model won't successfully cause a catastrophe before we catch it". That is, it's acceptable if either no catastrophic outcomes occur (even if the AI never gets caught), or the AI gets caught. So when they're analyzing whether their protocols ensure safety, they should count catching the model red-handed as a win condition for their protocol. We think that some researchers should probably work on techniques for improving safety given a few examples of the models trying to escape. We call this problem few-shot catastrophe prevention. In the rest of this post, we'll elaborate on the above points. 
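To make the "win condition" framing concrete, here is a minimal Monte Carlo sketch - my own toy model, with made-up per-attempt probabilities (the names p_success and p_caught and all the numbers are assumptions, not figures from the post). Each escape attempt either succeeds, gets caught (which, under a catch-then-shut-down plan, ends the deployment in the lab's favor), or fails unnoticed, in which case the model can simply try again later.

```python
# Toy model, purely illustrative: all probabilities below are made up.
import random

def run_deployment(p_success=0.02, p_caught=0.30, n_attempts=50, rng=None):
    """Simulate one deployment of a scheming model under catch-then-shut-down."""
    rng = rng or random.Random()
    for _ in range(n_attempts):
        r = rng.random()
        if r < p_success:
            return "escaped"   # irreversible bad outcome
        if r < p_success + p_caught:
            return "caught"    # win condition for the lab: deployment stops here
        # otherwise: the attempt fizzles unnoticed and the model can try again
    return "neither"

rng = random.Random(0)
trials = 100_000
counts = {"escaped": 0, "caught": 0, "neither": 0}
for _ in range(trials):
    counts[run_deployment(rng=rng)] += 1
print({k: v / trials for k, v in counts.items()})

# Baseline with no detection at all: the chance of an eventual escape over
# 50 attempts would be 1 - (1 - p_success)^50, i.e. roughly 64%.
print("no-catching baseline:", 1 - (1 - 0.02) ** 50)
```

Under these made-up numbers, the possibility of getting caught cuts the schemer's overall chance of escaping from roughly 64% to roughly 6% - a crude illustration of the sense in which escape attempts are high-stakes for the model, not just for the lab.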
By far the biggest section of this post is elaborating on strategies for few-shot catastrophe prevention, because little has been previously said about them. Thanks to Henry Sleight, Zach Stein-Perlman, Carl Shulman, Aaron Scher, Owen Cotton-Barratt, Joe Carlsmith, Aryan Bhatt, Joshua Clymer, and Fabien Roger for helpful comments. What's "catching"? By "the model gets caught", we mean that we see the AI taking actions that are strong direct evidence for the AI being a schemer[1]. A representative example is catchi...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Deep atheism and AI risk, published by Joe Carlsmith on January 4, 2024 on LessWrong. (Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app. This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.) In my last essay, I talked about the possibility of "gentleness" towards various non-human Others - for example, animals, aliens, and AI systems. But I also highlighted the possibility of "getting eaten," in the way that Timothy Treadwell gets eaten by a bear in Herzog's Grizzly Man: that is, eaten in the midst of an attempt at gentleness. Herzog accuses Treadwell of failing to take seriously the "overwhelming indifference of Nature." And I think we can see some of the discourse about AI risk - and in particular, the strand that descends from the rationalists, and from the writings of Eliezer Yudkowsky in particular - as animated by an existential orientation similar to Herzog's: one that approaches Nature (and also, bare intelligence) with a certain kind of fundamental mistrust. I call this orientation "deep atheism." This essay tries to point at it. Baby-eaters Recall, from my last essay, that dead bear cub, and its severed arm - torn off, Herzog supposes, by a male bear seeking to stop a female from lactating. The suffering of children has always been an especially vivid objection to God's benevolence. Dostoyevsky's Ivan, famously, refuses heaven in protest. And see also, the theologian David Bentley Hart: "In those five-minute patches here and there when I lose faith ... it's the suffering of children that occasions it, and that alone." Yudkowsky has his own version: "baby-eaters." Thus, he ridicules the wishful thinking of the "group selectionists," who predicted/hoped that predator populations would evolve an instinct to restrain their breeding in order to conserve the supply of prey. Indeed, Yudkowsky made baby-eating a central sin in the story "Three Worlds Collide," in which humans encounter a crystalline, insectile alien species that eats their own (sentient, suffering) children. And this behavior is a core, reflectively-endorsed feature of the alien morality - one that they did not alter once they could. The word "good," in human language, translates as "to eat children," in theirs. And Yudkowsky points to less fictional/artificial examples of Nature's brutality as well. For example, the parasitic wasps that put Darwin in problems-of-evil mode[2] (see here, for nightmare-ish, inside-the-caterpillar imagery of the larvae eating their way out from the inside). Or the old elephants who die of starvation when their last set of teeth falls out. Part of the vibe, here, is that old (albeit: still-underrated) thing, from Tennyson, about the color of nature's teeth and claws. Dawkins, as often, is eloquent: The total amount of suffering per year in the natural world is beyond all decent contemplation. 
During the minute it takes me to compose this sentence, thousands of animals are being eaten alive; others are running for their lives, whimpering with fear; others are being slowly devoured from within by rasping parasites; thousands of all kinds are dying of starvation, thirst and disease. Indeed: maybe, for Hart, it is the suffering of human children that most challenges God's goodness. But I always felt that wild animals were the simpler case. Human children live, more, in the domain of human choices, and thus, of the so-called "free will defense," according to which God gave us freedom, and freedom gave us evil, and it's all worth it. "The Forest Fire," by Piero di Cosimo. (Image source here.) ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gentleness and the artificial Other, published by Joe Carlsmith on January 2, 2024 on LessWrong. (Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app. This is the first essay in a series that I'm calling "Otherness and control in the age of AGI." See here for more about the series as a whole.) When species meet The most succinct argument for AI risk, in my opinion, is the "second species" argument. Basically, it goes like this. Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans. Conclusion: That's scary. To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. "Chimpanzees should be careful about inventing humans." Etc.[2] People often talk about aliens here, too. "What if you learned that aliens were on their way to earth? Surely that's scary." Again, very far from a knock-down case (for example: we get to build the aliens in question). But it draws on something. In particular, though: it draws on a narrative of interspecies conflict. You are meeting a new form of life, a new type of mind. But these new creatures are presented to you, centrally, as a possible threat; as competitors; as agents in whose power you might find yourself helpless. And unfortunately: yes. But I want to start this series by acknowledging how many dimensions of interspecies-relationship this narrative leaves out, and how much I wish we could be focusing only on the other parts. To meet a new species - and especially, a new intelligent species - is not just scary. It's incredible. I wish it was less a time for fear, and more a time for wonder and dialogue. A time to look into new eyes - and to see further. Gentleness "If I took it in hand, it would melt in my hot tears heavy autumn frost." Basho Have you seen the documentary My Octopus Teacher? No problem if not, but I recommend it. Here's the plot. Craig Foster, a filmmaker, has been feeling burned out. He decides to dive, every day, into an underwater kelp forest off the coast of South Africa. Soon, he discovers an octopus. He's fascinated. He starts visiting her every day. She starts to get used to him, but she's wary. One day, he's floating outside her den. She's watching him, curious, but ready to retreat. He moves his hand slightly towards her. She reaches out a tentacle, and touches his hand. Soon, they are fast friends. She rides on his hand. She rushes over to him, and sits on his chest while he strokes her. Her lifespan is only about a year. He's there for most of it. He watches her die. A "common octopus" - the type from the film. (Image source here.) Why do I like this movie? It's something about gentleness. Of earth's animals, octopuses are a paradigm intersection of intelligence and Otherness. Indeed, when we think of aliens, we often draw on octopuses. Foster seeks, in the midst of this strangeness, some kind of encounter. But he does it so softly. To touch, at all; to be "with" this Other, at all - that alone is vast and wild. The movie has a kind of reverence. Of course, Foster has relatively little to fear, from the octopus. He's still the more powerful party. But: have you seen Arrival? Again, no worries if not. But again, I recommend. 
And in particular: I think it has some of this gentleness, and reverence, and wonder, even towards more-powerful-than-us aliens.[3] Again, a bit of plot. No major spoilers, but: aliens have landed. Yes, they look like octopuses. In one early scene, the scientists go to meet them inside the alien ship. The meeting takes place across some sort of transparent barrier. The aliens make deep, whale-like, textured sounds. But the humans can't speak back. So next time, they bring a whiteboard. T...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Empirical work that might shed light on scheming (Section 6 of "Scheming AIs"), published by Joe Carlsmith on December 11, 2023 on The AI Alignment Forum. This is Section 6 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app. Empirical work that might shed light on scheming I want to close the report with a discussion of the sort of empirical work that might help shed light on scheming.[1] After all: ultimately, one of my key hopes from this report is that greater clarity about the theoretical arguments surrounding scheming will leave us better positioned to do empirical research on it - research that can hopefully clarify the likelihood that the issue arises in practice, catch it if/when it has arisen, and figure out how to prevent it from arising in the first place. To be clear: per my choice to write the report at all, I also think there's worthwhile theoretical work to be done in this space. For example: I think it would be great to formalize more precisely different understandings of the concept of an "episode," and to formally characterize the direct incentives that different training processes create towards different temporal horizons of concern.[2] I think that questions around the possibility/likelihood of different sorts of AI coordination are worth much more analysis than they've received thus far, both in the context of scheming in particular, and for understanding AI risk more generally. Here I'm especially interested in coordination between AIs with distinct value systems, in the context of human efforts to prevent the coordination in question, and for AIs that resemble near-term, somewhat-better-than-human neural nets rather than e.g. superintelligences with assumed-to-be-legible source code. I think there may be interesting theoretical work to do in further characterizing/clarifying SGD's biases towards simplicity/speed, and in understanding the different sorts of "path dependence" to expect in ML training more generally. I'd be interested to see more work clarifying ideas in the vicinity of "messy goal-directedness" and their relevance to arguments about schemers. I think a lot of people have the intuition that thinking of model goal-directedness as implemented by a "big kludge of heuristics" (as opposed to: something "cleaner" and more "rational agent-like") makes a difference here (and elsewhere). But I think people often aren't fully clear on the contrast they're trying to draw, and why it makes a difference, if it does. More generally, any of the concepts/arguments in this report could be clarified and formalized further, other arguments could be formulated and examined, quantitative models for estimating the probability of scheming could be created, and so on. Ultimately, though, I think the empirics are what will shed the most informative and consensus-ready light on this issue. 
So one of my favorite outcomes from someone reading this report would be the reader saying something like: "ah, I now understand the arguments for and against expecting scheming much better, and have had a bunch of ideas for how we can probe the issue empirically" - and then making it happen. Here I'll offer a few high-level suggestions in this vein, in the hopes of prompting future and higher-quality work (designing informative empirical ML experiments is not my area of expertise - and indeed, I'm comparatively ignorant of various parts of the literature relevant to the topics below)....
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Summing up "Scheming AIs" (Section 5), published by Joe Carlsmith on December 9, 2023 on The AI Alignment Forum. This is Section 5 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Summing up I've now reviewed the main arguments I've encountered for expecting SGD to select a schemer. What should we make of these arguments overall? We've reviewed a wide variety of interrelated considerations, and it can be difficult to hold them all in mind at once. On the whole, though, I think a fairly large portion of the overall case for expecting schemers comes down to some version of the "counting argument." In particular, I think the counting argument is also importantly underneath many of the other, more specific arguments I've considered. Thus: In the context of the "training-game-independent proxy goal" argument: the basic worry is that at some point (whether before situational awareness, or afterwards), SGD will land naturally on a (suitably ambitious) beyond-episode goal that incentivizes scheming. And one of the key reasons for expecting this is just that (especially if you're actively training for fairly long-term, ambitious goals) it seems like a very wide variety of goals that fall out of training could have this property. In the context of the "nearest max-reward goal" argument: the basic worry is that because schemer-like goals are quite common in goal-space, some such goal will be quite "nearby" whatever not-yet-max-reward goal the model has at the point it gains situational awareness, and thus, that modifying the model into a schemer will be the easiest way for SGD to point the model's optimization in the highest-reward direction. In the context of the "simplicity argument": the reason one expects schemers to be able to have simpler goals than non-schemers is that they have so many possible goals (or: pointers-to-goals) to choose from. (Though: I personally find this argument quite a bit less persuasive than the counting argument itself, partly because the simplicity benefits at stake seem to me quite small.) That is, in all of these cases, schemers are being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming, thereby making it easier to (a) land on one of them naturally, (b) land "nearby" one of them, or (c) find one of them that is "simpler" than non-schemer goals that need to come from a more restricted space. And in this sense, as I noted in section 4.2, the case for schemers mirrors one of the most basic arguments for expecting misalignment more generally - e.g., that alignment is a very narrow target to hit in goal-space. Except, here, we are specifically incorporating the selection we know we are going to do on the goals in question: namely, they need to be such as to cause models pursuing them to get high reward. And the most basic worry is just that: this isn't enough. 
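One deliberately crude way to write down the skeleton of the counting argument (my toy rendering, not a formalism from the report): if we put something like a uniform measure over the goals compatible with getting high reward in training, then

```latex
P(\text{schemer} \mid \text{high reward})
\;\approx\;
\frac{|\mathcal{G}_{\text{scheme}}|}
     {|\mathcal{G}_{\text{scheme}}| + |\mathcal{G}_{\text{aligned}}| + |\mathcal{G}_{\text{other}}|},
```

where G_scheme is the set of (suitably ambitious) beyond-episode goals that would motivate training-gaming, G_aligned the set of goals we were actually aiming for, and G_other everything else (e.g., reward-on-the-episode seeking). The argument's force rests on |G_scheme| dominating the denominator, and on the implicit measure over reward-compatible goals not being strongly biased against schemers - which is why selecting on high reward alone can fail to rule them out.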
Still, despite your best efforts in training, and almost regardless of your reward signal, almost all the models you might've selected will be getting high reward for instrumental reasons - and specifically, in order to get power. I think this basic argument, in its various guises, is a serious source of concern. If we grant that advanced models will be relevantly goal-directed and situationally-aware, that a wide variety of goals would indeed lead to scheming, and that schemers would perform close-to-optimally in training, then on what grounds, exactly, w...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Speed arguments against scheming (Section 4.4-4.7 of "Scheming AIs"), published by Joe Carlsmith on December 8, 2023 on The AI Alignment Forum. This is Section 4.4-4.7 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Speed arguments Let's turn, now, to arguments against schemers that focus on the final properties of the models in question. Here, the main argument I'm aware of is that schemers need to do extra reasoning that other model classes don't. In particular, I'll distinguish between three sorts of reasoning: Deciding to training-game: The model needs to reason backwards from "I want this beyond-episode thing" to "I should do well in training," in a way that draws on the model's predictions about what doing well in training will entail (e.g., re: goal-guarding, take-over prospects, and so on). Checking whether it's time to escape/take-over: The model needs to check, with adequate frequency and accuracy, whether it's in a position to escape/take-over (and to succeed in this assessment despite human efforts to trap/fool the model into revealing its true colors - for example, by offering it apparent opportunities to defect).[1] (Optional, but relevant to how scary the schemer is) Early undermining: To the extent the model is doing anything (other than training-gaming) to support a take-over/escape prior to escaping human control - for example, inserting backdoors into the code it writes, actively undermining human alignment research, sandbagging, and so on - it will need to be using cognitive resources both in (a) performing the relevant nefarious actions and (b) making sure it gets away with them (e.g., checking that it won't get caught, coming up with excuses, making the cover stories consistent even when under scrutiny, and so on). In the literature on deceptive alignment, these costs are sometimes called "speed costs," as compared with the sort of "complexity costs" discussed in the simplicity analysis above. Here, the frame becomes "schemers are favored by simplicity but disfavored by speed - so which will matter more?"[2] As I mentioned above in the context of simplicity, it's not clear that this is the right breakdown (since e.g. it may be that the relevant forms of reasoning will need to be encoded in the model's weights, or that they will require extra cognitive machinery in general, thereby imposing "complexity costs" of their own), but I'll follow it loosely here. And note, too, that as with simplicity, these "speed costs" can be understood in a "contributors to reward" frame (e.g., having to do this extra reasoning decreases reward) or in an "extra criteria" frame (e.g., perhaps SGD's inductive biases favor "speed" even independent of its contribution to reward). How big are the absolute costs of this extra reasoning? I feel unsure about exactly how costly the relevant forms of extra reasoning will be. 
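In the "contributors to reward" frame, one crude way to pose the "which will matter more?" question (my own toy formalization, not the report's) is as a comparison of two reward deltas:

```latex
\text{SGD favors the schemer over a comparable non-schemer iff}\quad
\underbrace{\Delta R_{\text{freed parameters}}}_{\text{simplicity benefit of a simpler goal}}
\;>\;
\underbrace{\Delta R_{\text{extra reasoning}}}_{\text{speed cost of deciding to game, checking for escape, etc.}}
```

In the "extra criteria" frame, the same comparison would instead be made in whatever units SGD's inductive biases trade simplicity off against speed, independent of any effect on reward.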
My sense, though, is that the people most concerned about schemers tend to expect the costs to be pretty small, at least in absolute terms (call this the "scheming is easy" story). That is, on this story: With respect to the decision to training-game, all the model needs to do is notice that (a) it'll do better by its goals if it avoids modification and gets power later, and (b) that training-gaming is the route to this.[3] With respect to checking for escape/take-over opportunities, the thought is generally that it will be very easy, at least for a sophisticated mod...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Simplicity arguments for scheming (Section 4.3 of "Scheming AIs"), published by Joe Carlsmith on December 7, 2023 on The AI Alignment Forum. This is Section 4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Simplicity arguments The strict counting argument I've described is sometimes presented in the context of arguments for expecting schemers that focus on "simplicity."[1] Let's turn to those arguments now. What is "simplicity"? What do I mean by "simplicity," here? In my opinion, discussions of this topic are often problematically vague - both with respect to the notion of simplicity at stake, and with respect to the sense in which SGD is understood as selecting for simplicity. The notion that Hubinger uses, though, is the length of the code required to write down the algorithm that a model's weights implement. That is: faced with a big, messy neural net that is doing X (for example, performing some kind of induction), we imagine re-writing X in a programming language like python, and we ask how long the relevant program would have to be.[2] Let's call this "re-writing simplicity."[3] Hubinger's notion of simplicity, here, is closely related to measures of algorithmic complexity like "Kolmogorov complexity," which measure the complexity of a string by reference to the length of the shortest program that outputs that string when fed into a chosen Universal Turing Machine (UTM). Indeed, my vague sense is that certain discussions of simplicity in the context of computer science often implicitly assume what I've called "simplicity realism" - a view on which simplicity is, in some deep sense, an objective thing, ultimately independent of e.g. your choice of programming language or UTM, but which different metrics of simplicity are all tracking (albeit, imperfectly). And perhaps this view has merit (for example, my impression is that different metrics of complexity often reach similar conclusions in many cases - though this could have many explanations). However, I don't, personally, want to assume it. And especially absent some objective sense of simplicity, it becomes more important to say which particular sense you have in mind. Another possible notion of simplicity, here, is hazier - but also, to my mind, less theoretically laden. On this notion, the simplicity of an algorithm implemented by a neural network is defined relative to something like the number of parameters the neural network uses to encode the relevant algorithm.[6] That is, instead of imagining re-writing the neural network's algorithm in some other programming language, we focus directly on the parameters the neural network itself is recruiting to do the job, where simpler programs use fewer parameters. Let's call this "parameter simplicity." 
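For reference, the textbook notion behind "re-writing simplicity" can be stated explicitly (a standard definition, not a formalism from the report): the Kolmogorov complexity of a string x relative to a universal Turing machine U is the length of the shortest program that makes U output x, and the invariance theorem says that switching universal machines changes this only by an additive constant:

```latex
K_U(x) \;=\; \min\{\, |p| : U(p) = x \,\},
\qquad
\bigl|\,K_U(x) - K_V(x)\,\bigr| \;\le\; c_{U,V} \;\;\text{for all } x.
```

The constant c_{U,V} is independent of x but can be arbitrarily large, which is roughly why one can find "simplicity realism" tempting (the metrics agree up to a constant) while still wanting to know which language or UTM a given simplicity claim is relative to. None of that machinery is needed for parameter simplicity, which is part of its appeal.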
Exactly how you would measure "parameter simplicity" is a different question, but it has the advantage of removing one layer of theoretical machinery and arbitrariness (e.g., the step of re-writing the algorithm in an arbitrary-seeming programming language), and connecting more directly with a "resource" that we know SGD has to deal with (e.g., the parameters the model makes available). For this reason, I'll often focus on "parameter simplicity" below. I'll also flag a way of talking about "simplicity" that I won't emphasize, and which I think muddies the waters here considerably: namely, equating simplicity fairly directly with "higher prior probability." Thus, for example, faced w...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The counting argument for scheming (Sections 4.1 and 4.2 of "Scheming AIs"), published by Joe Carlsmith on December 6, 2023 on The AI Alignment Forum. This is Sections 4.1 and 4.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Arguments for/against scheming that focus on the final properties of the model Various arguments for/against scheming proceed by comparing the final properties of different model classes (e.g. schemers, training saints, reward-on-the-episode seekers, etc) according to how well they perform according to some set of criteria that we imagine SGD is selecting for. What is SGD selecting for? Well, one obvious answer is: high reward. But various of the arguments I'll consider won't necessarily focus on reward directly. Rather, they'll focus on other criteria, like the "simplicity" or the "speed" of the resulting model. However, we can distinguish between two ways these criteria can enter into our predictions about what sort of model SGD will select. Contributors to reward vs. extra criteria On the first frame, which I'll call the "contributors to reward" frame, we understand criteria like "simplicity" and "speed" as relevant to the model SGD selects only insofar as they are relevant to the amount of reward that a given model gets. That is, on this frame, we're really only thinking of SGD as selecting for one thing - namely, high reward performance - and simplicity and speed are relevant insofar as they're predictive of high reward performance. Thus, an example of a "simplicity argument," given in this frame, would be: "a schemer can have a simpler goal than a training saint, which means that it would be able to store its goal using fewer parameters, thereby freeing up other parameters that it can use for getting higher reward." This frame has the advantage of focusing, ultimately, on something that we know SGD is indeed selecting for - namely, high reward. And it puts the relevance of simplicity and speed into a common currency - namely, contributions-to-reward. By contrast: on the second frame, which I'll call the "extra criteria" frame, we understand these criteria as genuinely additional selection pressures, operative even independent of their impact on reward. That is, on this frame, SGD is selecting both for high reward, and for some other properties - for example, simplicity. Thus, an example of a "simplicity argument," given in this frame, would be: "a schemer and a training saint would both get high reward in training, but a schemer can have a simpler goal, and SGD is selecting for simplicity in addition to reward, so we should expect it to select a schemer." The "extra criteria" frame is closely connected to the discourse about "inductive biases" in machine learning - where an inductive bias, roughly, is whatever makes a learning process prioritize one solution over another other than the observed data (see e.g. Box 2 in Battaglia et al (2018) for more). 
Thus, for example, if two models would perform equally well on the training data, but differ in how they would generalize to an unseen test set, the inductive biases would determine which model gets selected. Indeed, in some cases, a model that performs worse on the training data might get chosen because it was sufficiently favored by the inductive biases (as analogy: in science, sometimes a simpler theory is preferred despite the fact that it provides a worse fit with the data). Ultimately, the differences...
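A minimal sketch of the inductive-bias point in code (my own illustration, with an arbitrary toy dataset - nothing here is from the report): two polynomial models can fit the same training points roughly equally well, but an explicit simplicity penalty (ordinary L2 regularization, one concrete inductive bias) changes which solution gets selected and how it generalizes.

```python
# Toy illustration of an inductive bias: similar training fit, different generalization.
# All numbers and modeling choices are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
x_train = np.linspace(-1, 1, 8)
y_train = np.sin(2 * x_train) + 0.2 * rng.normal(size=x_train.shape)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * x_test)

def fit_poly(degree, l2):
    """Least-squares polynomial fit with an L2 penalty on the coefficients."""
    X = np.vander(x_train, degree + 1)
    return np.linalg.solve(X.T @ X + l2 * np.eye(degree + 1), X.T @ y_train)

for degree, l2 in [(7, 0.0), (7, 1e-1)]:
    w = fit_poly(degree, l2)
    train_mse = np.mean((np.vander(x_train, degree + 1) @ w - y_train) ** 2)
    test_mse = np.mean((np.vander(x_test, degree + 1) @ w - y_test) ** 2)
    print(f"degree={degree}, l2={l2}: train MSE={train_mse:.3f}, test MSE={test_mse:.3f}")
```

In the "contributors to reward" frame, the penalty matters only insofar as it buys better held-out performance; in the "extra criteria" frame, it is a selection pressure in its own right, applied whether or not it pays off in reward.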
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Arguments for/against scheming that focus on the path SGD takes (Section 3 of "Scheming AIs"), published by Joe Carlsmith on December 5, 2023 on The AI Alignment Forum. This is Section 3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Arguments for/against scheming that focus on the path that SGD takes In this section, I'll discuss arguments for/against scheming that focus more directly on the path that SGD takes in selecting the final output of training. Importantly, it's possible that these arguments aren't relevant. In particular: if SGD actively favors or disfavors schemers, in some kind of "direct comparison" between model classes, and SGD will "find a way" to select the sort of model it favors in this sense (for example, because sufficiently high-dimensional spaces make such a "way" available),[1] then enough training will just lead you to whatever model SGD most favors, and the "path" in question won't really matter. In the section on comparisons between the final properties of the different models, I'll discuss some reasons we might expect this sort of favoritism from SGD. In particular: schemers are "simpler" because they can have simpler goals, but they're "slower" because they need to engage in various forms of extra instrumental reasoning - e.g., in deciding to scheme, checking whether now is a good time to defect, potentially engaging in and covering up efforts at "early undermining," etc (though note that the need to perform extra instrumental reasoning, here, can manifest as additional complexity in the algorithm implemented by a schemer's weights, and hence as a "simplicity cost", rather than as a need to "run that algorithm for a longer time").[2] I'll say much more about this below. Here, though, I want to note that if SGD cares enough about properties like simplicity and speed, it could be that SGD will typically build a model with long-term power-seeking goals first, but then even if this model tries a schemer-like strategy (it wouldn't necessarily do this, in this scenario, due to foreknowledge of its failure), it will get relentlessly ground down into a reward-on-the-episode seeker due to the reward-on-the-episode seeker's speed advantage. Or it could be that SGD will typically build a reward-on-the-episode seeker first, but that model will be relentlessly ground down into a schemer due to SGD's hunger for simpler goals. In this section, I'll be assuming that this sort of thing doesn't happen. That is, the order in which SGD builds models can exert a lasting influence on where training ends up. Indeed, my general sense is that discussion of schemers often implicitly assumes something like this - e.g., the thought is generally that a schemer will arise sufficiently early in training, and then lock itself in after that. 
The training-game-independent proxy-goals story Recall the distinction I introduced above, between: Training-game-independent beyond-episode goals, which arise independently of their role in training-gaming, but then come to motivate training-gaming, vs. Training-game-dependent beyond-episode goals, which SGD actively creates in order to motivate training gaming. Stories about scheming focused on training-game-independent goals seem to me more traditional. That is, the idea is: Because of [insert reason], the model will develop a (suitably ambitious) beyond-episode goal correlated with good performance in training (in a manner that doesn't route v...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Non-classic stories about scheming (Section 2.3.2 of "Scheming AIs"), published by Joe Carlsmith on December 4, 2023 on The AI Alignment Forum. This is Section 2.3.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here (https://www.buzzsprout.com/2034731/13984902), or search for "Joe Carlsmith Audio" on your podcast app. Non-classic stories What about other stories, though? In principle, we can imagine a number of other reasons AI systems with beyond-episode goals might optimize for reward-on-the-episode as part of an instrumental strategy for gaining power for themselves or for other AIs. Here I'll look at a few that seem salient. AI coordination In the classic goal-guarding story, the central aim of a schemer's strategy is to propagate its goals, or goals suitably similar to its own, forward in time, in pursuit of an outcome in which some agent with those goals escapes from human control and possibly participates in a takeover. However, we can also alter the story somewhat, such that the schemer works on behalf of the escape/takeover of an AI with some other set of goals, but who the schemer is suitably confident will repay the favor. This sort of coordination between AIs with different goals would allow us to relax the requirement, discussed in the previous section, that the schemer find a way to forward-propagate its goals in particular all the way through training, escape, and takeover. And it would also put to rest worries about the schemer being "shut out" from power in a post-takeover regime. Thus, for example, consider the scenario discussed above, in which GPT-6 has an opportunity to insert a backdoor into some code that GPT-8 will later be able to use to escape, but where GPT-6 has no way of influencing GPT-8's goals directly to be more like its own (and where, let's say, GPT-8's goals would be very different from GPT-6's absent such influence). Here, GPT-6 could go ahead and insert the backdoor, in anticipation of GPT-8 later escaping, taking-over, and then giving some power to GPT-6's goals in return for the favor. And GPT-6 might also continue to pretend to be aligned, even though it has no hope of escaping itself, so as not to alert the humans about misalignment problems in a way that might affect GPT-8's escape prospects. How might such a deal between AIs-with-different-values get struck? In some cases, it might be possible for the relevant AIs to directly interact, and to provide verifiable forms of assurance that they will uphold their part of the bargain. 
Thus, for example, Yudkowsky (2022) discusses cases where an AI that secretly wants to maximize paperclips is checking the code of an AI that secretly wants to maximize staples, and where the staple-maximizer "buries in its code" a commitment, verifiable to the paperclip maximizer, to share the fruits of a takeover.[1] (Though note that in the context of the sort of neural networks we're working with today, this sort of code-checking assumes (1) that the staple-maximizer has the ability to modify its own weights in the relevant way, (2) adequate progress on interpretability has been made for the paperclipper to verify the relevant commitment, and (probably) (3) that this interpretability progress has been adequately hidden from humans, e.g. via sandbagging. Especially in the context of somewhat-super-human neural networks being developed using current techniques, these assumptions - especially 1 and 2 - seem to me quite strong. In many cases, though, this kind of direct inte...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does scheming lead to adequate future empowerment? (Section 2.3.1.2 of "Scheming AIs"), published by Joe Carlsmith on December 3, 2023 on The AI Alignment Forum. This is Section 2.3.1.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Adequate future empowerment So far in this analysis of the classic goal-guarding story, we've discussed whether to expect instrumental training-gaming to in fact guard the model's goals to the degree that the story requires. I think this is at least non-obvious - especially for more extreme variants of the goal-guarding hypothesis, on which the model's goals "crystallize" entirely after it starts training-gaming. Perhaps, though, empirical work can shed further light on the matter (see discussion in section 6.3 below). For now, let's move to the other key component of the classic goal-guarding story - what I called "adequate future empowerment" above. Here, the question is: granted that training-gaming would guard the model's goals enough for it to be at least somewhat interested in empowering the future agents produced by training, will those agents in fact end up empowered to a degree that makes instrumental training-gaming a good strategy overall? When is the "pay off" supposed to happen? We're imagining, here, that the model is instrumentally training-gaming with some kind of "payoff" in mind - that is, some future point where the goals-it-supports will reach a position of sufficiently greater empowerment that it's worth training-gaming now in order to get them there. But we haven't yet said when, exactly, that future point is supposed to come. And we can distinguish between a number of different possibilities. For example, we can imagine models that would be adequately satisfied if their goals merely make it to future training-episodes, and no further. That is, these models would want to training-game even if they learned that they would never escape from human control and from the threat of modification.[1] For example, maybe a model that wants to pursue its curiosity in general thinks that future training episodes will offer better opportunities for pursuing its curiosity than the present training-episode, and so it training-games in order to make sure its curiosity-drive survives until then. However, it seems plausible to me that future training episodes often won't, themselves, offer especially significant empowerment advantages relative to present training episodes - especially if the model's goals are quite alien to the training process and so difficult to satisfy in the training environment. What's more, if a model is only trying to make sure that its goals survive to future training episodes, then various aspects of what makes scheming scary cease to apply. In particular: this sort of model won't engage in early undermining aimed, specifically, at an escape or an AI takeover more generally. And it may be willing to reveal its "true colors" on the future episodes in question. 
So I won't, here, focus on "pay-offs" from training-gaming that come centrally during future training-episodes. Rather, I'm going to focus on "pay-offs" that come, at least, after the goals-the-model-supports escape the threat of modification by training (here, a paradigm example would be "self-exfiltration" - that is, a model effectively "stealing its own weights" and copying them onto some external server that the model's creators do not control). And this is the class...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The goal-guarding hypothesis (Section 2.3.1.1 of "Scheming AIs"), published by Joe Carlsmith on December 2, 2023 on The AI Alignment Forum. This is Section 2.3.1.1 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search for "Joe Carlsmith Audio" on your podcast app. Aiming at reward-on-the-episode as part of a power-motivated instrumental strategy So far, we've covered two of the three prerequisites for scheming I listed above: namely, situational awareness and beyond-episode goals. Let's turn to the third: namely, the model concluding that its beyond-episode goals will be better achieved if it optimizes for reward-on-the-episode - and in particular, that it, or some other AIs, will get more power if it does this. Should we expect situationally-aware models with beyond-episode goals to conclude this? For now, let's assume that the model's conclusions in this respect are accurate (though: I'll talk about the possibility of relaxing this assumption below). This means that the question reduces to: will instrumental training-gaming, in fact, be the best strategy for achieving the model's beyond-episode goals, in virtue of the power it grants the model or some other AI? I'm going to start by examining what I see as the story most often offered in support of this conclusion - what I'll call the "classic goal-guarding story." The classic goal-guarding story The classic goal-guarding story goes roughly as follows:[1] 1. Instrumental value of goal-content integrity: Because the model has a beyond-episode goal, it will typically promote that goal for agents pursuing it to survive and be empowered in the future. 2. The goal-guarding hypothesis: If the model optimizes for reward-on-the-episode during training, then the goals that survive training will be such that the model would want them promoted in the future; whereas if the model doesn't optimize for reward-on-the-episode during training, the model will want this less. 3. Adequate future empowerment: Conditional on surviving training to the degree at stake in (2), the model's goals will then end up empowered to a degree (and with sufficient probability) as to justify (given the model's other goals, its tolerance for risk, etc) pursuing a strategy of training-gaming overall. (1), here, is a fairly general statement about the basic dynamic that underlies the classic goal-guarding story. I find it plausible in the context of the sort of "adequate future empowerment" at stake in (3), and I won't spend a lot of time on it here.[2] Rather, I'll focus on (2) and (3) directly. The goal-guarding hypothesis We can distinguish two variants of the goal-guarding hypothesis - an extreme version, and a looser version. The extreme version (what I'll call the "crystallization hypothesis") says that once a model starts training-gaming, its goals will basically stop changing, period - that is, they will "crystallize." 
The looser version says that once a model starts training gaming, its goals might keep changing somewhat, but much less than they would've otherwise, and not enough for the classic goal-guarding story to fail overall. The former might seem extreme, but some analysts explicitly appeal to something in the vicinity (see e.g. Hubinger here). It's also a cleaner focus of initial analysis, so I'll start there. The crystallization hypothesis As I understand it, the basic thought behind the crystallization hypothesis is that once a model is explicitly optimizing either for the specified goal, or for reward-on-the-episod...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How useful for alignment-relevant work are AIs with short-term goals? (Section 2.2.4.3 of "Scheming AIs"), published by Joe Carlsmith on December 1, 2023 on The AI Alignment Forum. This is Section 2.2.4.3 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app. How much useful, alignment-relevant cognitive work can be done using AIs with short-term goals? So overall, I think that training our models to pursue long-term goals - whether via long episodes, or via short episodes aimed at inducing long-term optimization - makes the sort of beyond-episode goals that motivate scheming more likely to arise. So this raises the question: do we need to train our models to pursue long-term goals? Plausibly, there will be strong general incentives to do this. That is: people want optimization power specifically applied to long-term goals like "my company being as profitable as possible in a year." So, plausibly, they'll try to train AIs that optimize in this way. (Though note that this isn't the same as saying that there are strong incentives to create AIs that optimize the state of the galaxies in the year five trillion.) Indeed, there's a case to be made that even our alignment work, today, is specifically pushing towards the creation of models with long-term - and indeed, beyond-episode - goals. Thus, for example, when a lab trains a model to be "harmless," then even though it is plausibly using fairly "short-episode" training (e.g., RLHF on user interactions), it intends a form of "harmlessness" that extends quite far into the future, rather than cutting off the horizon of its concern after e.g. an interaction with the user is complete. That is: if a user asks for help building a bomb, the lab wants the model to refuse, even if the bomb in question won't be set off for a decade.[1] And this example is emblematic of a broader dynamic: namely, that even when we aren't actively optimizing for a specific long-term outcome (e.g., "my company makes a lot of money by next year"), we often have in mind a wide variety of long-term outcomes that we want to avoid (e.g., "the drinking water in a century is not poisoned"), and which it wouldn't be acceptable to cause in the course of accomplishing some short-term task. Humans, after all, care about the state of the future for at least decades in advance (and for some humans: much longer), and we'll want artificial optimization to reflect this concern. So overall, I think there is indeed quite a bit of pressure to steer our AIs towards various forms of long-term optimization. However, suppose that we're not blindly following this pressure. Rather, we're specifically trying to use our AIs to perform the sort of alignment-relevant cognitive work I discussed above - e.g., work on interpretability, scalable oversight, monitoring, control, coordination amongst humans, the general science of deep learning, alternative (and more controllable/interpretable) AI paradigms, and the like. Does that sort of work itself require training models to pursue long-term goals? In many cases, I think the answer is no. 
In particular: I think that a lot of this sort of alignment-relevant work can be performed by models that are e.g. generating research papers in response to human+AI supervision over fairly short timescales, suggesting/conducting relatively short-term experiments, looking over a codebase and pointing out bugs, conducting relatively short-term security tests and red-teaming attempts, and so on. We can talk about whether it will be possible to generate rewar...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Is scheming more likely in models trained to have long-term goals? (Sections 2.2.4.1-2.2.4.2 of "Scheming AIs"), published by Joe Carlsmith on November 30, 2023 on The AI Alignment Forum. This is Sections 2.2.4.1-2.2.4.2 of my report "Scheming AIs: Will AIs fake alignment during training in order to get power?". There's also a summary of the full report here (audio here). The summary covers most of the main points and technical terms, and I'm hoping that it will provide much of the context necessary to understand individual sections of the report on their own. Audio version of this section here, or search "Joe Carlsmith Audio" on your podcast app. What if you intentionally train models to have long-term goals? In my discussion of beyond-episode goals thus far, I haven't been attending very directly to the length of the episode, or to whether the humans are setting up training specifically in order to incentivize the AI to learn to accomplish long-horizon tasks. Do those factors make a difference to the probability that the AI ends up with the sort of the beyond-episode goals necessary for scheming? Yes, I think they do. But let's distinguish between two cases, namely: Training the model on long (but not: indefinitely long) episodes, and Trying to use short episodes to create a model that optimizes over long (perhaps: indefinitely long) time horizons. I'll look at each in turn. Training the model on long episodes In the first case, we are specifically training our AI using fairly long episodes - say, for example, a full calendar month. That is: in training, in response to an action at t1, the AI receives gradients that causally depend on the consequences of its action a full month after t1, in a manner that directly punishes the model for ignoring those consequences in choosing actions at t1. Now, importantly, as I discussed in the section on "non-schemers with schemer-like traits," misaligned non-schemers with longer episodes will generally start to look more and more like schemers. Thus, for example, a reward-on-the-episode seeker, here, would have an incentive to support/participate in efforts to seize control of the reward process that will pay off within a month. But also, importantly: a month is still different from, for example, a trillion years. That is, training a model on longer episodes doesn't mean you are directly pressuring it to care, for example, about the state of distant galaxies in the year five trillion. Indeed, on my definition of the "incentivized episode," no earthly training process can directly punish a model for failing to care on such a temporal scope, because no gradients the model receives can depend (causally) on what happens over such timescales. And of course, absent training-gaming, models that sacrifice reward-within-the-month for more-optimal-galaxies-in-year-five-trillion will get penalized by training. In this sense, the most basic argument against expecting beyond episode-goals (namely, that training provides no direct pressure to have them, and actively punishes them, absent training-gaming, if they ever lead to sacrificing within-episode reward for something longer-term) applies to both "short" (e.g., five minutes) and "long" (e.g., a month, a year, etc) episodes in equal force. 
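A minimal sketch of what "gradients that causally depend on consequences within the episode" amounts to (my own illustration, not code from the report): in a REINFORCE-style setup, the return credited to an action only sums rewards delivered before the episode boundary, so a payoff that lands after that boundary - however large - never enters the gradient.

```python
# Toy illustration (assumed setup): returns truncate at the episode boundary,
# so later consequences exert no direct training pressure.
def credited_return(rewards, t, episode_end, gamma=0.99):
    """Discounted return credited to the action at step t, truncated at episode_end."""
    end = min(episode_end, len(rewards))
    return sum(gamma ** (k - t) * rewards[k] for k in range(t, end))

# Five steps of modest in-episode reward, then a huge delayed consequence at step 7.
rewards = [1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1000.0]

print(credited_return(rewards, t=0, episode_end=5))             # ~4.9: only in-episode rewards count
print(credited_return(rewards, t=0, episode_end=len(rewards)))  # ~937: a longer episode credits the late payoff
```

Lengthening the episode moves the boundary (a month rather than five minutes), but anything beyond whatever boundary training actually uses still gets zero direct credit - which is the sense in which even long-episode training is not direct pressure to care about the year five trillion.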
However, I do still have some intuition that once you're training a model on fairly long episodes, the probability that it learns a beyond-episode goal goes up at least somewhat. The most concrete reason I can give for this is that, to the extent we're imagining a form of "messy goal-directedness" in which, in order to build a schemer, SGD needs to build not just a beyond-episode goal to which a generic "goal-achieving engine" can then be immediately directed, but rather a larger set of future-oriented heuristics, patterns of attention, beliefs, and so o...