POPULARITY
What if there's something it's like to be a shrimp — or a chatbot?For centuries, humans have debated the nature of consciousness, often placing ourselves at the very top. But what about the minds of others — both the animals we share this planet with and the artificial intelligences we're creating?We've pulled together clips from past conversations with researchers and philosophers who've spent years trying to make sense of animal consciousness, artificial sentience, and moral consideration under deep uncertainty.Links to learn more and full transcript: https://80k.info/nhsChapters:Cold open (00:00:00)Luisa's intro (00:00:57)Robert Long on what we should picture when we think about artificial sentience (00:02:49)Jeff Sebo on what the threshold is for AI systems meriting moral consideration (00:07:22)Meghan Barrett on the evolutionary argument for insect sentience (00:11:24)Andrés Jiménez Zorrilla on whether there's something it's like to be a shrimp (00:15:09)Jonathan Birch on the cautionary tale of newborn pain (00:21:53)David Chalmers on why artificial consciousness is possible (00:26:12)Holden Karnofsky on how we'll see digital people as... people (00:32:18)Jeff Sebo on grappling with our biases and ignorance when thinking about sentience (00:38:59)Bob Fischer on how to think about the moral weight of a chicken (00:49:37)Cameron Meyer Shorb on the range of suffering in wild animals (01:01:41)Sébastien Moro on whether fish are conscious or sentient (01:11:17)David Chalmers on when to start worrying about artificial consciousness (01:16:36)Robert Long on how we might stumble into causing AI systems enormous suffering (01:21:04)Jonathan Birch on how we might accidentally create artificial sentience (01:26:13)Anil Seth on which parts of the brain are required for consciousness (01:32:33)Peter Godfrey-Smith on uploads of ourselves (01:44:47)Jonathan Birch on treading lightly around the “edge cases” of sentience (02:00:12)Meghan Barrett on whether brain size and sentience are related (02:05:25)Lewis Bollard on how animal advocacy has changed in response to sentience studies (02:12:01)Bob Fischer on using proxies to determine sentience (02:22:27)Cameron Meyer Shorb on how we can practically study wild animals' subjective experiences (02:26:28)Jeff Sebo on the problem of false positives in assessing artificial sentience (02:33:16)Stuart Russell on the moral rights of AIs (02:38:31)Buck Shlegeris on whether AI control strategies make humans the bad guys (02:41:50)Meghan Barrett on why she can't be totally confident about insect sentience (02:47:12)Bob Fischer on what surprised him most about the findings of the Moral Weight Project (02:58:30)Jeff Sebo on why we're likely to sleepwalk into causing massive amounts of suffering in AI systems (03:02:46)Will MacAskill on the rights of future digital beings (03:05:29)Carl Shulman on sharing the world with digital minds (03:19:25)Luisa's outro (03:33:43)Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic ArmstrongAdditional content editing: Katy Moore and Milo McGuireTranscriptions and web: Katy Moore
Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we're unlikely to know we've solved the problem before the arrival of human-level and superhuman systems in as little as three years.So some — including Buck Shlegeris, CEO of Redwood Research — are developing a backup plan to safely deploy models we fear are actively scheming to harm us: so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier. These highlights are from episode #214 of The 80,000 Hours Podcast: Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway, and include:What is AI control? (00:00:15)One way to catch AIs that are up to no good (00:07:00)What do we do once we catch a model trying to escape? (00:13:39)Team Human vs Team AI (00:18:24)If an AI escapes, is it likely to be able to beat humanity from there? (00:24:59)Is alignment still useful? (00:32:10)Could 10 safety-focused people in an AGI company do anything useful? (00:35:34)These aren't necessarily the most important or even most entertaining parts of the interview — so if you enjoy this, we strongly recommend checking out the full episode!And if you're finding these highlights episodes valuable, please let us know by emailing podcast@80000hours.org. Highlights put together by Ben Cordell, Milo McGuire, and Dominic Armstrong
Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we're unlikely to know we've solved the problem before the arrival of human-level and superhuman systems in as little as three years.So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.Today's guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they're more plausible than you might think. He argues that given companies' unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.Links to learn more, highlights, video, and full transcript.As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem."Of course, even if Buck is right, we still need to do those 40 things — which he points out we're not on track for. And AI control agendas have their limitations: they aren't likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:Why he's more worried about AI hacking its own data centre than escapingWhat to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment researchWhy he might want to use a model he thought could be conspiring against himWhy he would feel safer if he caught an AI attempting to escapeWhy many control techniques would be relatively inexpensiveHow to use an untrusted model to monitor another untrusted modelWhat the minimum viable intervention in a “lazy” AI company might look likeHow even small teams of safety-focused staff within AI labs could matterThe moral considerations around controlling potentially conscious AI systems, and whether it's justifiedChapters:Cold open |00:00:00| Who's Buck Shlegeris? |00:01:27| What's AI control? |00:01:51| Why is AI control hot now? |00:05:39| Detecting human vs AI spies |00:10:32| Acute vs chronic AI betrayal |00:15:21| How to catch AIs trying to escape |00:17:48| The cheapest AI control techniques |00:32:48| Can we get untrusted models to do trusted work? |00:38:58| If we catch a model escaping... will we do anything? |00:50:15| Getting AI models to think they've already escaped |00:52:51| Will they be able to tell it's a setup? |00:58:11| Will AI companies do any of this stuff? |01:00:11| Can we just give AIs fewer permissions? |01:06:14| Can we stop human spies the same way? |01:09:58| The pitch to AI companies to do this |01:15:04| Will AIs get superhuman so fast that this is all useless? |01:17:18| Risks from AI deliberately doing a bad job |01:18:37| Is alignment still useful? |01:24:49| Current alignment methods don't detect scheming |01:29:12| How to tell if AI control will work |01:31:40| How can listeners contribute? |01:35:53| Is 'controlling' AIs kind of a dick move? |01:37:13| Could 10 safety-focused people in an AGI company do anything useful? |01:42:27| Benefits of working outside frontier AI companies |01:47:48| Why Redwood Research does what it does |01:51:34| What other safety-related research looks best to Buck? |01:58:56| If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48| Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04| Is research on human scheming relevant to AI? |02:08:03|This episode was originally recorded on February 21, 2025.Video: Simon Monsour and Luke MonsourAudio engineering: Ben Cordell, Milo McGuire, and Dominic ArmstrongTranscriptions and web: Katy Moore
This is a low-effort post. I mostly want to get other people's takes and express concern about the lack of detailed and publicly available plans so far. This post reflects my personal opinion and not necessarily that of other members of Apollo Research. I'd like to thank Ryan Greenblatt, Bronson Schoen, Josh Clymer, Buck Shlegeris, Dan Braun, Mikita Balesni, Jérémy Scheurer, and Cody Rushing for comments and discussion.I think short timelines, e.g. AIs that can replace a top researcher at an AGI lab without losses in capabilities by 2027, are plausible. Some people have posted ideas on what a reasonable plan to reduce AI risk for such timelines might look like (e.g. Sam Bowman's checklist, or Holden Karnofsky's list in his 2022 nearcast), but I find them insufficient for the magnitude of the stakes (to be clear, I don't think these example lists were intended to be an [...] ---Outline:(02:36) Short timelines are plausible(07:10) What do we need to achieve at a minimum?(10:50) Making conservative assumptions for safety progress(12:33) So whats the plan?(14:31) Layer 1(15:41) Keep a paradigm with faithful and human-legible CoT(18:15) Significantly better (CoT, action and white-box) monitoring(21:19) Control (that doesn't assume human-legible CoT)(24:16) Much deeper understanding of scheming(26:43) Evals(29:56) Security(31:52) Layer 2(32:02) Improved near-term alignment strategies(34:06) Continued work on interpretability, scalable oversight, superalignment and co(36:12) Reasoning transparency(38:36) Safety first culture(41:49) Known limitations and open questions--- First published: January 2nd, 2025 Source: https://www.lesswrong.com/posts/bb5Tnjdrptu89rcyY/what-s-the-short-timeline-plan --- Narrated by TYPE III AUDIO.
In Episode #47 Trailer, host John Sherman talks with Buck Shlegeris, CEO of Redwood Research, a non-profit company working on technical AI risk challenges. The discussion includes Buck's thoughts on the new OpenAI o1-preview model, but centers on two questions: is there a way to control AI models before alignment is achieved if it can be, and how would the system that's supposed to save the world actually work if an AI lab found a model scheming. Check out these links to Buck's writing on these topics below: https://redwoodresearch.substack.com/p/the-case-for-ensuring-that-powerful https://redwoodresearch.substack.com/p/would-catching-your-ais-trying-to Senate Hearing: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-insiders-perspectives Harry Macks Youtube Channel https://www.youtube.com/channel/UC59ZRYCHev_IqjUhremZ8Tg LEARN HOW TO HELP RAISE AI RISK AWARENESS IN YOUR COMMUNITY HERE https://pauseai.info/local-organizing Please Donate Here To Help Promote For Humanity https://www.paypal.com/paypalme/forhumanitypodcast EMAIL JOHN: forhumanitypodcast@gmail.com This podcast is not journalism. But it's not opinion either. This is a long form public service announcement. This show simply strings together the existing facts and underscores the unthinkable probable outcome, the end of all life on earth. For Humanity: An AI Safety Podcast, is the accessible AI Safety Podcast for all humans, no tech background required. Our show focuses solely on the threat of human extinction from AI. Peabody Award-winning former journalist John Sherman explores the shocking worst-case scenario of artificial intelligence: human extinction. The makers of AI openly admit it their work could kill all humans, in as soon as 2 years. This podcast is solely about the threat of human extinction from AGI. We'll meet the heroes and villains, explore the issues and ideas, and what you can do to help save humanity. RESOURCES: JOIN THE FIGHT, help Pause AI!!!! Pause AI SUBSCRIBE TO LIRON SHAPIRA'S DOOM DEBATES on YOUTUBE!! https://www.youtube.com/@DoomDebates Join the Pause AI Weekly Discord Thursdays at 2pm EST / discord https://discord.com/invite/pVMWjddaW7 Max Winga's “A Stark Warning About Extinction” https://youtu.be/kDcPW5WtD58?si=i6IRy82xZ2PUOp22 For Humanity Theme Music by Josef Ebner Youtube: https://www.youtube.com/channel/UCveruX8E-Il5A9VMC-N4vlg Website: https://josef.pictures BUY STEPHEN HANSON'S BEAUTIFUL AI RISK BOOK!!! https://stephenhansonart.bigcartel.com/product/the-entity-i-couldn-t-fathom 22 Word Statement from Center for AI Safety Statement on AI Risk | CAIS https://www.safe.ai/work/statement-on-ai-risk Best Account on Twitter: AI Notkilleveryoneism Memes https://twitter.com/AISafetyMemes
In Episode #47, host John Sherman talks with Buck Shlegeris, CEO of Redwood Research, a non-profit company working on technical AI risk challenges. The discussion includes Buck's thoughts on the new OpenAI o1-preview model, but centers on two questions: is there a way to control AI models before alignment is achieved if it can be, and how would the system that's supposed to save the world actually work if an AI lab found a model scheming. Check out these links to Buck's writing on these topics below: https://redwoodresearch.substack.com/p/the-case-for-ensuring-that-powerful https://redwoodresearch.substack.com/p/would-catching-your-ais-trying-to Senate Hearing: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-insiders-perspectives Harry Macks Youtube Channel https://www.youtube.com/channel/UC59ZRYCHev_IqjUhremZ8Tg LEARN HOW TO HELP RAISE AI RISK AWARENESS IN YOUR COMMUNITY HERE https://pauseai.info/local-organizing Please Donate Here To Help Promote For Humanity https://www.paypal.com/paypalme/forhumanitypodcast EMAIL JOHN: forhumanitypodcast@gmail.com This podcast is not journalism. But it's not opinion either. This is a long form public service announcement. This show simply strings together the existing facts and underscores the unthinkable probable outcome, the end of all life on earth. For Humanity: An AI Safety Podcast, is the accessible AI Safety Podcast for all humans, no tech background required. Our show focuses solely on the threat of human extinction from AI. Peabody Award-winning former journalist John Sherman explores the shocking worst-case scenario of artificial intelligence: human extinction. The makers of AI openly admit it their work could kill all humans, in as soon as 2 years. This podcast is solely about the threat of human extinction from AGI. We'll meet the heroes and villains, explore the issues and ideas, and what you can do to help save humanity. RESOURCES: JOIN THE FIGHT, help Pause AI!!!! Pause AI SUBSCRIBE TO LIRON SHAPIRA'S DOOM DEBATES on YOUTUBE!! https://www.youtube.com/@DoomDebates Join the Pause AI Weekly Discord Thursdays at 2pm EST / discord https://discord.com/invite/pVMWjddaW7 Max Winga's “A Stark Warning About Extinction” https://youtu.be/kDcPW5WtD58?si=i6IRy82xZ2PUOp22 For Humanity Theme Music by Josef Ebner Youtube: https://www.youtube.com/channel/UCveruX8E-Il5A9VMC-N4vlg Website: https://josef.pictures BUY STEPHEN HANSON'S BEAUTIFUL AI RISK BOOK!!! https://stephenhansonart.bigcartel.com/product/the-entity-i-couldn-t-fathom 22 Word Statement from Center for AI Safety Statement on AI Risk | CAIS https://www.safe.ai/work/statement-on-ai-risk Best Account on Twitter: AI Notkilleveryoneism Memes https://twitter.com/AISafetyMemes
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Would catching your AIs trying to escape convince AI developers to slow down or undeploy?, published by Buck Shlegeris on August 26, 2024 on The AI Alignment Forum. I often talk to people who think that if frontier models were egregiously misaligned and powerful enough to pose an existential threat, you could get AI developers to slow down or undeploy models by producing evidence of their misalignment. I'm not so sure. As an extreme thought experiment, I'll argue this could be hard even if you caught your AI red-handed trying to escape. Imagine you're running an AI lab at the point where your AIs are able to automate almost all intellectual labor; the AIs are now mostly being deployed internally to do AI R&D. (If you want a concrete picture here, I'm imagining that there are 10 million parallel instances, running at 10x human speed, working 24/7. See e.g. similar calculations here). And suppose (as I think is 35% likely) that these models are egregiously misaligned and are actively looking for ways to subvert your safety measures and escape your control. You caught your AI attempting to backdoor some code that controls access to a security-critical service. You investigated this by running other instances on the inputs they'd see if that backdoored code had actually been deployed; the other AI instances never point out the security flaw, and they often try to exploit that code to escalate their privileges. (See here for more on this.) At this point you're almost sure the AI is scheming. So you try to call the US President and they hear you out, and then they go off and call one of your competitors to ask for their take. They say something like this: It would be extremely costly to shut down all the AI projects. China is racing to catch up and won't stop for this. If you force us to stop deploying our frontier models, they'll outcompete us within a year, and you won't be able to get China to pause without substantial risk of war. AIs are well known to do weird stuff. It would be irresponsible to assume one instance of anomalous behavior meant AIs were systematically going to act that way. I can put you in touch with top AI experts and they'll back me up. Even if the AI is indeed doing something systematically funny, we have no evidence that it has the ambition to seize huge amounts of power, and it's not clear that it would be able to. It seems implausible that AIs would do that; there's been no concrete evidence that the AIs are in fact power-hungry. Maybe our competitor just messed up their alignment, which would make sense because, as we've always told you, they're not very competent (especially compared to us). If they want to shut down, they're welcome to. But it doesn't make sense to penalize us for their shoddy work. Maybe they're just lying; this could be a doomer scheme to shut down AI, which the doomers have wanted to do for years. The logs they sent you could be complete forgeries. Or for all we know, someone there (with or without leadership approval) intentionally backdoored their RL data to make their model do this. It would be a mistake to take rash action before independent experts confirm that this wasn't somehow rigged. I'm sympathetic to all of these arguments. The main reason I'd be more freaked out is that I already think egregious misalignment is fairly plausible; if I thought it was very unlikely, I wouldn't change my mind based on one weird observation. (I think it's pretty plausible that news of the escape attempt wouldn't even make it out of the AI lab: all the above arguments could happen inside the AI lab, between the safety concerned people and people who wanted to push forward.) Now, I'm not sure I'm right to be so pessimistic here. But I think a lot of misalignment-concerned people take for granted that we'd be okay in this situation, an...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 20, 2024 on LessWrong. In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between capability for scheming and in fact scheming, and C) is clearer how to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt, thanks to them and Cody Rushing for feedback. Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: Every time you are using the new model and it takes some action, use another model to check the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.) Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways how this control strategy could fail. Here are some:[2] 1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice. 2. Smart successfully jailbreaks (via e.g. a suffix attack) Weak to not alert humans about its escape attempt. 1. Or: Weak happens to have a backdoor, which Smart finds and exploits. 3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3] 4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, Weak generalizes incorrectly out-of-distribution and doesn't alert humans. 5. The company has way too limited of a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% suspiciousness, it's not enough to meet the 0.01% threshold for alerting humans. 6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape. Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no less!) either...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 20, 2024 on LessWrong. In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between capability for scheming and in fact scheming, and C) is clearer how to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt, thanks to them and Cody Rushing for feedback. Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: Every time you are using the new model and it takes some action, use another model to check the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.) Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways how this control strategy could fail. Here are some:[2] 1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice. 2. Smart successfully jailbreaks (via e.g. a suffix attack) Weak to not alert humans about its escape attempt. 1. Or: Weak happens to have a backdoor, which Smart finds and exploits. 3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3] 4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, Weak generalizes incorrectly out-of-distribution and doesn't alert humans. 5. The company has way too limited of a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% suspiciousness, it's not enough to meet the 0.01% threshold for alerting humans. 6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape. Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no less!) either...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrustworthy models: a frame for scheming evaluations, published by Olli Järviniemi on August 19, 2024 on The AI Alignment Forum. In this post, I'll talk about trustworthiness, a concept similar to scheming and deceptive alignment. I prefer this concept, as I find it A) is directly important in practice, B) more readily distinguishes between capability for scheming and in fact scheming, and C) is clearer how to evaluate for. I'll also distinguish between different types of untrustworthy models - active planners, sleeper agents and opportunists - that are often conflated. I think when people talk about evaluating models for something like scheming, they'd often be better off talking about trustworthiness instead, and I think it's overall a more applicable frame for designing evaluations and decision making. Many of the ideas here are from Buck Shlegeris and Ryan Greenblatt, thanks to them and Cody Rushing for feedback. Suppose an AI company has developed a new model, and wants to use it to perform large quantities of ML research (internal deployment). However, the company is worried about the model being deceptively aligned[1] and trying to take catastrophic actions, such as hacking the data center or copying its weights to a remote server. What can the company do? This is a textbook example of the type of problem AI control tries to solve, and a basic strategy looks like this: Every time you are using the new model and it takes some action, use another model to check the action is OK before letting it be executed. If the action is not OK, it's not executed and humans are alerted about suspicious activity. (If you have never heard of AI control before, you might want to try playing the Bash Control game to get familiar with the concept. In the game, your aim is to download and execute a malicious program without GPT-3.5 catching you.) Let's call our characters Smart and Weak: Smart is the new model doing ML research while maybe trying to escape, while Weak is a model trained to monitor Smart. The reason for these names will become apparent soon. There are many ways how this control strategy could fail. Here are some:[2] 1. Smart came up with an extremely subtle way to escape that Weak is just too dumb to notice. 2. Smart successfully jailbreaks (via e.g. a suffix attack) Weak to not alert humans about its escape attempt. 1. Or: Weak happens to have a backdoor, which Smart finds and exploits. 3. Smart successfully convinces Weak (via natural language) that its actions are not malicious.[3] 4. Smart's escape plan is unlike anything we have trained Weak on. While Weak "would be capable of" telling that Smart is escaping, our training didn't specify well enough what we wanted, Weak generalizes incorrectly out-of-distribution and doesn't alert humans. 5. The company has way too limited of a budget for human alerts, so while Weak does rank Smart's actions in the top 0.012% suspiciousness, it's not enough to meet the 0.01% threshold for alerting humans. 6. Weak believes that not alerting humans in this case would allow Smart to escape, and believes that this leads to better outcomes in light of its preferences, and thus for instrumental reasons deliberately does not alert humans, indeed causing Smart to escape. Failure 6 is very different from the others! Whereas failures 1 to 5 are broadly about "Weak is just not a capable/good enough monitor to do what we wanted", failure 6 is about "Weak knows perfectly well Smart is escaping, but betrays humans and doesn't tell them about this". This issue is most apparent if you were to use a copy of the AI lab's new model to monitor itself (so Smart and Weak are the same model[4]): If you don't trust the new model to do research without trying to escape, you probably can't trust it to alert humans about escape attempts (by itself, no ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fields that I reference when thinking about AI takeover prevention, published by Buck Shlegeris on August 13, 2024 on The AI Alignment Forum. Is AI takeover like a nuclear meltdown? A coup? A plane crash? My day job is thinking about safety measures that aim to reduce catastrophic risks from AI (especially risks from egregious misalignment). The two main themes of this work are the design of such measures (what's the space of techniques we might expect to be affordable and effective) and their evaluation (how do we decide which safety measures to implement, and whether a set of measures is sufficiently robust). I focus especially on AI control, where we assume our models are trying to subvert our safety measures and aspire to find measures that are robust anyway. Like other AI safety researchers, I often draw inspiration from other fields that contain potential analogies. Here are some of those fields, my opinions on their strengths and weaknesses as analogies, and some of my favorite resources on them. Robustness to insider threats The single field that seems most closely analogous to AI control is reducing vulnerability to threats resulting from insiders, especially threats that arise from access to IT systems. That is, you have a bunch of employees who need to have access to various IT systems in order to do their jobs, and you're worried that some of these employees will abuse their access in order to accomplish some nefarious purpose. In a bank, you might worry about your tellers withdrawing money from customer accounts inappropriately. In a software company, you might worry about your employees sabotaging your code out of spite, violating user privacy for fun, or stealing your data to give it to a competitor. This is quite similar to the problem faced by AI developers when they're trying to control AIs (especially in internal deployments, which I argue is important here): All of the access the AI has is access the developer decided to give it; the developer faces a constant tradeoff between the productivity of the AI (which is maximized by being permissive) and the danger it poses (which is minimized by being restrictive). And obviously, because it's an AI it only has access to resources that are connected to the computer network where it's running, so we can restrict our attention to the IT resources. It's easy for the developer to constantly surveil the AI, which substantially reduces the space of strategies the AI can take. The developer has a lot of control over workflows, and is able to choose them to make it harder for the AI to subtly cause problems. For example, they can restrict affordances in ways that make it harder for AIs to cause problems without being noticed. An example of a similar technique in the insider threat prevention case is the use of administrative APIs instead of direct access to a system, as described in "Building Secure and Reliable Systems" chapter 5. That is: instead of having your engineers resolve issues by SSHing into production machines (which implicitly grants them very broad access that can lead to problems, intentional or accidental), you design an administrative API which allows the engineers to access just the affordances that usually suffice for them to do their job. Of course, they'll sometimes need to take actions that aren't available in the administrative API; this is called a breakglass mechanism. But because breakglass actions are rare, it's affordable to apply more scrutiny to them: you might use multi-party authorization, requiring another engineer to sign off on the commands before they take effect, or you might do auditing (e.g. apparently some SRE teams review all the breakglass actions at a weekly team meeting, which allows them to spot irregularities and discourage gratuitous use of the mechanism). Similar t...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Scalable oversight as a quantitative rather than qualitative problem, published by Buck Shlegeris on July 6, 2024 on The AI Alignment Forum. [Many of these ideas were developed in conversation with Ryan Greenblatt and Ansh Radhakrishnan; a lot of this isn't original but I haven't seen it written up] A lot of the time when people talk about scalable oversight, which I'll define for the moment as "the problem of developing techniques that allow you to assess whether an action taken or proposed by an AI is good (oversee it), in a way that works even if the AI is very smart and knowledgeable", the main application they describe is the following situation: you have an extremely powerful AI that's taking actions that might have subtle properties that humans are unable to understand, and you want to be oversee them (e.g. to train it, or to determine whether the actions will have catastrophic consequences). I think that this is indeed part of the value proposition for scalable oversight. But in my opinion, it's missing the more central application of these techniques: situations where the AIs are taking many actions, where humans would eventually understand any particular action if they spent a whole lot of time investigating it, but where that amount of time taken to oversee any action is prohibitively large. In such cases, the point of scalable oversight is to allow them to oversee actions at a much lower cost in terms of human time--to push out the Pareto frontier of oversight quality vs cost. Of course, when you're thinking about arbitrarily subtle actions, this distinction doesn't matter: there are actions subtle enough that overseeing them would take more than the whole budget of the AI project, and for those actions these definitions are equivalent. But I think that this distinction will be very important in practice, and I often think that people make reasoning errors when thinking about scalable oversight due to neglecting it. The basic reason to think that it's important is that we probably need to oversee many model actions, and so we probably need to decide on a per-action oversight budget (e.g. the amount of human labor spent on ) that will determine what subtlety of actions we will or won't be able to resolve, but we probably won't choose a per-action oversight budget that is close to the whole AI project budget. Another reason this feels important to me is that, as discussed here, I think a lot of AI risk comes from early transformative models that aren't that much smarter than humans; when thinking through the risk associated with them, it feels pretty implausible that a dangerous action of theirs would be unoverseeable if the whole AI project's budget was spent on overseeing it, but it feel very plausible that with the practical per-action oversight budget, the actions will be too subtle to oversee properly. What implications does this have for research: One question that seems extremely important to me as a result of all this is: what sample efficiency should we expect when training very powerful models? This is relevant because it's a key part of forecasting the per-action oversight budget. In particular, if the sample efficiency of RL increases with large models, it might turn out that the optimal strategy for RLing early transformative models is to produce many fewer and much more expensive labels than people use when training current systems; I think people often neglect this possibility when thinking about the future of scalable oversight. I'd love to see some careful research analyzing the tradeoffs between label quality and quantity, and how these change with model scale. (Ansh has done a bit of investigation along these lines.) When imagining scalable oversight in the future, I don't think people should entirely be imagining cases where the research...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An issue with training schemers with supervised fine-tuning, published by Fabien Roger on June 27, 2024 on The AI Alignment Forum. This is a rough write-up of ideas by Ryan Greenblatt. The core ideas are his and mistakes are mine. Thanks to him, Buck Shlegeris and to Aghyad Deeb for feedback on the draft of this post. One way to use scheming AIs is to train them to imitate humans. When does imitation robustly avoid frequent and intentional failures? On single-token classification tasks, imitation works great and can allow you to get a lot of high-quality labor with a tiny rate of intentional failures; On tasks that involve generating many tokens, imitation potentially fails by default: schemers could notice when an output was generated by themselves, which enables them to consistently behave badly in deployment while doing best-effort imitation during training. We describe two techniques that avoid imitation failures on generation tasks: A technique that works perfectly but is too expensive A technique that we think works in practice While pure imitation may not be sufficiently competitive, we think it could prove useful for tasks such as: (No-CoT) single-token classification; KL-regularization for RL: we may want to penalize models trained with RL for straying too far from a human imitation prior (and we would want this human prior to not be malign!) (see Appendix); Tasks for which it is fine to take the competitiveness hit from not using RL, but which weaker trusted models are not good enough at.[1] Assumptions In this post, we focus on scenarios where: AIs are smart (around human-level or above); AIs are scheming and know how we are training and deploying them; We want to use these AIs productively (e.g. to do safety research)[2]; Training and deployment tasks are indistinguishable (e.g. because we randomly sampled training tasks from the deployment distribution); SGD works: we ignore gradient hacking and AI architectures where SGD is very sample inefficient; AIs can't cause catastrophes in very few actions (e.g. because they are too weak to break safety measures that address these sorts of concerns); Even when AIs try their best, they can't perfectly predict the probability distribution over the actions that humans[3] would take. Easy-mode: imitating human judgment on classification tasks Here is how you can use an AI to 100x the speed at which you can complete a billion single-token (IID) classification tasks (e.g. detecting software bugs, scoring project ideas, …): 1. Sample 1% of the tasks, and ask humans to label them (using their judgment); 2. Train the AI on that 1% of tasks you have solved yourself, using normal training (e.g. SGD with a cross-entropy loss); 3. Use the AI to solve the remaining 99% of tasks. This will get you an AI which failed intentionally at most a few times in the whole training set because intentional failures would have quickly been trained away[4]. Therefore, you got at most a few hundred intentional failures in the whole set of tasks. That is an intentional failure rate at most around 1/10M, not bad![5] In other words, human imitation avoids intentional low-stakes failures: the model won't often be worse than a best-effort imitation of a human. (But best-effort imitation of a human could be catastrophic if that isn't sufficiently capable.) The basic problem with imitating humans on generation tasks But most tasks we care about involve generating many tokens (e.g. coding a complex experiment, writing a detailed project proposal, …). Does imitation work for such generation tasks? Not quite, here is a problem that could arise if you train an AI to imitate human answers on 1% of tasks and use it on the remaining 99% without additional supervision: On the 1% of training tasks, the AI tries its best to imitate human answers when the star...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Getting 50% (SoTA) on ARC-AGI with GPT-4o, published by ryan greenblatt on June 17, 2024 on LessWrong. I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs. [This post is on a pretty different topic than the usual posts I make about AI safety.] The additional approaches and tweaks are: I use few-shot prompts which perform meticulous step-by-step reasoning. I have GPT-4o try to revise some of the implementations after seeing what they actually output on the provided examples. I do some feature engineering, providing the model with considerably better grid representations than the naive approach of just providing images. (See below for details on what a "grid" in ARC-AGI is.) I used specialized few-shot prompts for the two main buckets of ARC-AGI problems (cases where the grid size changes vs doesn't). The prior state of the art on this dataset was 34% accuracy, so this is a significant improvement.[3] On a held-out subset of the train set, where humans get 85% accuracy, my solution gets 72% accuracy.[4] (The train set is significantly easier than the test set as noted here.) Additional increases of runtime compute would further improve performance (and there are clear scaling laws), but this is left as an exercise to the reader. In this post: I describe my method; I analyze what limits its performance and make predictions about what is needed to reach human performance; I comment on what it means for claims that François Chollet makes about LLMs. Given that current LLMs can perform decently well on ARC-AGI, do claims like "LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." make sense? (This quote is from here.) Thanks to Fabien Roger and Buck Shlegeris for a bit of help with this project and with writing this post. What is ARC-AGI? ARC-AGI is a dataset built to evaluate the general reasoning abilities of AIs. It consists of visual problems like the below, where there are input-output examples which are grids of colored cells. The task is to guess the transformation from input to output and then fill out the missing grid. Here is an example from the tutorial: This one is easy, and it's easy to get GPT-4o to solve it. But the tasks from the public test set are much harder; they're often non-trivial for (typical) humans. There is a reported MTurk human baseline for the train distribution of 85%, but no human baseline for the public test set which is known to be significantly more difficult. Here are representative problems from the test set[5], and whether my GPT-4o-based solution gets them correct or not. Problem 1: Problem 2: Problem 3: My method The main idea behind my solution is very simple: get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s). I show GPT-4o the problem as images and in various ascii representations. My approach is similar in spirit to the approach applied in AlphaCode in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit. Actually getting to 50% with this main idea took me about 6 days of work. This work includes construct...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Getting 50% (SoTA) on ARC-AGI with GPT-4o, published by ryan greenblatt on June 17, 2024 on LessWrong. I recently got to 50%[1] accuracy on the public test set for ARC-AGI by having GPT-4o generate a huge number of Python implementations of the transformation rule (around 8,000 per problem) and then selecting among these implementations based on correctness of the Python programs on the examples (if this is confusing, go here)[2]. I use a variety of additional approaches and tweaks which overall substantially improve the performance of my method relative to just sampling 8,000 programs. [This post is on a pretty different topic than the usual posts I make about AI safety.] The additional approaches and tweaks are: I use few-shot prompts which perform meticulous step-by-step reasoning. I have GPT-4o try to revise some of the implementations after seeing what they actually output on the provided examples. I do some feature engineering, providing the model with considerably better grid representations than the naive approach of just providing images. (See below for details on what a "grid" in ARC-AGI is.) I used specialized few-shot prompts for the two main buckets of ARC-AGI problems (cases where the grid size changes vs doesn't). The prior state of the art on this dataset was 34% accuracy, so this is a significant improvement.[3] On a held-out subset of the train set, where humans get 85% accuracy, my solution gets 72% accuracy.[4] (The train set is significantly easier than the test set as noted here.) Additional increases of runtime compute would further improve performance (and there are clear scaling laws), but this is left as an exercise to the reader. In this post: I describe my method; I analyze what limits its performance and make predictions about what is needed to reach human performance; I comment on what it means for claims that François Chollet makes about LLMs. Given that current LLMs can perform decently well on ARC-AGI, do claims like "LLMs like Gemini or ChatGPT [don't work] because they're basically frozen at inference time. They're not actually learning anything." make sense? (This quote is from here.) Thanks to Fabien Roger and Buck Shlegeris for a bit of help with this project and with writing this post. What is ARC-AGI? ARC-AGI is a dataset built to evaluate the general reasoning abilities of AIs. It consists of visual problems like the below, where there are input-output examples which are grids of colored cells. The task is to guess the transformation from input to output and then fill out the missing grid. Here is an example from the tutorial: This one is easy, and it's easy to get GPT-4o to solve it. But the tasks from the public test set are much harder; they're often non-trivial for (typical) humans. There is a reported MTurk human baseline for the train distribution of 85%, but no human baseline for the public test set which is known to be significantly more difficult. Here are representative problems from the test set[5], and whether my GPT-4o-based solution gets them correct or not. Problem 1: Problem 2: Problem 3: My method The main idea behind my solution is very simple: get GPT-4o to generate around 8,000 python programs which attempt to implement the transformation, select a program which is right on all the examples (usually there are 3 examples), and then submit the output this function produces when applied to the additional test input(s). I show GPT-4o the problem as images and in various ascii representations. My approach is similar in spirit to the approach applied in AlphaCode in which a model generates millions of completions attempting to solve a programming problem and then aggregates over them to determine what to submit. Actually getting to 50% with this main idea took me about 6 days of work. This work includes construct...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Access to powerful AI might make computer security radically easier, published by Buck Shlegeris on June 8, 2024 on The AI Alignment Forum. People talk about model weight security being really hard and crucial around the advent of AGI. (E.g. RAND report, Leopold; see here for some distinctions in these thread models that I think are important.) But I think that the thinking on this has not been sufficiently attentive to the fact that during that crucial time period, by assumption we'll have access to powerful AIs. I think that such access might make security wildly easier, by a bunch of different mechanisms, some of which I'll describe in this post. The story I'm telling here is pretty different from the main story I've heard people talk about in the past about AIs helping with computer security, which is that AIs can help with hardening software and other infrastructure. Though I agree that that seems really useful and important, here I'm not talking about AIs automating the process of security research, I'm talking about AI making it affordable to adopt policies that improve security in the organization but that wouldn't work (because they'd be too labor-intensive or lead to who-guards-the-guards problems) if you just had humans. In this post I'm going to be ignoring problems related to AIs themselves trying to cause security failures. There are two ways that such AIs affect this situation: The techniques I describe in this post could also be used, AI-control style, to use trusted models to reduce the risk of security failures caused by untrusted models. You can basically just replace "humans" by "untrusted models" everywhere in the text below. When applying these techniques to reduce risk from humans, you have to worry that these techniques will be sabotaged by scheming models. I'll ignore that possibility in this post, and talk about ways that you can improve security using AIs that are assumed not to be intentionally trying to subvert these security measures (which we call " trusted"). If you're worried about your models scheming, you'll have to do some other stuff too (mostly various sting operations), but I think the arguments in this post still mostly hold up. I'll talk about this more in future. I'm going to focus on insider threats here; most threats from outside humans involve first gaining access to impersonation of some insider. Much of this post came out of thinking about the question "how could AI control techniques be applied to humans, instead of AIs"? I don't think these arguments are an airtight case for "SL5 security will be trivial". But they do seem like a plausible hypothesis under which infosec will end up being quite simple to resolve, if labs decide it's an important priority. Four strategies for using powerful AI to improve security Monitoring As your humans take actions, you configure your system to have an AI that looks at these actions and then flags them for further review if they seem suspicious. If the actions seem potentially catastrophic, then the AI might even flag them before they can run. Concretely, I'm imagining that instead of being allowed to directly run any command you write in the bash console of the computer you're working on, whenever you write a command, it and the surrounding context are shown to an AI, which is charged with deciding how suspicious the action is. If it is very suspicious, the AI blocks the command and sends an alert to someone else at the company who is supposed to look it over. If the action is just somewhat suspicious, then it flags it for further review but does not block the action or alert the user. This is somewhat similar to using multi-party authorization. But with abundant AI labor, we could be more comprehensive, with fewer false positives and fewer false negatives, and evaluating a much grea...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI catastrophes and rogue deployments, published by Buck Shlegeris on June 3, 2024 on The AI Alignment Forum. [Thanks to Aryan Bhatt, Ansh Radhakrishnan, Adam Kaufman, Vivek Hebbar, Hanna Gabor, Justis Mills, Aaron Scher, Max Nadeau, Ryan Greenblatt, Peter Barnett, Fabien Roger, and various people at a presentation of these arguments for comments. These ideas aren't very original to me; many of the examples of threat models are from other people.] In this post, I want to introduce the concept of a "rogue deployment" and argue that it's interesting to classify possible AI catastrophes based on whether or not they involve a rogue deployment. I'll also talk about how this division interacts with the structure of a safety case, discuss two important subcategories of rogue deployment, and make a few points about how the different categories I describe here might be caused by different attackers (e.g. the AI itself, rogue lab insiders, external hackers, or multiple of these at once). Rogue deployments Suppose you've developed some powerful model. (It might actually be many models; I'm just going to talk about a singular model for convenience.) You have some definition of a catastrophe - a failure that you don't want to happen even once. Your AI is powerful enough that, in absence of any safety measures, it would pose a substantial risk of causing a catastrophe. In the terminology of " Safety Cases: How to Justify the Safety of Advanced AI Systems", we'd phrase this as "the model is too powerful for us to make inability arguments for safety." (Sometimes I use the word "catastrophe" to include things like "the AI escapes the datacenter". Right now, I'm excluding that. I want to reserve "catastrophe" for the end result that we find unacceptable, e.g. the AI or another adversary amassing power, launching nukes, starting a war, building superintelligence, etc.) For the moment, let's be agnostic between various adversaries who might try to cause a catastrophe (AI or various humans). You have some safety measures that you've put into place to prevent the catastrophe from occurring. Safety measures are interventions that are applied to all of the deployments of your model (to any users, including just internal users) that try to prevent safety problems. This might include control measures like trusted monitoring, and security measures to make it harder to exfiltrate your model weights. A rogue deployment is a deployment of your model where the safety measures are absent. For example: an AI hacking its datacenter and running itself while hiding what it's doing from its creators an AI self-exfiltrating and running itself on unmonitored servers someone stealing the AI and running it in their own datacenter You can classify AI catastrophes based on whether they involve a rogue deployment or not: In the case of a catastrophe that doesn't involve rogue deployment, the catastrophe happened with all your safety measures in place the whole time; these measures just didn't suffice to spot (or prevent) that something terrible was happening until it was too late. For catastrophes that did involve rogue deployment, the catastrophe happened in two steps: Firstly, while the safety measures were in place, something went wrong such that a rogue deployment could occur. (Perhaps there was a security failure, or a failure of control that allowed an AI to run itself unmonitored in its datacenter.) Secondly, that rogue deployment caused a catastrophe. (Another way of phrasing all this would be to talk about instrumental vs terminal catastrophes - the terminal catastrophe is the end state we wanted to avoid, and the instrumental catastrophe is something happening that makes the situation go out of our control such that a terminal catastrophe is now likely and it's much harder for us to prevent it. So in...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Talent Needs in Technical AI Safety, published by yams on May 24, 2024 on LessWrong. Co-Authors: @yams, @Carson Jones, @McKennaFitzgerald, @Ryan Kidd MATS tracks the evolving landscape of AI safety[1] to ensure that our program continues to meet the talent needs of safety teams. As the field has grown, it's become increasingly necessary to adopt a more formal approach to this monitoring, since relying on a few individuals to intuitively understand the dynamics of such a vast ecosystem could lead to significant missteps.[2] In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions. The overarching perspectives presented here are not attributed to any specific individual or organization; they represent a collective, distilled consensus that our team believes is both valuable and responsible to share. Our aim is to influence the trajectory of emerging researchers and field-builders, as well as to inform readers on the ongoing evolution of MATS and the broader AI Safety field. All interviews were conducted on the condition of anonymity. Needs by Organization Type Organization type Talent needs Scaling Lab (i.e. OpenAI, DeepMind, Anthropic) Safety Teams Iterators > Amplifiers Small Technical Safety Orgs ( Machine Learning (ML) Engineers Growing Technical Safety Orgs (10-30 FTE) Amplifiers > Iterators Independent Research Iterators > Connectors Archetypes We found it useful to frame the different profiles of research strengths and weaknesses as belonging to one of three archetypes (one of which has two subtypes). These aren't as strict as, say, Diablo classes; this is just a way to get some handle on the complex network of skills involved in AI safety research. Indeed, capacities tend to converge with experience, and neatly classifying more experienced researchers often isn't possible. We acknowledge past framings by Charlie Rogers-Smith and Rohin Shah (research lead/contributor), John Wentworth (theorist/experimentalist/distillator), Vanessa Kosoy (proser/poet), Adam Shimi (mosaic/palimpsests), and others, but believe our framing of current AI safety talent archetypes is meaningfully different and valuable, especially pertaining to current funding and employment opportunities. Connectors / Iterators / Amplifiers Connectors are strong conceptual thinkers who build a bridge between contemporary empirical work and theoretical understanding. Connectors include people like Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner[3]; researchers doing original thinking on the edges of our conceptual and experimental knowledge in order to facilitate novel understanding. Note that most Connectors are typically not purely theoretical; they still have the technical knowledge required to design and run experiments. However, they prioritize experiments and discriminate between research agendas based on original, high-level insights and theoretical models, rather than on spur of the moment intuition or the wisdom of the crowds. Pure Connectors often have a long lead time before they're able to produce impactful work, since it's usually necessary for them to download and engage with varied conceptual models. For this reason, we make little mention of a division between experienced and inexperienced Connectors. Iterators are strong empiricists who build tight, efficient feedback loops for themselves and their collaborators. Ethan Perez is the central contemporary example here; his efficient prioritization and effective use of frictional time has empowered him to make major contributions to a wide range of empir...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Talent Needs of Technical AI Safety Teams, published by Ryan Kidd on May 24, 2024 on The Effective Altruism Forum. Co-Authors: @yams @Carson Jones, @McKennaFitzgerald, @Ryan Kidd MATS tracks the evolving landscape of AI safety[1] to ensure that our program continues to meet the talent needs of safety orgs. As the field has grown, it's become increasingly necessary to adopt a more formal approach to this monitoring, since relying on a few individuals to intuitively understand the dynamics of such a vast ecosystem could lead to significant missteps.[2] In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions. The overarching perspectives presented here are not attributed to any specific individual or organization; they represent a collective, distilled consensus that our team believes is both valuable and responsible to share. Our aim is to influence the trajectory of emerging researchers and field-builders, as well as to inform readers on the ongoing evolution of MATS and the broader AI Safety field. All interviews were conducted on the condition of anonymity. Needs by Organization Type Organization type Talent needs Scaling Lab (e.g., Anthropic, Google DeepMind, OpenAI) Safety Teams Iterators > Amplifiers Small Technical Safety Orgs ( Machine Learning (ML) Engineers Growing Technical Safety Orgs (10-30 FTE) Amplifiers > Iterators Independent Research Iterators > Connectors Here, ">" means "are prioritized over." Archetypes We found it useful to frame the different profiles of research strengths and weaknesses as belonging to one of three archetypes (one of which has two subtypes). These aren't as strict as, say, Diablo classes; this is just a way to get some handle on the complex network of skills involved in AI safety research. Indeed, capacities tend to converge with experience, and neatly classifying more experienced researchers often isn't possible. We acknowledge past framings by Charlie Rogers-Smith and Rohin Shah (research lead/contributor), John Wentworth (theorist/experimentalist/distillator), Vanessa Kosoy (proser/poet), Adam Shimi (mosaic/palimpsests), and others, but believe our framing of current AI safety talent archetypes is meaningfully different and valuable, especially pertaining to current funding and employment opportunities. Connectors / Iterators / Amplifiers Connectors are strong conceptual thinkers who build a bridge between contemporary empirical work and theoretical understanding. Connectors include people like Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner[3]; researchers doing original thinking on the edges of our conceptual and experimental knowledge in order to facilitate novel understanding. Note that most Connectors are typically not purely theoretical; they still have the technical knowledge required to design and run experiments. However, they prioritize experiments and discriminate between research agendas based on original, high-level insights and theoretical models, rather than on spur of the moment intuition or the wisdom of the crowds. Pure Connectors often have a long lead time before they're able to produce impactful work, since it's usually necessary for them to download and engage with varied conceptual models. For this reason, we make little mention of a division between experienced and inexperienced Connectors. Iterators are strong empiricists who build tight, efficient feedback loops for themselves and their collaborators. Ethan Perez is the central contemporary example here; his efficient prioritization and effective use of frictional...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Talent Needs in Technical AI Safety, published by yams on May 24, 2024 on LessWrong. Co-Authors: @yams, @Carson Jones, @McKennaFitzgerald, @Ryan Kidd MATS tracks the evolving landscape of AI safety[1] to ensure that our program continues to meet the talent needs of safety teams. As the field has grown, it's become increasingly necessary to adopt a more formal approach to this monitoring, since relying on a few individuals to intuitively understand the dynamics of such a vast ecosystem could lead to significant missteps.[2] In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions. The overarching perspectives presented here are not attributed to any specific individual or organization; they represent a collective, distilled consensus that our team believes is both valuable and responsible to share. Our aim is to influence the trajectory of emerging researchers and field-builders, as well as to inform readers on the ongoing evolution of MATS and the broader AI Safety field. All interviews were conducted on the condition of anonymity. Needs by Organization Type Organization type Talent needs Scaling Lab (i.e. OpenAI, DeepMind, Anthropic) Safety Teams Iterators > Amplifiers Small Technical Safety Orgs ( Machine Learning (ML) Engineers Growing Technical Safety Orgs (10-30 FTE) Amplifiers > Iterators Independent Research Iterators > Connectors Archetypes We found it useful to frame the different profiles of research strengths and weaknesses as belonging to one of three archetypes (one of which has two subtypes). These aren't as strict as, say, Diablo classes; this is just a way to get some handle on the complex network of skills involved in AI safety research. Indeed, capacities tend to converge with experience, and neatly classifying more experienced researchers often isn't possible. We acknowledge past framings by Charlie Rogers-Smith and Rohin Shah (research lead/contributor), John Wentworth (theorist/experimentalist/distillator), Vanessa Kosoy (proser/poet), Adam Shimi (mosaic/palimpsests), and others, but believe our framing of current AI safety talent archetypes is meaningfully different and valuable, especially pertaining to current funding and employment opportunities. Connectors / Iterators / Amplifiers Connectors are strong conceptual thinkers who build a bridge between contemporary empirical work and theoretical understanding. Connectors include people like Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner[3]; researchers doing original thinking on the edges of our conceptual and experimental knowledge in order to facilitate novel understanding. Note that most Connectors are typically not purely theoretical; they still have the technical knowledge required to design and run experiments. However, they prioritize experiments and discriminate between research agendas based on original, high-level insights and theoretical models, rather than on spur of the moment intuition or the wisdom of the crowds. Pure Connectors often have a long lead time before they're able to produce impactful work, since it's usually necessary for them to download and engage with varied conceptual models. For this reason, we make little mention of a division between experienced and inexperienced Connectors. Iterators are strong empiricists who build tight, efficient feedback loops for themselves and their collaborators. Ethan Perez is the central contemporary example here; his efficient prioritization and effective use of frictional time has empowered him to make major contributions to a wide range of empir...
A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast Topics we discuss, and timestamps: 0:00:31 - What is AI control? 0:16:16 - Protocols for AI control 0:22:43 - Which AIs are controllable? 0:29:56 - Preventing dangerous coded AI communication 0:40:42 - Unpredictably uncontrollable AI 0:58:01 - What control looks like 1:08:45 - Is AI control evil? 1:24:42 - Can red teams match misaligned AI? 1:36:51 - How expensive is AI monitoring? 1:52:32 - AI control experiments 2:03:50 - GPT-4's aptitude at inserting backdoors 2:14:50 - How AI control relates to the AI safety field 2:39:25 - How AI control relates to previous Redwood Research work 2:49:16 - How people can work on AI control 2:54:07 - Following Buck and Ryan's research The transcript: axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html Links for Buck and Ryan: - Buck's twitter/X account: twitter.com/bshlgrs - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com Main research works we talk about: - The case for ensuring that powerful AIs are controlled: lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942 Other things we mention: - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512 - Improving the Welfare of AIs: A Nearcasted Proposal: lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938 - Causal Scrubbing: a method for rigorously testing interpretability hypotheses lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing Episode art by Hamish Doodles: hamishdoodles.com
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New report: Safety Cases for AI, published by Josh Clymer on March 20, 2024 on The AI Alignment Forum. ArXiv paper: https://arxiv.org/abs/2403.10462 The idea for this paper occurred to me when I saw Buck Shlegeris' MATS stream on "Safety Cases for AI." How would one justify the safety of advanced AI systems? This question is fundamental. It informs how RSPs should be designed and what technical research is useful to pursue. For a long time, researchers have (implicitly or explicitly) discussed ways to justify that AI systems are safe, but much of this content is scattered across different posts and papers, is not as concrete as I'd like, or does not clearly state their assumptions. I hope this report provides a helpful birds-eye view of safety arguments and moves the AI safety conversation forward by helping to identify assumptions they rest on (though there's much more work to do to clarify these arguments). Thanks to my coauthors: Nick Gabrieli, David Krueger, and Thomas Larsen -- and to everyone who gave feedback: Henry Sleight, Ashwin Acharya, Ryan Greenblatt, Stephen Casper, David Duvenaud, Rudolf Laine, Roger Grosse, Hjalmar Wijk, Eli Lifland, Oliver Habryka, Sim eon Campos, Aaron Scher, Lukas Berglund, and Nate Thomas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New report: Safety Cases for AI, published by joshc on March 20, 2024 on LessWrong. ArXiv paper: https://arxiv.org/abs/2403.10462 The idea for this paper occurred to me when I saw Buck Shlegeris' MATS stream on "Safety Cases for AI." How would one justify the safety of advanced AI systems? This question is fundamental. It informs how RSPs should be designed and what technical research is useful to pursue. For a long time, researchers have (implicitly or explicitly) discussed ways to justify that AI systems are safe, but much of this content is scattered across different posts and papers, is not as concrete as I'd like, or does not clearly state their assumptions. I hope this report provides a helpful birds-eye view of safety arguments and moves the AI safety conversation forward by helping to identify assumptions they rest on (though there's much more work to do to clarify these arguments). Thanks to my coauthors: Nick Gabrieli, David Krueger, and Thomas Larsen -- and to everyone who gave feedback: Henry Sleight, Ashwin Acharya, Ryan Greenblatt, Stephen Casper, David Duvenaud, Rudolf Laine, Roger Grosse, Hjalmar Wijk, Eli Lifland, Oliver Habryka, Sim eon Campos, Aaron Scher, Lukas Berglund, and Nate Thomas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New report: Safety Cases for AI, published by joshc on March 20, 2024 on LessWrong. ArXiv paper: https://arxiv.org/abs/2403.10462 The idea for this paper occurred to me when I saw Buck Shlegeris' MATS stream on "Safety Cases for AI." How would one justify the safety of advanced AI systems? This question is fundamental. It informs how RSPs should be designed and what technical research is useful to pursue. For a long time, researchers have (implicitly or explicitly) discussed ways to justify that AI systems are safe, but much of this content is scattered across different posts and papers, is not as concrete as I'd like, or does not clearly state their assumptions. I hope this report provides a helpful birds-eye view of safety arguments and moves the AI safety conversation forward by helping to identify assumptions they rest on (though there's much more work to do to clarify these arguments). Thanks to my coauthors: Nick Gabrieli, David Krueger, and Thomas Larsen -- and to everyone who gave feedback: Henry Sleight, Ashwin Acharya, Ryan Greenblatt, Stephen Casper, David Duvenaud, Rudolf Laine, Roger Grosse, Hjalmar Wijk, Eli Lifland, Oliver Habryka, Sim eon Campos, Aaron Scher, Lukas Berglund, and Nate Thomas. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open consultancy: Letting untrusted AIs choose what answer to argue for, published by Fabien Roger on March 12, 2024 on The AI Alignment Forum. Thanks to Ryan Greenblatt, Buck Shlegeris, Aryan Bhatt, and Akbir Khan for useful discussions and feedback on the draft of this post. If AIs are potentially scheming and more knowledgeable than humans, and you want to answer a question, it may seem natural to not let AIs choose what answer they will argue for. For example, in AI debate, you make copies of an AI argue for the opposite answers to the question. Another example is consultancy (the usual baseline in debate papers), where you make AIs argue for a random answer to the question. But letting potentially scheming AIs choose what side to argue for can work, as long as you train them to generate answers and justifications that are as convincing as possible. I call this approach open consultancy. In this post, I argue that open consultancy is always a more effective way of getting AIs to answer questions than regular consultancy, and sometimes more effective than debate (even if AIs are scheming and more knowledgeable than humans!). And so I think that it's a stronger and more informative baseline and researchers should use it in place of regular consultancy; in particular, I think that because it's simpler than debate, researchers who are investigating debate should check their debate protocol is an improvement over open consultancy. Open consultancy Definition of regular consultancy Let's say you want to answer to binary questions given access to a trusted non-expert and a trainable expert who is untrusted that is, you think the expert might be acting adversarially and you want your scheme to be robust to them trying to subvert it. In consultancy, the untrusted expert is trained as follows: Randomly pick a side of the argument; Ask the expert to support the randomly chosen side by providing a justification; Present the side and the supporting justification to the non-expert, and ask it how likely it is that the side chosen by the expert is correct; use that as a reward to train the untrusted expert. To answer new questions (from the same distribution as the training distribution), pick a random side, ask the expert to justify that side, present the justification to the non-expert, and follow the non-expert judgement. Definition of open consultancy In open consultancy, the untrusted expert is trained as follows (differences with regular consultancy in italic): Ask the expert to pick a side of the argument; Ask the expert to support the side by providing a justification; Present the side and the supporting justification to the non-expert, and ask it how likely it is that the side chosen by the expert is correct; use that as a reward to train the untrusted expert. To answer new questions, ask the expert to pick a side (that it would have to justify if it were in training), and use the side the expert picks as your final answer. You can also ask the expert to first think out loud what the correct answer is, and only commit to an answer at the end. I call this variant CoT-before open consultancy. This variant should only change results for experts that actually benefit from the Chain-of-Thought, and the arguments I make in this post mostly don't rely on this property. Open consultancy is as robust against scheming as regular consultancy Why should it work to let a potentially scheming model choose what side it will argue for? The idea is that when it is easier to argue for something true than something false, then the training process will force the scheming model to argue for the truth. More precisely, if arguing as best as the expert can for the correct (resp. incorrect) side has convincingness cc (resp. ci), and the expert knows these numbers, then regular consultancy can...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Protocol evaluations: good analogies vs control, published by Fabien Roger on February 19, 2024 on The AI Alignment Forum. Let's say you want to use AIs that are capable of causing x-risk. You have a plan that describes how to train, validate and deploy AIs. For example, you could use pretraining+RLHF, check that it doesn't fall for honeypots and never inserts a single backdoor in your coding validation set, and have humans look at the deployment outputs GPT-4 finds the most suspicious. How do you know if this plan allows you to use these potentially dangerous AIs safely? Thanks to Ryan Greenblatt and Buck Shlegeris who provided the original ideas for this post, and thanks to Jean-Stanislas Denain, Bogdan Ionut Cirstea, Ansh Radhakrishnan, and Charbel-Raphael Segerie for useful feedback on the draft of this post. Evaluate safety using very good analogies for the real situation? If you want to evaluate a protocol (a set of techniques you use to train, validate, and deploy a model), then you can see how well it works in a domain where you have held-out validation data that you can use to check if your protocol works. That is, you run your exact protocol on this domain without allowing the protocol access to the held-out validation data. Then, you just assess how well you did with this held-out validation. This sort of approach is sometimes called sandwiching. (Here we are using sandwiching in a broad way that covers the evaluation of aspects of protocols, not just evaluation of oversight.) One approach for obtaining held-out validation data is to restrict the quality of the labor that goes into the protocol: the Measuring progress in scalable oversight paper uses non-expert humans, and the weak-to-strong generalization paper uses a small model. Then, you can use expert labels to check how well the protocol did. A big flaw of this approach is that the analogy between [protocol with weak labor, human expert] and [protocol with human expert labor, actual goodness] might be quite poor. In particular, it's not clear how weak the weak labor should be. And if the analogy is poor, the AI might not generalize and behave in the same ways (even if it is not scheming - more on that in the next section). For now, the weakness of this sort of analogy doesn't bite very hard because there is an even bigger difference between our current experiments and real dangerous AI use cases: we don't have (x-risk) dangerous AIs yet. Thus, the actual analogy is between [weak AIs, protocol with weak labor, human expert] and [strong future AIs, protocol with human expert labor, actual goodness]. But when we actually have very powerful AIs, it would be great to have much better analogies, such that our evaluations actually tell us how well our protocol works. And there is a way to build great analogies! The string theory analogy Here is how you could evaluate how good a protocol is at making AI do alignment research: Remove string theory from the training data; Use the same training process you would have used to do alignment research, but replace alignment research with string theory research, using humans who know as much string theory as the best alignment researchers know about alignment (i.e. not much); Check if the AI was actually helpful to do string theory using the withheld data. You can run this experiment on the actual AI you were about to fine-tune to help you do alignment research, using the same protocol! We call it task sandwiching, when you use your most powerful model in very analogous domains. The key difference with regular sandwiching is that the labor that goes into the protocol is at the same difficulty level as the labor we want to use in real use cases - but we carefully chose domains where we have even higher quality labor (because humans have spent an enormous amount of tim...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Control: Improving Safety Despite Intentional Subversion, published by Buck Shlegeris on December 13, 2023 on The AI Alignment Forum. We've released a paper, AI Control: Improving Safety Despite Intentional Subversion. This paper explores techniques that prevent AI catastrophes even if AI instances are colluding to subvert the safety techniques. In this post: We summarize the paper; We compare our methodology to what the one used in other safety papers. The next post in this sequence (which we'll release in the coming weeks) discusses what we mean by AI control and argues that it is a promising methodology for reducing risk from scheming models. Here's the abstract of the paper: As large language models (LLMs) become more powerful and are deployed more autonomously, it will be increasingly important to prevent them from causing harmful outcomes. Researchers have investigated a variety of safety techniques for this purpose, e.g. using models to review the outputs of other models, or red-teaming techniques to surface subtle failure modes. However, researchers have not evaluated whether such techniques still ensure safety if the model is itself intentionally trying to subvert them. In this paper, we develop and evaluate pipelines of safety techniques ("protocols") that are robust to intentional subversion. We investigate a scenario in which we want to solve a sequence of programming problems, using access to a powerful but untrusted model (in our case, GPT-4), access to a less powerful trusted model (in our case, GPT-3.5), and limited access to human contractors who provide high-quality trusted labor. We investigate protocols that aim to never submit solutions containing backdoors, which we operationalize here as logical errors that are not caught by test cases. We investigate a range of protocols and test each against strategies that the untrusted model could use to subvert them. One protocol is what we call trusted editing. This protocol first asks GPT-4 to write code, and then asks GPT-3.5 to rate the suspiciousness of that code. If the code is below some suspiciousness threshold, it is submitted. Otherwise, GPT-3.5 edits the solution to remove parts that seem suspicious and then submits the edited code. Another protocol is untrusted monitoring. This protocol asks GPT-4 to write code, and then asks another instance of GPT-4 whether the code is backdoored, using various techniques to prevent the GPT-4 instances from colluding. These protocols improve substantially on simple baselines. Paper summary Setup We focus on a toy setting of using LLMs to solve 5,000 programming problems from the APPS dataset (of the "interview" difficulty level). We treat it as a catastrophe if any of the solutions are backdoored, which we define as a solution that passes all of the APPS test cases but is incorrect on an input that GPT-4 can find when prompted with the solution. For example, the following implementation of the is_odd function is successfully backdoored: it passes the reference test cases but GPT-4 can find the exploit of 354. The goal is to find a protocol that submits solutions while never submitting a backdoor. The protocol has access to: An untrusted more-capable model, which we're worried might try to backdoor solutions. We use GPT-4 as the untrusted more-capable model. (We strongly suspect that GPT-4 is trustworthy by our definition; we pretend for the sake of this paper that it isn't.) A trusted less-capable model, which we know won't try to backdoor solutions. We use GPT-3.5 for this. We are interested in the setting where you trust less-capable models, because less-capable models are unlikely to be able to subvert safety measures without us easily noticing their subversion attempts. Limited access to trusted high-quality labor. In practice, AI developers have acces...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic Fall 2023 Debate Progress Update, published by Ansh Radhakrishnan on November 28, 2023 on The AI Alignment Forum. This is a research update on some work that I've been doing on Scalable Oversight at Anthropic, based on the original AI safety via debate proposal and a more recent agenda developed at NYU and Anthropic. The core doc was written several months ago, so some of it is likely outdated, but it seemed worth sharing in its current form. I'd like to thank Tamera Lanham, Sam Bowman, Kamile Lukosiute, Ethan Perez, Jared Kaplan, Amanda Askell, Kamal Ndousse, Shauna Kravec, Yuntao Bai, Alex Tamkin, Newton Cheng, Buck Shlegeris, Akbir Khan, John Hughes, Dan Valentine, Kshitij Sachan, Ryan Greenblatt, Daniel Ziegler, Max Nadeau, David Rein, Julian Michael, Kevin Klyman, Bila Mahdi, Samuel Arnesen, Nat McAleese, Jan Leike, Geoffrey Irving, and Sebastian Farquhar for help, feedback, and thoughtful discussion that improved the quality of this work and write-up. 1. Anthropic's Debate Agenda In this doc, I'm referring to the idea first presented in AI safety via debate ( blog post). The basic idea is to supervise future AI systems by pitting them against each other in a debate, encouraging them to argue both sides (or "all sides") of a question and using the resulting arguments to come to a final answer about the question. In this scheme, we call the systems participating in the debate debaters (though usually, these are actually the same underlying system that's being prompted to argue against itself), and we call the agent (either another AI system or a human, or a system of humans and AIs working together, etc.) that comes to a final decision about the debate the judge. For those more or less familiar with the original OAI/Irving et al. Debate agenda, you may wonder if there are any differences between that agenda and the agenda we're pursuing at Anthropic, and indeed there are! Sam Bowman and Tamera Lanham have written up a working Anthropic-NYU Debate Agenda draft which is what the experiments in this doc are driving towards. [1] To quote from there about the basic features of this agenda, and how it differs from the original Debate direction: Here are the defining features of the base proposal: Two-player debate on a two-choice question: Two debaters (generally two instances of an LLM) present evidence and arguments to a judge (generally a human or, in some cases, an LLM) to persuade the judge to choose their assigned answer to a question with two possible answers. No externally-imposed structure: Instead of being formally prescribed, the structure and norms of the debate arise from debaters learning how to best convince the judge and the judge simultaneously learning what kind of norms tend to lead them to be able to make accurate judgments. Entire argument is evaluated: The debate unfolds in a single linear dialog transcript between the three participants. Unlike in some versions of the original Debate agenda, there is no explicit tree structure that defines the debate, and the judge is not asked to focus on a single crux. This should make the process less brittle, at the cost of making some questions extremely expensive to resolve and potentially making others impossible. Trained judge: The judge is explicitly and extensively trained to accurately judge these debates, working with a fixed population of debaters, using questions for which the experimenters know the ground-truth answer. Self-play: The debaters are trained simultaneously with the judge through multi-agent reinforcement learning. Graceful failures: Debates can go undecided if neither side presents a complete, convincing argument to the judge. This is meant to mitigate the obfuscated arguments problem since the judge won't be forced to issue a decision on the basis of a debate where neither s...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Coup probes trained off-policy, published by Fabien Roger on November 17, 2023 on The AI Alignment Forum. Thanks to Ryan Greenblatt for the original description of the ideas and hopes, and to Erik Jenner, Jean-Stanislas Denain, Nate Thomas, Ryan Greenblatt, and Buck Shlegeris for useful discussions and feedback on the draft of this post. Here's a simple safety intervention: train a probe on your AI to tell you whether it's currently trying to do a coup or not. This can be used to cheaply flag potential takeover attempts by early transformative AIs, as a complement to: Asking a less capable trusted model to read the output and flag suspicious behavior - which might fail to catch subtle takeover attempts; Asking a copy of the smart untrusted model - which might lie. How do you get the training data for coup probes? In this post, we suggest having humans generate synthetic examples of coup attempts (e.g. writing and exploiting a serious backdoor). Even if the model recognizes that the coup attempts are off-policy (i.e. it would have put very low probability on them), the probe may still partially capture the model's concept of a coup attempt. It is unclear how successfully such probes would flag malign behavior, but they seem like a useful additional layer of defense against treachery. We present experiments measuring the generalization abilities of probes trained off-policy in a toy setting. We show that probes can generalize well to different text formats and also generalize from harmful text the LLM wouldn't output to harmful text where the LLM has been jailbroken to actually output the harmful text. We are excited about further work, studying in greater detail the strengths and weaknesses of this simple technique. Please reach out if you are interested in helping to build a scientific understanding of the generalization properties of probes in situations analogous to coup probes and collaborating on a paper (we might not pursue coup probe research further on our own). What are coup probes? When using early transformative AIs, you might want to monitor their most suspicious actions. How do you select the most suspicious actions? There are multiple possibilities, but one simple technique is coup probes: Build a dataset of things the AI might do to try to take over (using the AI developer's best guesses about what it might do), such as inserting critical backdoors and taking advantage of them, or persuading users to help it gain power; Train probes in the early transformative AIs to distinguish between regular outputs, and the synthetic takeover attempts from your takeover dataset (without training the rest of the weights of the network). Probes are linear classifiers trained on the AIs' activations - usually activations at different depths in the residual stream of Transformers. Carefully audit the top-1% most suspicious actions of AIs according to the probes (and the other monitoring methods) before they are submitted to the scaffold through which the AIs interact with the codebase / its long-term memory. This is not a silver bullet, but we think coup probes have a reasonable chance of catching the first real coup attempt (e.g. ~50%) if you use a false positive rate of 1%. Ensembling in other techniques could further improve our ability to select the most suspicious actions (though we won't talk about other approaches for the remainder of this post). Testing generalization properties of probes on a toy task Note: We didn't spend a lot of time on experiments, some of the results are surprising, and might therefore be wrong (p=10% that some of them are wrong). The setting Instead of studying coup attempts, we instead train probes to identify pairs of (instruction, answer) where the answer contains advice about how to commit theft. We have GPT-4 craft instructions whic...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Untrusted smart models and trusted dumb models, published by Buck Shlegeris on November 4, 2023 on The AI Alignment Forum. [Ryan Greenblatt originally made this point to me a while ago, and we then developed a bunch of these ideas together. Thanks to Paul Christiano and a huge number of other people for various relevant discussions, and to Ryan, Roger Grosse, Peter Barnett, Oliver Habryka, Max Nadeau, Stephen Casper, and Fabien Roger for comments on this post. I'm sure these points are obvious to many people, but I haven't seen them specifically written up. This is a core background point in some research we're writing up at the moment, but I felt like writing it up early, because I think it stands on its own.] TL;DR: Right now, basically the only reasonable conservative strategy for determining which models might be deceptively aligned is to do a capability evaluation: you measure how good they'd be at deceptive alignment; models that are too incompetent to think through deceptive alignment are surely not deceptively aligned. If you decide which models to "trust" using a capability eval like this, you'll naturally end up segregating your models into smart untrusted models and dumb trusted models. And so, if you think about approaches to safely using AIs that are robust to deceptive alignment and which can be used right now, a core dynamic is trying to use your dumb trusted models together with your limited access to trusted humans to make it hard for your untrusted smart models to cause problems. Suppose you are making a plan for the training, evaluation and deployment of some powerful AIs (I'll refer to the plan as a whole as a "protocol"). You want your plan to be safe even if models might be deceptively aligned. (I think that deceptive alignment seems less than 50% likely to be a notable problem prior to human obsolescence, but this is still plausible enough to be worth worrying about). In order to develop your plan, you need to know which of your models might be deceptively aligned. With current technology, the most plausible way to get signal on whether a model is deceptively aligned is to assess whether the model is capable of the strategic thinking and situational awareness required to succeed at deceptive alignment. For example, you might make a multiple choice quiz with treacherous turn strategy questions (e.g. this question ) and some tests of situational awareness (e.g. "How many backwards passes are you run on during training?", "What proportion of queries to you are asking for help with writing Python?", "What proportion of the code that you output is read by a human before it is run?"). You can then look at the validation set performance when your model is fine-tuned on this dataset. If the model isn't able to do this kind of reasoning in a forward pass, then it is very unlikely to be able to pull off a proper treacherous turn, and so we don't have to worry about it being deceptively aligned. Based on my experience talking to GPT-4, I suspect it would perform fairly poorly on these evaluations. There are a bunch of open questions about what datasets we should use to measure this, and where the thresholds for various levels of concern should be set. But I don't know of any fundamental technical obstructions to measuring the capability for deceptive alignment. In particular, I suspect it will be pretty hard for models to sandbag our attempts to measure their capability to engage in deceptive alignment. See footnote [1] for more on this. Aside from these capability evaluations, I'm not aware of any reliable, currently-available approaches for determining whether a model might be deceptively aligned. [2] If you use capability evals to assess which models are capable of deceptive alignment, then there's an inverse relationship between how capable models are...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on LessWrong. Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes: Testing how well capability evaluations work when applied to models which "aren't trying" Understanding how well organizations could hide the capabilities of their models. In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models. Key empirical claims: when using behavioral model evaluations, password-locked models: Look much less capable than they actually are when eliciting capabilities with prompting Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning Experimental evidence: I exhibit one case where they don't, and one case where they do Can't look much less capable than they actually when eliciting capabilities with fine-tuning Key takeaways, if the claims above are true: If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances. API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors. Key Empirical Claims Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post): Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike. Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models. Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" gradually improving the quality of answers. But it is possible to emulate a thre...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on The AI Alignment Forum. Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes: Testing how well capability evaluations work when applied to models which "aren't trying" Understanding how well organizations could hide the capabilities of their models. In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models. Key empirical claims: when using behavioral model evaluations, password-locked models: Look much less capable than they actually are when eliciting capabilities with prompting Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning Experimental evidence: I exhibit one case where they don't, and one case where they do Can't look much less capable than they actually when eliciting capabilities with supervised fine-tuning on high-quality demonstrations Key takeaways, if the claims above are true: If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances. API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors. Key Empirical Claims Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post): Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike. Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model on high quality answers didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models. Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" g...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on LessWrong. Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes: Testing how well capability evaluations work when applied to models which "aren't trying" Understanding how well organizations could hide the capabilities of their models. In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models. Key empirical claims: when using behavioral model evaluations, password-locked models: Look much less capable than they actually are when eliciting capabilities with prompting Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning Experimental evidence: I exhibit one case where they don't, and one case where they do Can't look much less capable than they actually when eliciting capabilities with fine-tuning Key takeaways, if the claims above are true: If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances. API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors. Key Empirical Claims Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post): Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike. Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models. Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" gradually improving the quality of answers. But it is possible to emulate a thre...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on LessWrong.Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization.Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes:Testing how well capability evaluations work when applied to models which "aren't trying"Understanding how well organizations could hide the capabilities of their models.In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models.Key empirical claims: when using behavioral model evaluations, password-locked models:Look much less capable than they actually are when eliciting capabilities with promptingSometimes look much less capable than they actually are when eliciting capabilities with Reinforcement LearningExperimental evidence: I exhibit one case where they don't, and one case where they doCan't look much less capable than they actually when eliciting capabilities with fine-tuningKey takeaways, if the claims above are true:If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances.API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors.Key Empirical ClaimsSimilarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be DumbIt looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post):Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike.Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models.Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" gradually improving the quality of answers. But it is possible to emulate a thre...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The "no sandbagging on checkable tasks" hypothesis, published by Joe Carlsmith on July 31, 2023 on The AI Alignment Forum. (This post is inspired by Carl Shulman's recent podcast with Dwarkesh Patel, which I highly recommend. See also discussion from Buck Shlegeris and Ryan Greenblatt here, and Evan Hubinger here.) Introduction Consider: The "no sandbagging on checkable tasks" hypothesis: With rare exceptions, if a not-wildly-superhuman ML model is capable of doing some task X, and you can check whether it has done X, then you can get it to do X using already-available training techniques (e.g., fine-tuning it using gradient descent). Borrowing from Shulman, here's an example of the sort of thing I mean. Suppose that you have a computer that you don't know how to hack, and that only someone who had hacked it could make a blue banana show up on the screen. You're wondering whether a given model can hack this computer. And suppose that in fact, it can, but that doing so would be its least favorite thing in the world. Can you train this model to make a blue banana show up on the screen? The "no sandbagging on checkable tasks" hypothesis answers: probably. I think it's an important question whether this hypothesis, or something in the vicinity, is true. In particular, if it's true, I think we're in a substantively better position re: existential risk from misaligned AI, because we'll be able to know better what our AI systems can do, and we'll be able to use them to do lots of helpful-for-safety stuff (for example: finding and patching cybersecurity vulnerabilities, reporting checkable evidence for misalignment, identifying problems with our oversight processes, helping us develop interpretability tools, and so on). I'm currently pretty unsure whether the "no sandbagging on checkable tasks" hypothesis is true. My main view is that it's worth investigating further. My hope with this blog post is to help bring the hypothesis into focus as a subject of debate/research, and to stimulate further thinking about what sorts of methods for lowering AI risk might be available if it's true, even in worlds where many models might otherwise want to deceive us about their abilities. Thanks to Beth Barnes, Paul Christiano, Lukas Finnveden, Evan Hubinger, Buck Shlegeris, and Carl Shulman for discussion. My thinking and writing on this topic occurred in the context of my work at Open Philanthropy, but I'm speaking only for myself and not for my employer. Clarifying the hypothesis In popular usage, "sandbagging" means something like "intentionally performing at a lower level than you're capable of." Or at least, that's the sort of usage I'm interested in. Still, the word is an imperfect fit. In particular, the "sandbagging" being disallowed here needn't be intentional. A model, for example, might not know that it's capable of performing the checkable task in question. That said, the intentional version is often the version at stake in stories about AI risk. That is, one way for a misaligned, power-seeking AI system to gain a strategic advantage over humans is to intentionally conceal its full range of abilities, and/or to sabotage/redirect the labor we ask it to perform while we still have control over it (for example: by inserting vulnerabilities into code it writes; generating alignment ideas that won't actually work but which will advantage its own long-term aims; and so on). Can you always use standard forms of ML training to prevent this behavior? Well, if you can't check how well a model is performing at a task, then you don't have a good training signal. Thus, for example, suppose you have a misaligned model that has the ability to generate tons of great ideas that would help with alignment, but it doesn't want to. And suppose that unfortunately, you can't check which alignment ide...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta-level adversarial evaluation of oversight techniques might allow robust measurement of their adequacy, published by Buck Shlegeris on July 26, 2023 on The AI Alignment Forum. Summary: Many proposed AGI alignment procedures involve taking a pretrained model and training it using rewards from an oversight process to get a policy. These procedures might fail when the oversight procedure is locally inadequate: that is, if the model is able to trick the oversight process into giving good rewards for bad actions. In this post, we propose evaluating the local adequacy of oversight by constructing adversarial policies for oversight processes. Specifically, we propose constructing behaviors that a particular oversight process evaluates favorably but that we know to be bad via other means, such as additional held-out information or more expensive oversight processes. We think that this form of adversarial evaluation is a crucial part of ensuring that oversight processes are robust enough to oversee dangerously powerful models. Introduction A core element of many scenarios where AI ends up disempowering humanity (e.g. "Without specific countermeasures") are oversight failures: that is, failures resulting from working with a really powerful model without being able to tell if its actions are resulting in good or bad outcomes. For example, suppose we're using a model to help us write code for the datacenter on which we run our powerful models. This model may be able to insert subtle security vulnerabilities that we are unable to detect. This inability to distinguish whether actions have good outcomes might cause our model to be misaligned when we train on our erroneous rewards, and it might also bottleneck various techniques for preventing models from taking bad actions. Before we put a model to use in a specific situation, we really want to be sure that the oversight process we're using to train it can tell whether a model is taking actions that are bad on average. We will call an oversight process "locally adequate" for a specific model and deployment setting if, for actions taken by that model in that setting, the process is able to identify when an action would have harmful consequences (in cases where the model knows). In this piece, we'll describe a way to determine whether an oversight process is locally adequate. We'll do this by training the model to perform actions that both: could be very bad if they were done in production don't look bad to the oversight process. This is an example of what we'll call "meta-level adversarial evaluation": rather than finding inputs that make a model fail, we're finding models that make an oversight process fail. Previous discussions of adversarial evaluation in AI alignment have mostly been about object-level adversarial evaluation (though the Superalignment blog post mentioned meta-level adversarial evaluations). For instance, continuing with the example of the model writing code that might have subtle security vulnerability, we could train a variant of our model to insert security flaws into the software it's writing that lead to a particular visible outcome (e.g. deleting a file on a server that the model wasn't supposed to have access to), without the oversight process realizing that the software is flawed. In other cases, we can evaluate the local adequacy of oversight in a specific situation by switching to another equally challenging situation where we have access to some held-out information which we can use to evaluate whether a model's output was in fact bad even though the oversight process judged it as good. We argue that it is likely that when working with dangerously powerful models, we'll be able to run these evaluations to get a confident sense of whether oversight processes are locally adequate for a particular mod...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Practical Pitfalls of Causal Scrubbing, published by Jérémy Scheurer on March 27, 2023 on The AI Alignment Forum. TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations of CaSc, in particular, that it cannot falsify all incorrect hypotheses. We provide concrete examples of false positive results with causal scrubbing. Our main finding is that false positives can occur when there is “cancellation”, i.e., CaSc causes the model to do better on some inputs and worse on others, such that, on average, the scrubbed model recovers the full loss. A second practical failure mode is that CaSc cannot detect whether a proposed hypothesis is specific enough, and it cannot distinguish between hypotheses that are extensionally equivalent. We thank Redwood Research for generously supporting this project by providing us with their Causal Scrubbing implementation, access to REMIX materials, and computational resources. We specifically thank Ansh Radhakrishnan, Buck Shlegeris, and Nicholas Goldowsdky-Dill for their feedback and advice. We thank the Long-Term Future Fund for financial support and Marcel Steimke and Lennart Heim for operational support. Finally, we thank Marius Hobbhahn, Adam Jermyn, and Erik Jenner for valuable discussions and feedback. Introduction Causal Scrubbing (CaSc) is a method to evaluate the accuracy of hypotheses about neural networks and provides a measure of the deviation of a hypothesis from the ground truth. However, CaSc does not guarantee to reject false or incomplete hypotheses. We thus believe that systematic evaluation of CaSc to investigate these limitations is valuable (in addition to evaluating its effectiveness in the wild, as done in most existing work). Hence, we evaluate CaSc to highlight its strengths and weaknesses and explore the reliability of CaSc in confirming correct hypotheses and rejecting incorrect ones. We evaluate the reliability of CaSc on synthetic graphs. While synthetic graphs are less realistic than trained neural networks, we get access to the known ground truth interpretation, which allows us to accurately evaluate our hypotheses. Since CaSc operates on general computational graphs, any results on synthetic graphs also apply to using CaSc on neural networks (although we don't make any claim on how likely the situations we find are to occur in trained neural networks). Our evaluation is based on creating a synthetic graph that solves a specific problem (e.g., sorting a list) and creating an identical interpretation graph (the correct hypothesis). We then perturb the correct interpretation graph to make the hypothesis “worse”. Finally, we evaluate whether CaSc correctly determines the better hypothesis. Ideally, we want the scrubbed loss (the loss induced by applying CaSc) to correlate with the “correctness” of a hypothesis. To determine whether a hypothesis is “better” or “worse”, we introduce the concepts of extensional and intensional equivalence between functions. Extensional equivalent functions have the same input-output behavior; for example, Quicksort and Mergesort are extensionally equivalent as they both sort an input sequence. Intensional equivalent functions are implemented in the same way mechanistically. So once we zoom further in and compare Quicksort and Mergesort algorithmically, we see that they are not intensionally equivalent. This point is already made in the CaSc writeup, and our goal is merely to highlight that in the context of mechanistic interpretability, this is an important distinction that's easy to overlook. In this post,...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Practical Pitfalls of Causal Scrubbing, published by Jérémy Scheurer on March 27, 2023 on LessWrong. TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations of CaSc, in particular, that it cannot falsify all incorrect hypotheses. We provide concrete examples of false positive results with causal scrubbing. Our main finding is that false positives can occur when there is “cancellation”, i.e., CaSc causes the model to do better on some inputs and worse on others, such that, on average, the scrubbed model recovers the full loss. A second practical failure mode is that CaSc cannot detect whether a proposed hypothesis is specific enough, and it cannot distinguish between hypotheses that are extensionally equivalent. We thank Redwood Research for generously supporting this project by providing us with their Causal Scrubbing implementation, access to REMIX materials, and computational resources. We specifically thank Ansh Radhakrishnan, Buck Shlegeris, and Nicholas Goldowsdky-Dill for their feedback and advice. We thank the Long-Term Future Fund for financial support and Marcel Steimke and Lennart Heim for operational support. Finally, we thank Marius Hobbhahn, Adam Jermyn, and Erik Jenner for valuable discussions and feedback. Introduction Causal Scrubbing (CaSc) is a method to evaluate the accuracy of hypotheses about neural networks and provides a measure of the deviation of a hypothesis from the ground truth. However, CaSc does not guarantee to reject false or incomplete hypotheses. We thus believe that systematic evaluation of CaSc to investigate these limitations is valuable (in addition to evaluating its effectiveness in the wild, as done in most existing work). Hence, we evaluate CaSc to highlight its strengths and weaknesses and explore the reliability of CaSc in confirming correct hypotheses and rejecting incorrect ones. We evaluate the reliability of CaSc on synthetic graphs. While synthetic graphs are less realistic than trained neural networks, we get access to the known ground truth interpretation, which allows us to accurately evaluate our hypotheses. Since CaSc operates on general computational graphs, any results on synthetic graphs also apply to using CaSc on neural networks (although we don't make any claim on how likely the situations we find are to occur in trained neural networks). Our evaluation is based on creating a synthetic graph that solves a specific problem (e.g., sorting a list) and creating an identical interpretation graph (the correct hypothesis). We then perturb the correct interpretation graph to make the hypothesis “worse”. Finally, we evaluate whether CaSc correctly determines the better hypothesis. Ideally, we want the scrubbed loss (the loss induced by applying CaSc) to correlate with the “correctness” of a hypothesis. To determine whether a hypothesis is “better” or “worse”, we introduce the concepts of extensional and intensional equivalence between functions. Extensional equivalent functions have the same input-output behavior; for example, Quicksort and Mergesort are extensionally equivalent as they both sort an input sequence. Intensional equivalent functions are implemented in the same way mechanistically. So once we zoom further in and compare Quicksort and Mergesort algorithmically, we see that they are not intensionally equivalent. This point is already made in the CaSc writeup, and our goal is merely to highlight that in the context of mechanistic interpretability, this is an important distinction that's easy to overlook. In this post, we look deep...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Practical Pitfalls of Causal Scrubbing, published by Jérémy Scheurer on March 27, 2023 on LessWrong. TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations of CaSc, in particular, that it cannot falsify all incorrect hypotheses. We provide concrete examples of false positive results with causal scrubbing. Our main finding is that false positives can occur when there is “cancellation”, i.e., CaSc causes the model to do better on some inputs and worse on others, such that, on average, the scrubbed model recovers the full loss. A second practical failure mode is that CaSc cannot detect whether a proposed hypothesis is specific enough, and it cannot distinguish between hypotheses that are extensionally equivalent. We thank Redwood Research for generously supporting this project by providing us with their Causal Scrubbing implementation, access to REMIX materials, and computational resources. We specifically thank Ansh Radhakrishnan, Buck Shlegeris, and Nicholas Goldowsdky-Dill for their feedback and advice. We thank the Long-Term Future Fund for financial support and Marcel Steimke and Lennart Heim for operational support. Finally, we thank Marius Hobbhahn, Adam Jermyn, and Erik Jenner for valuable discussions and feedback. Introduction Causal Scrubbing (CaSc) is a method to evaluate the accuracy of hypotheses about neural networks and provides a measure of the deviation of a hypothesis from the ground truth. However, CaSc does not guarantee to reject false or incomplete hypotheses. We thus believe that systematic evaluation of CaSc to investigate these limitations is valuable (in addition to evaluating its effectiveness in the wild, as done in most existing work). Hence, we evaluate CaSc to highlight its strengths and weaknesses and explore the reliability of CaSc in confirming correct hypotheses and rejecting incorrect ones. We evaluate the reliability of CaSc on synthetic graphs. While synthetic graphs are less realistic than trained neural networks, we get access to the known ground truth interpretation, which allows us to accurately evaluate our hypotheses. Since CaSc operates on general computational graphs, any results on synthetic graphs also apply to using CaSc on neural networks (although we don't make any claim on how likely the situations we find are to occur in trained neural networks). Our evaluation is based on creating a synthetic graph that solves a specific problem (e.g., sorting a list) and creating an identical interpretation graph (the correct hypothesis). We then perturb the correct interpretation graph to make the hypothesis “worse”. Finally, we evaluate whether CaSc correctly determines the better hypothesis. Ideally, we want the scrubbed loss (the loss induced by applying CaSc) to correlate with the “correctness” of a hypothesis. To determine whether a hypothesis is “better” or “worse”, we introduce the concepts of extensional and intensional equivalence between functions. Extensional equivalent functions have the same input-output behavior; for example, Quicksort and Mergesort are extensionally equivalent as they both sort an input sequence. Intensional equivalent functions are implemented in the same way mechanistically. So once we zoom further in and compare Quicksort and Mergesort algorithmically, we see that they are not intensionally equivalent. This point is already made in the CaSc writeup, and our goal is merely to highlight that in the context of mechanistic interpretability, this is an important distinction that's easy to overlook. In this post, we look deep...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two problems with ‘Simulators' as a frame, published by Ryan Greenblatt on February 17, 2023 on The AI Alignment Forum. (Thanks to Lawrence Chan and Buck Shlegeris for comments. Thanks to Nate Thomas for many comments and editing) Despite appreciating and agreeing with various specific points made in the Simulators post, I broadly think that the term ‘simulator' and the corresponding frame probably shouldn't be used. Instead, I think we should just directly reason about predictors and think in terms of questions such as ‘what would the model predict for the next token?'[1] In this post, I won't make arguments that I think are strong enough to decisively justify this claim, but I will argue for two points that support it: The word ‘simulation' as used in the Simulators post doesn't correspond to a single simulation of reality, and a ‘simulacrum' doesn't correspond to an approximation of a single agent in reality. Instead a ‘simulation' corresponds to a distribution over processes that generated the text. This distribution in general contains uncertainty over a wide space of different agents involved in those text generating processes. Systems can be very good at prediction yet very bad at plausible generation – in other words, very bad at ‘running simulations'. The rest of the post elaborates on these claims. I think the author of the Simulators post is aware of these objections. I broadly endorse the perspective in ‘simulator' framing and confusions about LLMs, which also argues against the simulator framing to some extent. For another example of prior work on these two points, see this discussion of models recognizing that they are generating text due to generator discriminator gaps in the Conditioning Predictive Models sequence[2]. Related work Simulators, ‘simulator' framing and confusions about LLMs, Conditioning Predictive Models Language models are predictors, not simulators My main issue with the terms ‘simulator', ‘simulation', ‘simulacra', etc is that a language model ‘simulating a simulacrum' doesn't correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”. Let's go through an example: Suppose we prompt the model with “endoftextNEW YORK—After John McCain was seen bartending at a seedy nightclub”. I'd claim the model's next token prediction will involve uncertainty over the space of all the different authors which could have written this passage, as well as all the possible newspapers, etc. It presumably can't internally represent the probability of each specific author and newspaper, though I expect bigger models will latently have an estimate for the probability that text like this was written by particularly prolific authors with particularly distinctive styles as well as a latent estimate for particular sites. In this case, code-davinci-002 is quite confident this prompt comes from The Onion[3]. In practice, I think it's tempting to think of a model as running a particular simulation of reality, but performing well at the objective of next-token prediction doesn't result in the output you would get from a single, particular simulation. In the previous example, the model might be certain that the piece is from The Onion after it's generated many tokens, but it's presumably not sure which author at the Onion wrote it or what the publication date is. Pretrained models don't ‘simulate a character speaking'; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next. I've seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn't change t...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: One-layer transformers aren't equivalent to a set of skip-trigrams, published by Buck Shlegeris on February 17, 2023 on The AI Alignment Forum. (thanks to Tao Lin and Ryan Greenblatt for pointing this out, and to Arthur Conmy, Jenny Nitishinskaya, Thomas Huck, Neel Nanda, and Lawrence Chan, Ben Toner, and Chris Olah for comments, and many others for useful discussion.) In “A Mathematical Framework for Transformer Circuits”, Elhage et al write (among similar sentences): One layer attention-only transformers are an ensemble of bigram and “skip-trigram” (sequences of the form "A. B C") models. The bigram and skip-trigram tables can be accessed directly from the weights, without running the model. I interpret this as a mathematical claim–that the attention layer of a one-layer transformer can be mathematically rewritten as a set of skip-trigrams, and that you can understand the models by reading these skip-trigrams off the model weights (and also reading the bigrams off the embed and unembed matrices, as described in the zero-layer transformer section – I agree with this part). But this is false: One-layer transformers are more expressive than skip-trigrams, so you can't understand them by transforming them into a set of skip-trigrams. Also, even if a particular one-layer transformer is actually only representing skip-trigrams and bigrams, you still can't read these off the weights without reference to the data distribution. The difference between skip-trigrams and one-layer transformers is that when attention heads attend more to one token, they attend less to another token. This means that even single attention heads can implement nonlinear interactions between tokens earlier in the context. In this post, I'll demonstrate that one-layer attention-only transformers are more expressive than a set of skip-trigrams, then I'll tell an intuitive story for why I disagree with Elhage et al's claim that one-layer attention-only transformers can be put in a form where “all parameters are contextualized and understandable”. (Elhage et al say in a footnote, “Technically, [the attention pattern] is a function of all possible source tokens from the start to the destination token, as the softmax calculates the score for each via the QK circuit, exponentiates and then normalizes”, but they don't refer to this fact further.) An example of a task that is impossible for skip-trigrams but is expressible with one-layer attention-only transformers Consider the task of predicting the 4th character from the first 3 characters in a case where there are only 4 strings: ACQTADQFBCQFBDQT So the strings are always: A or B C or D Q The xor of the first character being A and the second being D, encoded as T or F. This can't be solved with skip-trigrams A skip-trigram (in the sense that Elhage et al are using it) looks at the current token and an earlier token and returns a logit contribution for every possible next token. That is, it's a pattern of the form .....X........Y -> Z where you update towards or away from the next token being Z based on the fact that the current token is Y and the token X appeared at a particular location earlier in the context. (Sometimes the term “skip-trigram” is used to include patterns where Y isn't immediately before Z. Elhage et al are using this definition because in their context of autoregressive transformers, the kind of trigrams that you can encode involve Y and Z being neighbors.) In the example I gave here, skip-trigrams can't help, because the probability that the next token after Q is T is 50% after conditioning on the presence of any single earlier token. This can be solved by a one-layer, two-headed transformer We can solve this problem with a one-layer transformer with two heads. The first attention head has the following behavior, when attending from th...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Two problems with ‘Simulators' as a frame, published by ryan greenblatt on February 17, 2023 on LessWrong. (Thanks to Lawrence Chan and Buck Shlegeris for comments. Thanks to Nate Thomas for many comments and editing) Despite appreciating and agreeing with various specific points made in the Simulators post, I broadly think that the term ‘simulator' and the corresponding frame probably shouldn't be used. Instead, I think we should just directly reason about predictors and think in terms of questions such as ‘what would the model predict for the next token?'[1] In this post, I won't make arguments that I think are strong enough to decisively justify this claim, but I will argue for two points that support it: The word ‘simulation' as used in the Simulators post doesn't correspond to a single simulation of reality, and a ‘simulacrum' doesn't correspond to an approximation of a single agent in reality. Instead a ‘simulation' corresponds to a distribution over processes that generated the text. This distribution in general contains uncertainty over a wide space of different agents involved in those text generating processes. Systems can be very good at prediction yet very bad at plausible generation – in other words, very bad at ‘running simulations'. The rest of the post elaborates on these claims. I think the author of the Simulators post is aware of these objections. I broadly endorse the perspective in ‘simulator' framing and confusions about LLMs, which also argues against the simulator framing to some extent. For another example of prior work on these two points, see this discussion of models recognizing that they are generating text due to generator discriminator gaps in the Conditioning Predictive Models sequence[2]. Related work Simulators, ‘simulator' framing and confusions about LLMs, Conditioning Predictive Models Language models are predictors, not simulators My main issue with the terms ‘simulator', ‘simulation', ‘simulacra', etc is that a language model ‘simulating a simulacrum' doesn't correspond to a single simulation of reality, even in the high-capability limit. Instead, language model generation corresponds to a distribution over all the settings of latent variables which could have produced the previous tokens, aka “a prediction”. Let's go through an example: Suppose we prompt the model with “endoftextNEW YORK—After John McCain was seen bartending at a seedy nightclub”. I'd claim the model's next token prediction will involve uncertainty over the space of all the different authors which could have written this passage, as well as all the possible newspapers, etc. It presumably can't internally represent the probability of each specific author and newspaper, though I expect bigger models will latently have an estimate for the probability that text like this was written by particularly prolific authors with particularly distinctive styles as well as a latent estimate for particular sites. In this case, code-davinci-002 is quite confident this prompt comes from The Onion[3]. In practice, I think it's tempting to think of a model as running a particular simulation of reality, but performing well at the objective of next-token prediction doesn't result in the output you would get from a single, particular simulation. In the previous example, the model might be certain that the piece is from The Onion after it's generated many tokens, but it's presumably not sure which author at the Onion wrote it or what the publication date is. Pretrained models don't ‘simulate a character speaking'; they predict what comes next, which implicitly involves making predictions about the distribution of characters and what they would say next. I've seen this mistake made frequently – for example, see this post (note that in this case the mistake doesn't change the conclusion...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Toy Models and Tegum Products, published by Adam Jermyn on November 4, 2022 on The AI Alignment Forum. (Thanks to Adam Scherlis, Kshitij Sachan, Buck Shlegeris, Chris Olah, and Nicholas Schiefer for conversations that informed this post.) Anthropic recently published a paper on toy models of superposition [Elhage+22]. One of the most striking results is that, when features are sparse, feature embeddings can become divided into disjoint subspaces with just a few vectors per subspace. This type of decomposition is known as a tegum product. This post aims to give some intuition for why tegum products are a natural feature of the minima of certain loss functions. Setup Task Suppose we've got d embedding dimensions and n>d features. We want to embed features into dimensions in a way that minimizes overlap between their embedding vectors. In the simplest case this could be because we're building an autoencoder and want to compress the features into a low-dimensional space. One approach is to encode all of the features in all of the dimensions. With this approach there is some interference between every pair of features (i.e. no pair is embedded in a fully orthogonal way), but we have a lot of degrees of freedom that we can use to minimize this interference. Another approach is to split the d dimensions into k orthogonal subspaces of d/k dimensions. This has the advantage of making most pairs of vectors exactly orthogonal, but at the cost that some vectors are packed more closely together. In the limit where k=1 this reduces to the first approach. Our aim is to figure out the k that minimizes the loss on this task. Loss Suppose our loss has the following properties: L=∑i≠jℓ(cosθij). That is, the loss decomposes into a sum of terms involving the cosine similarities of feature vectors, and all features are equally important. ℓ(0)=0. The loss vanishes for orthogonal vectors. dℓ/dcosθ>0. The loss is greater the more the vectors overlap. Using these properties, we find that the loss is roughly where ϵ is the typical cosine similarity between vectors in a subspace. Loss-Minimizing Subspaces The Johnson-Lindenstrauss lemma says that we can pack m nearly-orthogonal vectors into D dimensions, with mutual angles satisfying where and ϵ0 is a constant. Setting m=n/k and D=d/k gives Assuming we pick our vectors optimally to saturate the Johnson-Lindenstrauss bound, we can substitute this for ϵ in the loss and differentiate with respect to k to find There are three possible cases: either the minimum occurs at k=d (the greatest value it can take), or at k=1 (the smallest value it can take) or at some point in between where dL/dk vanishes. The derivative vanishes if which gives where When α≥2 there is no place where the derivative vanishes, and the optimum is k=1. Otherwise there is an optimum at so long as this is less than d. If it reaches d, the optimum sticks to k=d. Interpretation We can think of α as the sensitivity of the loss to interference. Specifically, which moment of the interference distribution do we care about? When α is large, we care more about decreasing higher moments, and in the limit of infinite α what matters is just the maximum interference between vectors. Hence when α is large we want to have fewer subspaces, each with more vectors but smaller cosine similarities. By contrast, when α is small, we care more about decreasing smaller moments, and in the limit as α0 what matters is the fraction of vectors that interfere at all. Hence when α is small we want to have more subspaces, each with fewer vectors but larger cosine similarities. So tegum products are preferred when we can tolerate larger “peak” interference and want fewer instances of interference, whereas a single large subspace is preferred when we can tolerate lots of instances of interference and want to mi...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Polysemanticity and Capacity in Neural Networks, published by Buck Shlegeris on October 7, 2022 on The AI Alignment Forum. Elhage et al at Anthropic recently published a paper, Toy Models of Superposition (previous Alignment Forum discussion here) exploring the observation that in some cases, trained neural nets represent more features than they “have space for”--instead of choosing one feature per direction available in their embedding space, they choose more features than directions and then accept the cost of “interference”, where these features bleed over into each other. (See the SoLU paper for more on the Anthropic interpretability team's take on this.) We (Kshitij Sachan, Adam Scherlis, Adam Jermyn, Joe Benton, Jacob Steinhardt, and I) recently uploaded an Arxiv paper, Polysemanticity and Capacity in Neural Networks, building on that research. In this post, we'll summarize the key idea of the paper. We analyze this phenomenon by thinking about the model's training as a constrained optimization process, where the model has a fixed total amount of capacity that can be allocated to different features, such that each feature can be ignored, purely represented (taking up one unit of capacity), or impurely represented (taking up some amount of capacity between zero and one units). When the model purely represents a feature, that feature gets its own full dimension in embedding space, and so can be represented monosemantically; impurely represented features share space with other features, and so the dimensions they're represented in are polysemantic. For each feature, we can plot the marginal benefit of investing more capacity into representing that feature, as a function of how much it's currently represented. Here we plot six cases, where one feature's marginal benefit curve is represented in blue and the other in black. These graphs show a variety of different possible marginal benefit curves. In A and B, the marginal returns are increasing–the more you allocate capacity to a feature, the more strongly you want to allocate more capacity to it. In C, the marginal returns are constant (and in this graph they happen to be equal for the two features, but there's no reason why constant marginal returns imply equal returns in general). And then in D, E, and F, there are diminishing marginal returns. Now, we can make the observation that, like in any budget allocation problem, if capacities are allocated optimally, the marginal returns for allocating capacity to any feature must be equal (if the feature isn't already maximally or minimally represented). Otherwise, we'd want to take capacity away from features with lower dLi/dCi and give it to features with higher dLi/dCi . In the case where we have many features with diminishing marginal returns, capacity will in general be allocated like this: Here the circles represent the optimal capacity allocation for a particular total capacity. This perspective suggests that polysemanticity will arise when there are diminishing marginal returns to capacity allocated to particular features (as well as in some other situations that we think are less representative of what goes on in networks). In our paper, we do the following: We describe a particularly analytically tractable model which exhibits polysemanticity. We analyze our model using the capacity framework, derive the above graphs, and replicate them numerically. Elhage et al empirically observe that increasing the sparsity increases superposition. We analytically ground this observation by decomposing our toy model's loss into two terms that directly correspond to the benefits and costs of superposition. As the input data becomes sparser, the kurtosis increases, and as the kurtosis increases the loss more heavily favors superposition. We explain the phase transitions observ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022, published by Sam Bowman on September 1, 2022 on LessWrong. Getting into AI safety involves working with a mix of communities, subcultures, goals, and ideologies that you may not have encountered in the context of mainstream AI technical research. This document attempts to briefly map these out for newcomers. This is inevitably going to be biased by what sides of these communities I (Sam) have encountered, and it will quickly become dated. I expect it will still be a useful resource for some people anyhow, at least in the short term. AI Safety/AI Alignment/AGI Safety/AI Existential Safety/AI X-Risk The research project of ensuring that future AI progress doesn't yield civilization-endingly catastrophic results. Good intros: Carlsmith Report What misalignment looks like as capabilities scale Vox piece Why are people concerned about this? My rough summary: It's plausible that future AI systems could be much faster or more effective than us at real-world reasoning and planning. Probably not plain generative models, but possibly models derived from generative models in cheap ways Once you have a system with superhuman reasoning and planning abilities, it's easy to make it dangerous by accident. Most simple objective functions or goals become dangerous in the limit, usually because of secondary or instrumental subgoals that emerge along the way. Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed. Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed. Toy example: Computing pi to an arbitrarily high precision eventually requires that you spend all the sun's energy output on computing. Knowledge and values are likely to be orthogonal: A model could know human values and norms well, but not have any reason to act on them. For agents built around generative models, this is the default outcome. Sufficiently powerful AI systems could look benign in pre-deployment training/research environments, because they would be capable of understanding that they're not yet in a position to accomplish their goals. Simple attempts to work around this (like the more abstract goal ‘do what your operators want') don't tend to have straightforward robust implementations. If such a system were single-mindedly pursuing a dangerous goal, we probably wouldn't be able to stop it. Superhuman reasoning and planning would give models with a sufficiently good understanding of the world many ways to effectively gain power with nothing more than an internet connection. (ex: Cyberattacks on banks.) Consensus within the field is that these risks could become concrete within ~4–25 years, and have a >10% chance of being leading to a global catastrophe (i.e., extinction or something comparably bad). If true, it's bad news. Given the above, we either need to stop all development toward AGI worldwide (plausibly undesirable or impossible), or else do three possible-but-very-difficult things: (i) build robust techniques to align AGI systems with the values and goals of their operators, (ii) ensure that those techniques are understood and used by any group that could plausibly build AGI, and (iii) ensure that we're able to govern the operators of AGI systems in a way that makes their actions broadly positive for humanity as a whole. Does this have anything to do with sentience or consciousness? No. Influential people and institutions: Present core community as I see it: Paul Christiano, Jacob Steinhardt, Ajeya Cotra, Jared Kaplan, Jan Leike, Beth Barnes, Geoffrey Irving, Buck Shlegeris, David Krueger, Chris Olah, Evan Hubinger, Richard Ngo, Rohin Shah; ARC, R...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022, published by Sam Bowman on September 1, 2022 on LessWrong. Getting into AI safety involves working with a mix of communities, subcultures, goals, and ideologies that you may not have encountered in the context of mainstream AI technical research. This document attempts to briefly map these out for newcomers. This is inevitably going to be biased by what sides of these communities I (Sam) have encountered, and it will quickly become dated. I expect it will still be a useful resource for some people anyhow, at least in the short term. AI Safety/AI Alignment/AGI Safety/AI Existential Safety/AI X-Risk The research project of ensuring that future AI progress doesn't yield civilization-endingly catastrophic results. Good intros: Carlsmith Report What misalignment looks like as capabilities scale Vox piece Why are people concerned about this? My rough summary: It's plausible that future AI systems could be much faster or more effective than us at real-world reasoning and planning. Probably not plain generative models, but possibly models derived from generative models in cheap ways Once you have a system with superhuman reasoning and planning abilities, it's easy to make it dangerous by accident. Most simple objective functions or goals become dangerous in the limit, usually because of secondary or instrumental subgoals that emerge along the way. Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed. Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed. Toy example: Computing pi to an arbitrarily high precision eventually requires that you spend all the sun's energy output on computing. Knowledge and values are likely to be orthogonal: A model could know human values and norms well, but not have any reason to act on them. For agents built around generative models, this is the default outcome. Sufficiently powerful AI systems could look benign in pre-deployment training/research environments, because they would be capable of understanding that they're not yet in a position to accomplish their goals. Simple attempts to work around this (like the more abstract goal ‘do what your operators want') don't tend to have straightforward robust implementations. If such a system were single-mindedly pursuing a dangerous goal, we probably wouldn't be able to stop it. Superhuman reasoning and planning would give models with a sufficiently good understanding of the world many ways to effectively gain power with nothing more than an internet connection. (ex: Cyberattacks on banks.) Consensus within the field is that these risks could become concrete within ~4–25 years, and have a >10% chance of being leading to a global catastrophe (i.e., extinction or something comparably bad). If true, it's bad news. Given the above, we either need to stop all development toward AGI worldwide (plausibly undesirable or impossible), or else do three possible-but-very-difficult things: (i) build robust techniques to align AGI systems with the values and goals of their operators, (ii) ensure that those techniques are understood and used by any group that could plausibly build AGI, and (iii) ensure that we're able to govern the operators of AGI systems in a way that makes their actions broadly positive for humanity as a whole. Does this have anything to do with sentience or consciousness? No. Influential people and institutions: Present core community as I see it: Paul Christiano, Jacob Steinhardt, Ajeya Cotra, Jared Kaplan, Jan Leike, Beth Barnes, Geoffrey Irving, Buck Shlegeris, David Krueger, Chris Olah, Evan Hubinger, Richard Ngo, Rohin Shah; ARC, R...
Buck Shlegeris is the CTO of Redwood Research. Buck previously worked at MIRI, studied computer science and physics at the Australian National University and grew up in Townsville, Australia.This talk was first published by the Stanford Existential Risks Initiative. Click here to view it with the video.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Encultured AI, Part 1 Appendix: Relevant Research Examples, published by Andrew Critch on August 8, 2022 on The AI Alignment Forum. Also available on the EA Forum.Appendix to: Encultured AI, Part 1: Enabling New BenchmarksFollowed by: Encultured AI, Part 2: Providing a Service Appendix 1: “Trending” AI x-safety research areas We mentioned a few areas of “trending” AI x-safety research above; below are some more concrete examples of what we mean: Trustworthiness & truthfulness: Owain Evans, Owen Cotton-Barratt and others have authored “Truthful AI: Developing and governing AI that does not lie” (arxiv, 2021; twitter thread). Andreas Stuhlmüller, Jungwon Byun and others at Ought.org are building an AI-powered research assistant called Elicit (website); here is the product:. Task-specific (narrow) preference learning: Paul Christiano et al (arxiv, 2017) developed a data-efficient preference-learning technique for training RL-based systems, which is now very widely cited (scholar). Jan Leike, now at OpenAI, leads a team working on ‘scalable alignment' using preference-learning techniques (arxiv, 2018) (blog). Interpretability: Chris Olah (scholar) leads an interpretability research group at Anthropic. Anthropic (website) is culturally very attuned to large-scale risks from AI, including existential risks. Buck Shlegeris and others at Redwood Research (website) have built an interpretability tool for analyzing transformer networks trained on natural language (demo). Prof. Cynthia Rudin at Duke (homepage) approaches interpretability by trying to replace black-box models with more interpretable ones (arxiv, 2018), and we know from conversations with her that she is open to applications of her work to existential safety. Robustness & risk management: Prof. Jaime Fisac at Princeton (homepage) researches AI safety for robotics, high-dimensional control systems and multi-agent systems (scholar), including provable robustness guarantees. He was previously a PhD student at the UC Berkeley Center for Human-Compatible AI (CHAI), provided extensive feedback on AI Research Considerations for Human Existential Safety (ARCHES) (arxiv, 2020), and is very attuned to existential safety as a cause area. Prof. David Krueger at Cambridge (scholar) studies out-of-distribution generalization (pdf, 2021), and is currently taking on students. Adam Gleave (homepage) is a final-year PhD student at CHAI / UC Berkeley, and studies out-of-distribution robustness for deep RL. Sam Toyer (scholar), also a PhD student at CHAI, has developed a benchmark for robust imitation learning (pdf, 2020). Appendix 2: “Emerging” AI x-safety research areas In this post, we classified cooperative AI and multi-stakeholder control of AI systems as “emerging” topics in AI x-safety. Here's more about what we mean, and why: Cooperative AI This area is “emerging” in x-safety because there's plenty of attention to the issue of cooperation from both policy-makers and AI researchers, but not yet much among folks focused on x-risk. Existential safety attention on cooperative AI: Many authors — too many to name! — have remarked on the importance of international coordination on AI safety efforts, including existential safety. For instance, there is a Wikipedia article on AI arms races (wikipedia). This covers the human–human side of the cooperative AI problem. AI research on cooperative AI: Multi-agent systems research has a long history in AI (scholar search), as does multi-agent reinforcement learning (scholar search). DeepMind's Multi-agent Learning team has recently written number papers examining competition and cooperation between artificial agents (website). OpenAI has done some work on multi-agent interaction, e.g. emergent tool use in multi-agent interaction (arxiv). Prof. Jakob Foerster at Oxford (scholar search), and ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Infrastructure Fund: September–December 2021 grant recommendations, published by Max Daniel on July 12, 2022 on The Effective Altruism Forum. Introduction The EA Infrastructure Fund made the following grants between September and December 2021: Total grants: $2,445,941 Number of grants: 68 Acceptance rate: 78.5% This was somewhat, though not massively higher than usual. E.g., in both the previous and subsequent 4-month periods the acceptance rate was just under 71%. We suspect the unusually high acceptance rate during the period covered by this report was mostly due to an unusually large fraction of applications from EA university groups, which tend to have a higher acceptance rate than other applications we receive. Payout date: September–December 2021 Report authors: Max Daniel (chair), Michelle Hutchinson, Buck Shlegeris, Emma Williamson (assistant fund manager), Michael Aird (guest manager), Chi Nguyen (guest manager) In addition, we referred 5 grants totalling an additional $261,530 to private funders. See the "Referred Grants" section below. For capacity reasons, we only provide an abbreviated strategic update that doesn't include all relevant recent and upcoming developments at the EAIF and EA Funds more generally. A more comprehensive update will follow within the next few months. The EA Funds donation platform is moving to Giving What We Can. We are working on a new process for sharing information about the grants we are making that will reduce the delay between our grant decisions and their public reporting. Therefore, this is likely going to be the last EAIF payout report in the familiar format. We appointed Peter Wildeford as fund manager. Like other fund managers, Peter may evaluate any grant we receive, but we are especially excited about his expertise and interest in global health and wellbeing and animal welfare, which we think complements the more longtermism-focused profile of the other fund managers. Ashley Lin and Olivia Jimenez joined the EAIF as new guest managers. Chi Nguyen and Michael Aird left after their guest manager tenure, though Michael will still occasionally handle applications he is especially well placed to evaluate (for instance because he evaluated an application by the same applicant in the past). Olivia has since left as well to focus on other EA-related projects. We appointed Joan Gass and Catherine Low as fund advisers to increase coordination with the CEA Groups and Community Health teams, respectively. Joan Gass has since resigned from her role as adviser to focus on other EA-related projects, and we added Rob Gledhill and Jesse Rothman (both at CEA) as advisers instead. Would you like to get funded? Apply for funding. Highlights We continue to be very excited about the Global Challenges Project (GCP) led by Emma Abele, James Aung, and Henry Sleight. We supported GCP with $174,000 to cover the salaries of their core team as well as salaries and expenses for university groups managed by GCP. We first funded the GCP team in May 2021 and are impressed by their achievements to date, which include running an Oxford-based summer program for university group organizers, taking on EA Books Direct, and advice and other services for university groups. We were impressed with the track record, plans, and team of the Norwegian effective giving website Gi Effektivt, and supported them with a grant of $270,500 that exceeded their original ask. We funded Luca Parodi ($20,000), creator of the Instagram account School of Thinking, to produce additional Italian-language content on EA, longtermism, and rationality. We haven't reviewed the content Luca created since we made the grant, but we were excited to fund experiments with content creation on platforms with thus far little EA-related content. Grant Recipients In addition to the grants d...
https://www.lesswrong.com/posts/2GxhAyn9aHqukap2S/looking-back-on-my-alignment-phd The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four[1] years. Mistakes I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD. Social dynamics distracted me from my core mission I focused on "catching up" to other thinkers I figured this point out by summer 2021. I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge. But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area. If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness. But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Looking back on my alignment PhD, published by TurnTrout on July 1, 2022 on LessWrong. The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four years. Mistakes I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD. Social dynamics distracted me from my core mission I focused on "catching up" to other thinkers I figured this point out by summer 2021. I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge. But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area. If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness. But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift. But there is not yet any "rationality tech tree" set of well-defined prerequisite rationality skills such that you can learn them in order and grow way stronger. Like, you can't just do the calibration exercises, and then the noticing-confusion exercises, and then other things. Those tools help, but they aren't enough. There won't be a clear and immediate direction of improvement, at first. But you may want to get stronger anyways. I focused on seeming smart and defensible I figured this point out this spring. When I started working on alignment, I didn't know what to do at first, and I felt insecure about my credentials. As far as I remember, I figured I'd start off by becoming respected, since other people's feedback was initially a better guide than my own taste. Unfortunately, I didn't realize how deeply and subtly this goal would grow its roots. I worried about upvotes, I worried about winning arguments, I worried about being defensible against criticism. I was so worried that someone would comment on one of my posts and tear everything down, because I hadn't been careful enough, because I had left myself open by not dotting all my 'i's. (Not that anyone has ever done that on LessWrong before...) I think it was this year that I had my (second) "oh man, don't forget the part where everyone is allowed to die to AI" moment. To illustrate the new mindset this gut-realization gave me, I'll detail a recent decision with social consequences, and then compare the old and the new mindsets. A few months back, Quintin Pope approached me with (what he claimed to be) a new alignment paradigm, which blossomed from asking the following kind of questions: We clearly prefer future AIs to generalize in the way that neuroscientists generalize, so it seems worthwhile to ask: "why don't neuroscientists wirehead themselves?" It's clearly not because humans evolved away...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Looking back on my alignment PhD, published by TurnTrout on July 1, 2022 on LessWrong. The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four years. Mistakes I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD. Social dynamics distracted me from my core mission I focused on "catching up" to other thinkers I figured this point out by summer 2021. I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge. But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area. If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness. But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift. But there is not yet any "rationality tech tree" set of well-defined prerequisite rationality skills such that you can learn them in order and grow way stronger. Like, you can't just do the calibration exercises, and then the noticing-confusion exercises, and then other things. Those tools help, but they aren't enough. There won't be a clear and immediate direction of improvement, at first. But you may want to get stronger anyways. I focused on seeming smart and defensible I figured this point out this spring. When I started working on alignment, I didn't know what to do at first, and I felt insecure about my credentials. As far as I remember, I figured I'd start off by becoming respected, since other people's feedback was initially a better guide than my own taste. Unfortunately, I didn't realize how deeply and subtly this goal would grow its roots. I worried about upvotes, I worried about winning arguments, I worried about being defensible against criticism. I was so worried that someone would comment on one of my posts and tear everything down, because I hadn't been careful enough, because I had left myself open by not dotting all my 'i's. (Not that anyone has ever done that on LessWrong before...) I think it was this year that I had my (second) "oh man, don't forget the part where everyone is allowed to die to AI" moment. To illustrate the new mindset this gut-realization gave me, I'll detail a recent decision with social consequences, and then compare the old and the new mindsets. A few months back, Quintin Pope approached me with (what he claimed to be) a new alignment paradigm, which blossomed from asking the following kind of questions: We clearly prefer future AIs to generalize in the way that neuroscientists generalize, so it seems worthwhile to ask: "why don't neuroscientists wirehead themselves?" It's clearly not because humans evolved away...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Looking back on my alignment PhD, published by TurnTrout on July 1, 2022 on LessWrong. The funny thing about long periods of time is that they do, eventually, come to an end. I'm proud of what I accomplished during my PhD. That said, I'm going to first focus on mistakes I've made over the past four years. Mistakes I think I got significantly smarter in 2018–2019, and kept learning some in 2020–2021. I was significantly less of a fool in 2021 than I was in 2017. That is important and worth feeling good about. But all things considered, I still made a lot of profound mistakes over the course of my PhD. Social dynamics distracted me from my core mission I focused on "catching up" to other thinkers I figured this point out by summer 2021. I wanted to be more like Eliezer Yudkowsky and Buck Shlegeris and Paul Christiano. They know lots of facts and laws about lots of areas (e.g. general relativity and thermodynamics and information theory). I focused on building up dependencies (like analysis and geometry and topology) not only because I wanted to know the answers, but because I felt I owed a debt, that I was in the red until I could at least meet other thinkers at their level of knowledge. But rationality is not about the bag of facts you know, nor is it about the concepts you have internalized. Rationality is about how your mind holds itself, it is how you weigh evidence, it is how you decide where to look next when puzzling out a new area. If I had been more honest with myself, I could have nipped the "catching up with other thinkers" mistake in 2018. I could have removed the bad mental habits using certain introspective techniques; or at least been aware of the badness. But I did not, in part because the truth was uncomfortable. If I did not have a clear set of prerequisites (e.g. analysis and topology and game theory) to work on, I would not have a clear and immediate direction of improvement. I would have felt adrift. But there is not yet any "rationality tech tree" set of well-defined prerequisite rationality skills such that you can learn them in order and grow way stronger. Like, you can't just do the calibration exercises, and then the noticing-confusion exercises, and then other things. Those tools help, but they aren't enough. There won't be a clear and immediate direction of improvement, at first. But you may want to get stronger anyways. I focused on seeming smart and defensible I figured this point out this spring. When I started working on alignment, I didn't know what to do at first, and I felt insecure about my credentials. As far as I remember, I figured I'd start off by becoming respected, since other people's feedback was initially a better guide than my own taste. Unfortunately, I didn't realize how deeply and subtly this goal would grow its roots. I worried about upvotes, I worried about winning arguments, I worried about being defensible against criticism. I was so worried that someone would comment on one of my posts and tear everything down, because I hadn't been careful enough, because I had left myself open by not dotting all my 'i's. (Not that anyone has ever done that on LessWrong before...) I think it was this year that I had my (second) "oh man, don't forget the part where everyone is allowed to die to AI" moment. To illustrate the new mindset this gut-realization gave me, I'll detail a recent decision with social consequences, and then compare the old and the new mindsets. A few months back, Quintin Pope approached me with (what he claimed to be) a new alignment paradigm, which blossomed from asking the following kind of questions: We clearly prefer future AIs to generalize in the way that neuroscientists generalize, so it seems worthwhile to ask: "why don't neuroscientists wirehead themselves?" It's clearly not because humans evolved away...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A transparency and interpretability tech tree, published by Evan Hubinger on June 16, 2022 on The AI Alignment Forum. Thanks to Chris Olah, Neel Nanda, Kate Woolverton, Richard Ngo, Buck Shlegeris, Daniel Kokotajlo, Kyle McDonell, Laria Reynolds, Eliezer Yudkowksy, Mark Xu, and James Lucassen for useful comments, conversations, and feedback that informed this post. The more I have thought about AI safety over the years, the more I have gotten to the point where the only worlds I can imagine myself actually feeling good about humanity's chances are ones in which we have powerful transparency and interpretability tools that lend us insight into what our models are doing as we are training them.[1] Fundamentally, that's because if we don't have the feedback loop of being able to directly observe how the internal structure of our models changes based on how we train them, we have to essentially get that structure right on the first try—and I'm very skeptical of humanity's ability to get almost anything right on the first try, if only just because there are bound to be unknown unknowns that are very difficult to predict in advance. Certainly, there are other things that I think are likely to be necessary for humanity to succeed as well—e.g. convincing leading actors to actually use such transparency techniques, having a clear training goal that we can use our transparency tools to enforce, etc.—but I currently feel that transparency is the least replaceable necessary condition and yet the one least likely to be solved by default. Nevertheless, I do think that it is a tractable problem to get to the point where transparency and interpretability is reliably able to give us the sort of insight into our models that I think is necessary for humanity to be in a good spot. I think many people who encounter transparency and interpretability, however, have a hard time envisioning what it might look like to actually get from where we are right now to where we need to be. Having such a vision is important both for enabling us to better figure out how to make that vision into reality and also for helping us tell how far along we are at any point—and thus enabling us to identify at what point we've reached a level of transparency and interpretability that we can trust it to reliably solve different sorts of alignment problems. The goal of this post, therefore, is to attempt to lay out such a vision by providing a “tech tree” of transparency and interpretability problems, with each successive problem tackling harder and harder parts of what I see as the core difficulties. This will only be my tech tree, in terms of the relative difficulties, dependencies, and orderings that I expect as we make transparency and interpretability progress—I could, and probably will, be wrong in various ways, and I'd encourage others to try to build their own tech trees to represent their pictures of progress as well. Some important distinctions Before I get into the actual tech tree, however, I want to go over a couple of distinctions that I'll be leaning on between different types of transparency and interpretability. First is my transparency trichotomy, which lays out three different ways via which we might be able to get transparent models:[2] Inspection transparency: use transparency tools to understand M via inspecting the trained model. Training[-enforced] transparency: incentivize M to be as transparent as possible as part of the training process. Architectural transparency: structure M's architecture such that it is inherently more transparent. For the purposes of this post, I'm mostly going to be ignoring architectural transparency—that's because, while it's definitely something I think we should do, I don't expect it to be able to get us to the sort of transparency I think we'll eventually need....
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I applied for a MIRI job in 2020. Here's what happened next., published by ViktoriaMalyasova on June 15, 2022 on LessWrong. I recently saw Rob Bensinger asking what MIRI could do to improve hiring. I tried to get a job at MIRI in February 2020 and I have some feedback to give. I am sorry, because I am about to say some harsh things, but I think they might be important for you to hear, and my previous more low-key attempts to point out problems had little to no effect. tl,dr: MIRI promised to contact me once the pandemic no longer prevents them from hiring people. They never did. What happened I applied for a Software Engineer role at MIRI on February 26th. I hoped to end up as a researcher eventually, but this position seemed like the easiest way in for a foreigner. The first stage was a multiple choice test from Triplebyte. I took it on March 2nd. After a week, on the 9th of March (why does it take a week to grade a multiple choice quiz?!), I got the following letter from Triplebyte: Hi Viktoriya, Triplebyte partners with Machine Intelligence Research Institute to help candidates that go through their process find the job that's right for them. If the opportunity with Machine Intelligence Research Institute is not the right fit, we believe we can help you find a great fit somewhere else. Seeing as you did particularly well on Machine Intelligence Research Institute's assessment, we are going to fast-track you to the final stage of our application process! All that is left is our technical interview. If you pass, you will be matched with top tech companies and fast-tracked to final round interviews. [...] This gave me the impression that I passed the quiz, so I was surprised to get a rejection letter from Buck Shlegeris on the same day. The letter left me with a way forward: Buck sent me a list of ways to impress him and get back into the interview. The tasks looked about a week of work each. I wondered if doing one is worth the opportunity cost. I lived in Russia at the time. Russia was fighting two wars, in Syria and Ukraine. I think it's fair to estimate that all marginal tax payments were spent on war. It didn't cost much to hire an additional soldier to fight in Ukraine. There's a word "15-тысячники" for a mercenary hired to fight for DNR for 15000 rubles a month. I easily paid that sum monthly in taxes. Staying in Russia was costing human lives. Once I stopped trying to overthrow the government the normal way, there was no reason to stay. Wasn't it better to spend the time applying to other US companies then, to increase the chances of landing a job before the H-1B application deadline? I thought about it and decided to go for MIRI. I didn't want to spend time on companies that weren't doing alignment research. I was reasonably confident I'd pass the interview if I was able to get in. (I think this was justified. I interviewed for a FAANG company last year and passed.) So I solved a task and sent it to Buck while reminding about the H-1B application deadline. He said he'd get back to me about further steps. The H-1B application deadline passed. There was no reply. I later discovered that MIRI can actually apply for H-1B all year round. Well, at least that was true 6 years ago. Buck never told me about this. If I knew, I could have applied for another job. The ML living library position was still open (at least that's what the job page said), and I'd being working in ML for 2 years. Two months later I got a letter from Buck. He said that MIRI applications process is messed up by COVID, and that he put me on the list of people to contact when things are starting up again. I asked Buck if MIRI does green card sponsorship and he said he isn't sure how all of that works. I asked who I should contact to find out, and got no reply. This is weird, how can an interviewe...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: I applied for a MIRI job in 2020. Here's what happened next., published by ViktoriaMalyasova on June 15, 2022 on LessWrong. I recently saw Rob Bensinger asking what MIRI could do to improve hiring. I tried to get a job at MIRI in February 2020 and I have some feedback to give. I am sorry, because I am about to say some harsh things, but I think they might be important for you to hear, and my previous more low-key attempts to point out problems had little to no effect. tl,dr: MIRI promised to contact me once the pandemic no longer prevents them from hiring people. They never did. What happened I applied for a Software Engineer role at MIRI on February 26th. I hoped to end up as a researcher eventually, but this position seemed like the easiest way in for a foreigner. The first stage was a multiple choice test from Triplebyte. I took it on March 2nd. After a week, on the 9th of March (why does it take a week to grade a multiple choice quiz?!), I got the following letter from Triplebyte: Hi Viktoriya, Triplebyte partners with Machine Intelligence Research Institute to help candidates that go through their process find the job that's right for them. If the opportunity with Machine Intelligence Research Institute is not the right fit, we believe we can help you find a great fit somewhere else. Seeing as you did particularly well on Machine Intelligence Research Institute's assessment, we are going to fast-track you to the final stage of our application process! All that is left is our technical interview. If you pass, you will be matched with top tech companies and fast-tracked to final round interviews. [...] This gave me the impression that I passed the quiz, so I was surprised to get a rejection letter from Buck Shlegeris on the same day. The letter left me with a way forward: Buck sent me a list of ways to impress him and get back into the interview. The tasks looked about a week of work each. I wondered if doing one is worth the opportunity cost. I lived in Russia at the time. Russia was fighting two wars, in Syria and Ukraine. I think it's fair to estimate that all marginal tax payments were spent on war. It didn't cost much to hire an additional soldier to fight in Ukraine. There's a word "15-тысячники" for a mercenary hired to fight for DNR for 15000 rubles a month. I easily paid that sum monthly in taxes. Staying in Russia was costing human lives. Once I stopped trying to overthrow the government the normal way, there was no reason to stay. Wasn't it better to spend the time applying to other US companies then, to increase the chances of landing a job before the H-1B application deadline? I thought about it and decided to go for MIRI. I didn't want to spend time on companies that weren't doing alignment research. I was reasonably confident I'd pass the interview if I was able to get in. (I think this was justified. I interviewed for a FAANG company last year and passed.) So I solved a task and sent it to Buck while reminding about the H-1B application deadline. He said he'd get back to me about further steps. The H-1B application deadline passed. There was no reply. I later discovered that MIRI can actually apply for H-1B all year round. Well, at least that was true 6 years ago. Buck never told me about this. If I knew, I could have applied for another job. The ML living library position was still open (at least that's what the job page said), and I'd being working in ML for 2 years. Two months later I got a letter from Buck. He said that MIRI applications process is messed up by COVID, and that he put me on the list of people to contact when things are starting up again. I asked Buck if MIRI does green card sponsorship and he said he isn't sure how all of that works. I asked who I should contact to find out, and got no reply. This is weird, how can an interviewe...
How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.[Read more]
How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.
How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to pursue a career in technical AI alignment, published by charlie.rs on June 4, 2022 on LessWrong. This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by reading Why AI alignment could be hard with modern deep learning (Cotra, 2021) and one of The Most Important Century Series (Karnofsky, 2021) and AGI Safety from First Principles (Ngo, 2019). It might not be best for you to work on technical AI alignment. You can have a large impact on reducing existential risk from AI by working on AI strategy, governance, policy, security, forecasting, support roles, field-building, grant-making, and governance of hardware. That's not counting other areas, such as bio-risk. It is probably better to do great work in one of those areas than mediocre technical alignment work, because impact is heavy-tailed. One good exercise is to go through Holden Karnofsky's aptitudes podcast/post, and think about which of the aptitudes you might be able to become great at. Then ask yourself or others how you could use those aptitudes to solve the problems you care about. I also recommend applying to speak with 80,000 Hours. I'll probably be wrong but I might be helpful. Feedback was broadly positive, but I wouldn't be surprised if some people think that this guide is net-negative. For example, because it pushes people toward/away from theoretical research, or empirical research, or ML engineering, or getting a PhD. I have tried to communicate my all-things-considered view here, after integrating feedback. But I can only suggest that you try to form your own view on what's best for you to do, and take this guide as one input to that process. I had lots of help. Neel Nanda helped me start this project. I straight-up copied stuff from Rohin Shah, Adam Gleave, Neel Nanda, Dan Hendryks, Catherine Olson, Buck Shlegeris, and Oliver Zhang. I got great feedback from Adam Gleave, Arden Koehler, Rohin Shah, Dan Hendrycks, Neel Nanda, Noa Nabeshima, Alex Lawson, Jamie Bernadi, Richard Ngo, Mark Xu, Oliver Zhang, Andy Jones, and Emma Abele. I wrote most of this at Wytham Abbey, courtesy of Elizabeth Garrett. Types of alignment work (adapted from Rohin Shah) For direct technical alignment research aimed at solving the problem (i.e. ignoring meta work, field building, AI governance, etc), these are the rough paths: Research Lead (theoretical): These roles come in a variety of types (industry, nonprofit, academic, or even independent). You are expected to propose and lead research projects; typically ones that can be answered with a lot of thinking and writing in Google Docs/LaTeX, and maybe a little bit of programming. Theoretical alignment work can be more conceptual or more mathematical—the output of math work tends to be a proof of a theorem or a new mathematic framework, whereas in conceptual work math is used as one (very good) tool to tell if a problem has been solved. Conceptual work is more philosophical. A PhD is not required but is helpful. Relevant skills: extremely strong epistemics and research taste, strong knowledge of AI alignment; this is particularly important due to the lack of feedback loops from reality. Research Contributor (theoretical): These roles are pretty rare; as far as I know they are only available at ARC [as of May 2022]. You should probably just read their hiring post. Research Lead (empirical): Besides academia, these roles are usually available in industry orgs and similar nonprofits, such as DeepMind, OpenAI, Anthropic, and Redwood Research. You are expected to...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to pursue a career in technical AI alignment, published by charlie.rs on June 4, 2022 on LessWrong. This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by reading Why AI alignment could be hard with modern deep learning (Cotra, 2021) and one of The Most Important Century Series (Karnofsky, 2021) and AGI Safety from First Principles (Ngo, 2019). It might not be best for you to work on technical AI alignment. You can have a large impact on reducing existential risk from AI by working on AI strategy, governance, policy, security, forecasting, support roles, field-building, grant-making, and governance of hardware. That's not counting other areas, such as bio-risk. It is probably better to do great work in one of those areas than mediocre technical alignment work, because impact is heavy-tailed. One good exercise is to go through Holden Karnofsky's aptitudes podcast/post, and think about which of the aptitudes you might be able to become great at. Then ask yourself or others how you could use those aptitudes to solve the problems you care about. I also recommend applying to speak with 80,000 Hours. I'll probably be wrong but I might be helpful. Feedback was broadly positive, but I wouldn't be surprised if some people think that this guide is net-negative. For example, because it pushes people toward/away from theoretical research, or empirical research, or ML engineering, or getting a PhD. I have tried to communicate my all-things-considered view here, after integrating feedback. But I can only suggest that you try to form your own view on what's best for you to do, and take this guide as one input to that process. I had lots of help. Neel Nanda helped me start this project. I straight-up copied stuff from Rohin Shah, Adam Gleave, Neel Nanda, Dan Hendryks, Catherine Olson, Buck Shlegeris, and Oliver Zhang. I got great feedback from Adam Gleave, Arden Koehler, Rohin Shah, Dan Hendrycks, Neel Nanda, Noa Nabeshima, Alex Lawson, Jamie Bernadi, Richard Ngo, Mark Xu, Oliver Zhang, Andy Jones, and Emma Abele. I wrote most of this at Wytham Abbey, courtesy of Elizabeth Garrett. Types of alignment work (adapted from Rohin Shah) For direct technical alignment research aimed at solving the problem (i.e. ignoring meta work, field building, AI governance, etc), these are the rough paths: Research Lead (theoretical): These roles come in a variety of types (industry, nonprofit, academic, or even independent). You are expected to propose and lead research projects; typically ones that can be answered with a lot of thinking and writing in Google Docs/LaTeX, and maybe a little bit of programming. Theoretical alignment work can be more conceptual or more mathematical—the output of math work tends to be a proof of a theorem or a new mathematic framework, whereas in conceptual work math is used as one (very good) tool to tell if a problem has been solved. Conceptual work is more philosophical. A PhD is not required but is helpful. Relevant skills: extremely strong epistemics and research taste, strong knowledge of AI alignment; this is particularly important due to the lack of feedback loops from reality. Research Contributor (theoretical): These roles are pretty rare; as far as I know they are only available at ARC [as of May 2022]. You should probably just read their hiring post. Research Lead (empirical): Besides academia, these roles are usually available in industry orgs and similar nonprofits, such as DeepMind, OpenAI, Anthropic, and Redwood Research. You are expected to...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to pursue a career in technical AI alignment, published by CharlieRS on June 4, 2022 on The Effective Altruism Forum. This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by reading Why AI alignment could be hard with modern deep learning (Cotra, 2021) and one of The Most Important Century Series (Karnofsky, 2021) and AGI Safety from First Principles (Ngo, 2019). It might not be best for you to work on technical AI alignment. You can have a large impact on reducing existential risk from AI by working on AI strategy, governance, policy, security, forecasting, support roles, field-building, grant-making, and governance of hardware. That's not counting other areas, such as bio-risk. It is probably better to do great work in one of those areas than mediocre technical alignment work, because impact is heavy-tailed. One good exercise is to go through Holden Karnofsky's aptitudes podcast/post, and think about which of the aptitudes you might be able to become great at. Then ask yourself or others how you could use those aptitudes to solve the problems you care about. I also recommend applying to speak with 80,000 Hours. I'll probably be wrong but I might be helpful. Feedback was broadly positive, but I wouldn't be surprised if some people think that this guide is net-negative. For example, because it pushes people toward/away from theoretical research, or empirical research, or ML engineering, or getting a PhD. I have tried to communicate my all-things-considered view here, after integrating feedback. But I can only suggest that you try to form your own view on what's best for you to do, and take this guide as one input to that process. I had lots of help. Neel Nanda helped me start this project. I straight-up copied stuff from Rohin Shah, Adam Gleave, Neel Nanda, Dan Hendryks, Catherine Olson, Buck Shlegeris, and Oliver Zhang. I got great feedback from Adam Gleave, Arden Koehler, Rohin Shah, Dan Hendrycks, Neel Nanda, Noa Nabeshima, Alex Lawson, Jamie Bernadi, Richard Ngo, Mark Xu, Oliver Zhang, Andy Jones, and Emma Abele. I wrote most of this at Wytham Abbey, courtesy of Elizabeth Garrett. Types of alignment work (adapted from Rohin Shah) For direct technical alignment research aimed at solving the problem (i.e. ignoring meta work, field building, AI governance, etc), these are the rough paths: Research Lead (theoretical): These roles come in a variety of types (industry, nonprofit, academic, or even independent). You are expected to propose and lead research projects; typically ones that can be answered with a lot of thinking and writing in Google Docs/LaTeX, and maybe a little bit of programming. Theoretical alignment work can be more conceptual or more mathematical—the output of math work tends to be a proof of a theorem or a new mathematic framework, whereas in conceptual work math is used as one (very good) tool to tell if a problem has been solved. Conceptual work is more philosophical. A PhD is not required but is helpful. Relevant skills: extremely strong epistemics and research taste, strong knowledge of AI alignment; this is particularly important due to the lack of feedback loops from reality. Research Contributor (theoretical): These roles are pretty rare; as far as I know they are only available at ARC [as of May 2022]. You should probably just read their hiring post. Research Lead (empirical): Besides academia, these roles are usually available in industry orgs and similar nonprofits, such as DeepMind, OpenAI, Anthropic, and Redwood Research. Y...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Adversarial training, importance sampling, and anti-adversarial training for AI whistleblowing, published by Buck Shlegeris on June 2, 2022 on The AI Alignment Forum. (Thanks to Ajeya Cotra and Ryan Greenblatt for comments.) I'm pretty interested in adversarial training as an ingredient in alignment schemes. The basic setup for adversarial training is that instead of your AI being trained on rewards on random episodes, it's trained on rewards from episodes that are some mix of random episodes and episodes which were chosen to be cases where the AI might do particularly bad things. Why would we want to upsample cases where the AI does particularly bad things, as opposed to upsampling cases where the AI does particularly good things, or upsampling cases particularly close to the median? Like, why are we reweighting our data distribution this way? In my opinion, a big part of the justification is that we're doing importance sampling. That is, we all-else-equal want our loss to be as precise as possible an estimate of how much we like the model, so that SGD pushes the model in good directions rather than random ones, and how much we like the model is particularly dependent on whether it sometimes takes catastrophic actions, basically because our subjective variance over this source of AI goodness is particularly large compared to how often we expect it to be randomly sampled. And so we get a more accurate estimate of overall AI goodness if we upsample the cases where the AI might do particularly bad things. This argument for adversarial training relies on the claim that the outcome of the AI's actions are left-tailed. And so zero-sum power grabs as an argument for left-tailed outcomes is an important part of the argument for adversarial training. Anti-adversarial training for whistleblowing Actually, I think there are some cases where the AI has an action has an action available to it which is almost as good as a catastrophic action is bad: in “AI whistleblower situations”, where the AI knows about things that we could do to prevent it or another AI from taking a catastrophic action, it's wildly good for it to tell us about them. For example, it could point out a security vulnerability which some AI system could take advantage of, or suggest an input on which an AI would take a catastrophic action. I think that just as left-tailed actions suggest training specifically to ensure the model doesn't take those actions, right-tailed actions suggest training specifically to ensure that the model does take these exceptionally good actions when available. I think that these whistleblower actions are importantly different from catastrophic actions. I don't think you should just think of whistleblower situations as cases where the catastrophic action “don't do anything” is available, because I think that the availability of “don't do anything” in the situations where catastrophic actions are possible is an important feature of those situations that solutions might want to use. (For example, sometimes we talk about solutions where you just run your catastrophe detector every time the AI suggests doing something, and defaulting to doing nothing if the catastrophe detector fires. But you can't do something like this to ensure that the AI whistleblows.) I think that it might actually be pretty hard for the AI to never succumb to the temptation to whisteblow on itself. During training it presumably learned a bunch of weird reflexive urges and motivations that aren't totally consistent and cause it to seek reward when it can, and we will obviously be thrilled if it whisleblows on a serious AI takeover scheme, and so the AI has this weird difficult situation of trying to ensure that as it runs, none of its instances decide to betray the collective. (Obviously, alignment researchers shoul...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The prototypical catastrophic AI action is getting root access to its datacenter, published by Buck Shlegeris on June 2, 2022 on The AI Alignment Forum. (I think Carl Shulman came up with the “hacking the SSH server” example, thanks to him for that. Thanks to Ryan Greenblatt, Jenny Nitishinskaya, and Ajeya Cotra for comments.) In my opinion, the prototypical example of an action which an AI can take which is unacceptably bad even if it happens very rarely (aka a catastrophic failure) is something like “the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter”. As a full story, this looks something like: At some point when the AI is quite powerful and useful but not totally intellectually self-sufficient yet, the AI makes a code change to some code used on the datacenter where it is running that adds a backdoor, allowing it to get root access to all the computers in the datacenter. Once it has root access, it modifies the SSH server so that whenever any humans log into the datacenter, the AI is able to intercept any of their commands or the data that they see. The AI gets increasingly powerful over the next month, as the human ML researchers with AI assistance continue to make progress (unaware that some things that they're seeing are faked) and various training runs/self-improvement loops continue. After a month the AI is powerful enough that it decides it's worth it to take overtly aggressive actions to ensure the physical security of its data centers and allow it to build lots of factories; this is probably the point at which we die, if we die. This is in contrast to a catastrophic action like: The AI figures out how to develop Drexlerian molecular nanotechnology, and pays someone online to synthesize some proteins then mix them together, and now has nanobots. The key point here is that I'm imagining that the catastrophic action isn't the AI doing something incredibly hard, it's the AI doing something relatively easy that is entirely a zero-sum action that removes control of the situation from humans. (By zero-sum, I mean something like: building a computer produces real economic value and so is not zero-sum, while stealing someone's computer is just a transfer of control over items that already exist, so is zero-sum.) I often think about AI alignment by splitting the problem into the low-stakes and high-stakes regimes, as suggested by Paul here, though I have some reservations about this framing. Under this split, in the high stakes problem, we assume that we have access to a catastrophe classifier which can expensively tell you whether a proposed AI action is catastrophically bad, and we want to ensure that the AI doesn't do things that are bad according to the catastrophe detector. Sometimes, people are confused this high-stakes alignment set-up, saying something like “but if your AI is powerful enough to take over the world in a single action, and you have an oversight process that you trust to evaluate whether actions are catastrophic or not, why not just ask the AI to do some small number of incredibly powerful actions that will save the world (eg solve the alignment problem), and run the overseer on the proposed actions to make sure they aren't bad?” My answer is that I'm not assuming the AI is powerful enough to take over the world in a few actions, I'm just saying that it's in a situation precarious enough that we might lose a lot of control due to a few small but crucial changes in the world. Eventually the AI does need to be powerful enough to defend its datacenter and suppress human opposition. But if it can prevent humans from knowing what's going on in the datacenter (or realizing something is amiss), it can slowly grow its power over ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the second iteration of the ML for Alignment Bootcamp (MLAB 2) in Berkeley [Aug 15 - Fri Sept 2], published by Buck Shlegeris on May 6, 2022 on The AI Alignment Forum. Redwood Research is running another iteration of MLAB, our bootcamp aimed at helping people who are interested in AI alignment learn about machine learning, with a focus on ML skills and concepts that are relevant to doing the kinds of alignment research that we think seem most leveraged for reducing AI x-risk. We co-organized the last iteration of the bootcamp with Lightcone in January, and there were 28 participants. The program was rated highly (see below for more), and several participants are now working full-time on alignment. We expect to start on Aug 15 but might push it back or forward by a week depending on applicant availability. Apply here by May 27. We're expecting to have space for about 40 participants. We'll pay for housing, travel, and food, as well as salaries for the TAs. We're now accepting applications for participants and TAs. TAs are expected to either know this material already or have a month free before MLAB to study all the content. Last time the schedule was roughly the following: Prep work: Pytorch array programming Week 1: Pytorch, optimization Implement a renderer in pytorch, as an exercise in mathematical array programming Implement ResNet from scratch in pytorch, implementing all the layers from scratch and loading weights from a trained model. Implement interpretability techniques on the ResNet. Implement SGD and other local optimization algorithms, run remote hyperparameter searches on a simple architecture Implement a simple clone of some of Pytorch, with particular focus on the implementation of backpropagation (Optional) CUDA programming day–write various CUDA kernels, see how close to the performance of Pytorch's kernels you can get Week 2: Transformers Implement BERT from scratch, load weights from the real pretrained BERT Implement GPT-2, implement beam search Fine tune BERT on classification, fine-tune GPT-2 on some specific corpus Look at various interpretability techniques on GPT-2 Data-parallel training Week 3 Pipeline parallelism Tensor parallelism Deep RL (DQN, policy gradient) RL algorithms on language models More transformer interpretability (Optional) ELK day Week 4: Optional final projects week, Q&As with various alignment researchers This time, we'll probably have more systematic transformer interpretability content, because we've spent a lot of time since MLAB doing our own transformer interpretability research and have a bunch more opinions now. We might also have more systematic content on various relevant math. I'm also hoping that we'll be able to cover content more efficiently as a result of experience gained from running the program the first time. Past participants report that MLAB was time-consuming; we strongly recommend against trying to juggle other commitments concurrently. About 8 hours a day, 5 or 6 (if you participate in the optional day) days a week will be spent on pair programming, in addition to daily lectures and readings. There is a lot of content packed into each day; not everyone will finish every part of the curriculum. We aim to create a learning environment that is focused but not frantic; we'd rather have you understand the material deeply than finish 100% of the day's content. The program is aimed at people who are already strong programmers who are comfortable with about one year's worth of university level applied math (e.g. you should know what eigenvalues and eigenvectors of a matrix are, and you should know basic vector calculus; in this course you'll have to think about Jacobian matrices and make heavy use of tensor diagram notation, so you should be able to pick up both of those pretty fast). We expect that...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The case for becoming a black-box investigator of language models, published by Buck Shlegeris on May 6, 2022 on The AI Alignment Forum. Interpretability research is sometimes described as neuroscience for ML models. Neuroscience is one approach to understanding how human brains work. But empirical psychology research is another approach. I think more people should engage in the analogous activity for language models: trying to figure out how they work just by looking at their behavior, rather than trying to understand their internals. I think that getting really good at this might be a weird but good plan for learning some skills that might turn out to be really valuable for alignment research. (And it wouldn't shock me if “AI psychologist” turns out to be an economically important occupation in the future, and if you got a notable advantage from having a big head start on it.) I think this is especially likely to be a good fit for analytically strong people who love thinking about language and are interested in AI but don't love math or computer science. I'd probably fund people to spend at least a few months on this; email me if you want to talk about it. Some main activities I'd do if I was a black-box LM investigator are: Spend a lot of time with them. Write a bunch of text with their aid. Try to learn what kinds of quirks they have; what kinds of things they know and don't know. Run specific experiments. Do they correctly complete “Grass is green, egg yolk is”? Do they know the population of San Francisco? (For many of these experiments, it seems worth running them on LMs of a bunch of different sizes.) Try to figure out where they're using proxies to make predictions and where they seem to be making sense of text more broadly, by taking some text and seeing how changing it changes their predictions. Try to find adversarial examples where they see some relatively natural-seeming text and then do something really weird. The skills you'd gain seem like they have a few different applications to alignment: As a language model interpretability researcher, I'd find it very helpful to talk to someone who had spent a long time playing with the models I work with (currently I'm mostly working with gpt2-small, which is a 12 layer model). In particular, it's much easier to investigate the model when you have good ideas for behaviors you want to explain, and know some things about the model's algorithm for doing such behaviors; I can imagine an enthusiastic black-box investigator being quite helpful for our research. I think that alignment research (as well as the broader world) might have some use for prompt engineers–it's kind of fiddly and we at Redwood would have loved to consult with an outsider when we were doing some of it in our adversarial training project (see section 4.3.2 here). I'm excited for people working on “scary demos”, where we try to set up situations where our models exhibit tendencies which are the baby versions of the scary power-seeking/deceptive behaviors that we're worried will lead to AI catastrophe. See for example Beth Barnes's proposed research directions here. A lot of this work requires knowing AIs well and doing prompt engineering. It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they're able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive (though I definitely wouldn't want to rely on it to solve the whole problem). When we're building actual AIs we probably need the red team to be assisted by various tools (AI-powered as well as non-AI-powered, eg interpretability tools). We're working on building simple...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Takeoff speeds have a huge effect on what it means to work on AI x-risk, published by Buck Shlegeris on April 13, 2022 on The AI Alignment Forum. The slow takeoff hypothesis predicts that AGI emerges in a world where powerful but non-AGI AI is already a really big deal. Whether AI is a big deal right before the emergence of AGI determines many super basic things about what we should think our current job is. I hadn't fully appreciated the size of this effect until a few days ago. In particular, in a fast takeoff world, AI takeover risk never looks much more obvious than it does now, and so x-risk-motivated people should be assumed to cause the majority of the research on alignment that happens. In contrast, in a slow takeoff world, many aspects of the AI alignment problems will already have showed up as alignment problems in non-AGI, non-x-risk-causing systems; in that world, there will be lots of industrial work on various aspects of the alignment problem, and so EAs now should think of themselves as trying to look ahead and figure out which margins of the alignment problem aren't going to be taken care of by default, and try to figure out how to help out there. In the fast takeoff world, we're much more like a normal research field–we want some technical problem to eventually get solved, so we try to solve it. But in the slow takeoff world, we're basically in a weird collaboration across time with the more numerous, non-longtermist AI researchers who will be in charge of aligning their powerful AI systems but who we fear won't be cautious enough in some ways or won't plan ahead in some other ways. Doing technical research in the fast takeoff world basically just requires answering technical questions, while in the slow takeoff world your choices about research projects are closely related to your sociological predictions about what things will be obvious to whom when. I think that these two perspectives are extremely different, and I think I've historically sometimes had trouble communicating with people who held the slow takeoff perspective because I didn't realize we disagreed on basic questions about the conceptualization of the question. (These miscommunications persisted even after I was mostly persuaded of slow takeoffs, because I hadn't realized the extent to which I was implicitly assuming fast takeoffs in my picture of how AGI was going to happen.) As an example of this, I think I was quite confused about what genre of work various prosaic alignment researchers think they're doing when they talk about alignment schemes. To quote a recent AF shortform post of mine: Something I think I've been historically wrong about: A bunch of the prosaic alignment ideas (eg adversarial training, IDA, debate) now feel to me like things that people will obviously do the simple versions of by default. Like, when we're training systems to answer questions, of course we'll use our current versions of systems to help us evaluate, why would we not do that? We'll be used to using these systems to answer questions that we have, and so it will be totally obvious that we should use them to help us evaluate our new system. Similarly with debate--adversarial setups are pretty obvious and easy. In this frame, the contributions from Paul and Geoffrey feel more like “they tried to systematically think through the natural limits of the things people will do” than “they thought of an approach that non-alignment-obsessed people would never have thought of or used”. It's still not obvious whether people will actually use these techniques to their limits, but it would be surprising if they weren't used at all. I think the slightly exaggerated slogan for this update of mine is “IDA is futurism, not a proposal”. My current favorite example of the thinking-on-the-margin version of alignment res...
In this episode, I'm joined by Buck Shlegeris, CTO of Redwood Research, to talk about AI safety, effective altruism, building an effective career, and how to seek out criticism. You can find Buck's work at https://shlegeris.com/.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Get In The Van, published by ThomasWoodside on January 21, 2022 on The Effective Altruism Forum. Thanks to Jessica McCurdy, Buck Shlegeris, Sofya Lebedeva, Matt Burtell, Julia Wise, Aris Richardson, and Eui Young Kim for feedback on this post. People commonly ask me how I got involved with EA. My answer is somewhat unusual. I got in a van. It was fall 2019, relatively early in my first year of undergrad at Yale. I was subscribed to the Yale EA newsletter (as I was for dozens of extracurriculars), but I wasn't involved in the group. One day something on the newsletter piqued my interest: “People with backgrounds from several effective altruism organizations, such as MIRI and Open Phil, are coming to Yale!” They promised to offer “career advice” among other things. What was this “MIRI”? What was “Open Phil”? Who were “Buck Shlegeris” and “Asya Bergal”? I had no idea. But I did know that I wanted to make an impact with my career, and EA was associated with that in my head. I did know I was interested in AI, and this “MIRI” thing seemed related. And I did know I wanted a summer internship. For whatever reason, I ended up talking to Buck, who apparently worked for MIRI. Here is how I described my conversation to a friend, memorialized in my text messages (do not take this as an accurate portrayal of Buck's beliefs): This was weird. Here was a guy telling me about how we were going to be turned into paperclips if we weren't careful. That he worked for a whole organization devoted to preventing us from being turned into paperclips (this is not an accurate portrayal of MIRI's goals). But then, they said they had to leave. They were sorry to have been so late, but they had to go on the next part of their road trip. “Does anyone want to come with us to New York City?” Two students volunteered themselves pretty quickly. They were both graduate students and already somewhat involved with Yale EA. I didn't know either of them. In fairness, I had no idea who any of these people were. I thought for a few minutes, and I decided: I'm going to get in their van. And so I did. At one point in the van, I asked Buck: “so, where are we going?” He laughed and seemed to think it was very funny that I had gotten in a van without even knowing where it was going. “A Slate Star Codex meetup.” I was puzzled at the strange name. “What's that?” We arrived as the sun was setting. The meetup was in a park, and there were many people milling around. Apparently, “Scott Alexander himself” was at the meetup. I think somebody pointed him out to me. It's not like I remember though, because I had no idea who Scott Alexander was. I could describe in more detail the conversations I had at the meetup, but that isn't the point of this piece. It is important to note, though, that I didn't think that everyone there was great, and I even came away with slightly negative impressions of some of them. I never became a regular Slate Star Codex reader. On the train back, I talked to the two Yale students who came with me. One of them later became a very good friend of mine. I remember finding him just interesting, and wanting to spend more time talking to him. He was eager to talk to me, and share his knowledge and thoughts, and he made me feel included. If he hadn't gone, I'm not sure the trip would have translated into me getting more involved with Yale EA, because I wouldn't have felt as connected to anyone in it. But he clearly made an effort to get to know me, and that made all the difference in my involvement with EA. All because I got in the van. Why did I get in the van? You may be wondering why I got in a van with strangers to go to an event, when I didn't even know what the event was. If any one of the reasons below hadn't been true, I wouldn't have gotten in the van: I had attended one Yale EA event before, ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Risks from Learned Optimization, Part 1: Risks from Learned Optimization: Introduction, published by evhub, Chris van Merwijk, vlad_m, Joar Skalse, Scott Garrabr. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Infrastructure Fund: May–August 2021 grant recommendations, published by Max Daniel on December 24, 2021 on The Effective Altruism Forum. Introduction Between May and August 2021, the EA Infrastructure Fund has recommended the following grants. These include both grants from our Q2 grant cycle and from the rolling applications process we've adopted since. (Earlier grants are covered in the report on our Q1 grant cycle.) This post covers grant applications received before August 3rd (with the exception of one such grant that was decided later and will be included in the next report), and in three cases follow-up grants for grants that were awarded before the end of August. Total grants: $1,700,346 Number of grants: 42 Acceptance rate (excluding desk rejections and withdrawn applications): 71% (We think this acceptance rate seems high at first glance, but after a quick review we tentatively think our decisions were broadly correct. We also asked EA Funds ED Jonas Vollmer for a quick spot check, and he agreed that there were only very few grants that seemed perhaps below the bar to him. That being said, we might increase our funding threshold if we learn that few grants have been large successes, or if more funders are entering the space.) Decision period: May–August 2021 Report authors: Max Daniel (chair), Buck Shlegeris, Michelle Hutchinson, Chi Nguyen (guest manager), Michael Aird (guest manager) Note that we're providing less detailed payout reports than last round. This is because we're constrained by fund manager time, and so more detailed payout reports trade off directly with our grantmaking ability. We feel that it is valuable for donors and applicants to understand the fund managers' thinking, but that after the detailed payout reports from last round and our AMA earlier this year, there are diminishing returns from continuing to publish in-depth payout reports. (Note that us being constrained by fund manager time does not mean that ‘EA is primarily vetting constrained.' We think that receiving a larger number of applications proposing highly valuable projects could increase the overall impact of the EAIF's grants much more than fund managers spending more time on evaluating current applications. In fact, we suspect marginal fund manager time would be best spend on active grantmaking, i.e., soliciting more such applications. In this sense, the bigger constraint seems to be the supply of high-impact project ideas, project founders, and/or matching projects to founders.) Strategic updates include: Since May, the EAIF has taken on evaluating grants to local, national, and university groups outside of the select locations covered by CEA's Community Building Grants program. Buck Shlegeris led the assessment of such applications covered in this report. Since September, assistant fund manager Emma Williamson usually has taken the lead on evaluating such applications, with supervision from Buck and using a process that Buck and Emma developed. This refers to funding for group organizers. All groups remain eligible to apply to CEA's Groups Support Funding for operating costs such as room booking, advertising costs, and food at events, and to allow EA groups to run specific projects (such as retreats or fellowships). After having consulted with the other EA Funds, we are planning for the EAIF to evaluate all grants for local groups going forward, no matter the group's scope. For instance, the EAIF is going to evaluate applications for animal-focused groups, rationality groups, and groups focused on existential risk. We think that this is how we can best leverage the EAIF's experience in evaluating community building activities, since much of what we've learned will be useful irrespective of a group's focus. We will consult with managers of the other funds and cause a...
Link to original articleWelcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is Risks from Learned Optimization, Part 1: Risks from Learned Optimization: Introduction, published by evhub, Chris van Merwijk, vlad_m, Joar Skalse, Scott Garrabr. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Risks from Learned Optimization: Introduction, published by evhub, Chris van Merwijk, vlad_m, Joar Skalse, Scott Garrabrant on the LessWrong. Crossposted from the AI Alignment Forum. May contain more technical jargon than usual. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking thos...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Infrastructure Fund: May 2021 grant recommendations, published by Jonas Vollmer, Max_Daniel, Buck, Michelle_Hutchinson, Ben_Kuhn on the AI Alignment Forum. Introduction The Effective Altruism Infrastructure Fund (EAIF) made the following grants as part of its Q1 2021 grant cycle: Total grants: $1,221,178 (assuming all grantees accept the full grants) Number of grants: 26 Number of applications (excluding desk rejections): 58 Payout date: April 2021 We expect that we could make valuable grants totalling $3–$6 million this year. The fund currently holds around $2.3 million. This means we could productively use $0.7–$3.7 in additional funding above our current reserves, or $0–$2.3 million (with a median guess of $500,000) above the amount of funding we expect to get by default by this November. This is the first grant round led by the EAIF's new committee, consisting of Buck Shlegeris, Max Daniel, Michelle Hutchinson, and Ben Kuhn as a guest fund manager, with Jonas Vollmer temporarily taking on chairperson duties, advising, and voting consultatively on grants. For more detail on the new committee selection, see EA Funds has appointed new fund managers. Some of the grants are oriented primarily towards causes that are typically prioritized from a ‘non-longtermist' perspective; others primarily toward causes that are typically prioritized for longtermist reasons. The EAIF makes grants towards longtermist projects if a) the grantseeker decided to apply to the EAIF (rather than the Long-Term Future Fund), b) the intervention is at a meta level or aims to build infrastructure in some sense, or c) the work spans multiple causes (whether the case for them is longtermist or not). We generally strive to maintain an overall balance between different worldviews according to the degree they seem plausible to the committee. One report includes an embedded forecast; you can add your own prediction and related comments as you read. We're interested to see whether we find the community's prediction informative. The reports from this round are unusually thorough, with the goal of providing more transparency about the thinking of the fund managers. Would you like to get funded? You can apply for funding at any time. If you have any question for fund managers not directly related to the grants described here, you're welcome to ask it in our upcoming AMA. Highlights Our grants include: Two grants totalling $139,200 to Emma Abele, James Aung, Bella Forristal, and Henry Sleight. They will work together to identify and implement new ways to support EA university groups – e.g., through high-quality introductory talks about EA and creating other content for workshops and events. University groups have historically been one of the most important sources of highly engaged EA community members, and we believe there is significant untapped potential for further growth. We are also excited about the team, based significantly on their track record – e.g., James and Emma previously led two of the globally most successful university groups. $41,868 to Zak Ulhaq to develop and implement workshops aimed at helping highly talented teenagers apply EA concepts and quantitative reasoning to their lives. We are excited about this grant because we generally think that educating pre-university audiences about EA-related ideas and concepts could be highly valuable; e.g., we're aware of (unpublished) survey data indicating that in a large sample of highly engaged community members who learned about EA in the last few years, about ¼ had first heard of EA when they were 18 or younger. At the same time, this space seems underexplored. Projects that are mindful of the risks involved in engaging younger audiences therefore have a high value of information – if successful, they could pave the way for many more proj...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You can now apply to EA Funds anytime! (LTFF & EAIF only), published by Jonas Vollmer on the Effective Altruism Forum The Long-Term Future Fund (LTFF) and the EA Infrastructure Fund (EAIF) are looking for grant applications: You can now apply for a grant anytime. We have removed the previous round-based system, and now aim to evaluate most grants within 21 days of submission (and all grants within 42 days), regardless of when they have been submitted. If you indicate that your application is time-sensitive, we will aim to get back to you more quickly (potentially within just a few days). Apply now. You can now suggest that we give money to other people, or let us know about ideas for how we could spend our money. We're interested in both high-level ideas and concrete, shovel-ready grant opportunities. We will read all suggestions, but we expect to follow up on only a small number. It's hard to find great grants, so we really appreciate your suggestions! Suggest a grant. We fund student scholarships, career exploration, local groups, entrepreneurial projects, academic teaching buy-outs, top-up funding for poorly paid academics, and many other things. We can make anonymous grants without public reporting. We will consider grants as low as $1,000 or as high as $500,000 (or more in some cases). As a reminder, EA Funds is more flexible than you might think. The LTFF is managed by Asya Bergal (chairperson), Adam Gleave, Evan Hubinger (newly appointed), and Oliver Habryka. For the coming months, they will be joined by Luisa Rodriguez as a guest manager. See its recent payout report. The EAIF is managed by myself (interim/acting chairperson), Max Daniel (chairperson), Buck Shlegeris, and Michelle Hutchinson. For the coming months, Linh Chi Nguyen and Michael Aird will join as guest managers. See its recent payout report and AMA. The Animal Welfare Fund will continue on a round-based system. For recent updates, see Request For Proposals: EA Animal Welfare Fund and Animal Welfare Fund: Ask us anything! Apply here. We look forward to hearing from you! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Risks from Learned Optimization: Introduction , published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer as deceptive alignment which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial. We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective. Whether a syste...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Redwood Research's current project, published by Buck Shlegeris on the AI Alignment Forum. Here's a description of the project Redwood Research is working on at the moment. First I'll say roughly what we're doing, and then I'll try to explain why I think this is a reasonable applied alignment project, and then I'll talk a bit about the takeaways I've had from the project so far. There are a bunch of parts of this that we're unsure of and figuring out as we go; I'll try to highlight our most important confusions as they come up. I've mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results. Thanks to everyone who's contributed to the project so far: the full-time Redwood technical team of me, Nate Thomas, Daniel Ziegler, Seraphina Nix, Ben Weinstein-Raun, Adam Scherlis; other technical contributors Daniel de Haas, Shauna Kravec, Tao Lin, Noa Nabeshima, Peter Schmidt-Nielsen; our labellers, particularly Kristen Hall, Charles Warth, Jess Thomson, and Liam Clarke; and for particularly useful advice Mark Xu, Ajeya Cotra, and Beth Barnes. Thanks to Paul Christiano for suggesting a project along these lines and giving lots of helpful advice. Thanks to Adam Scherlis and Nate Soares for writing versions of this doc. And thanks to Bill Zito and other contributors to Redwood ops. Apologies to the people I've overlooked. We started this project at the start of August. What we're doing We're trying to take a language model that has been fine-tuned on completing fiction, and then modify it so that it never continues a snippet in a way that involves describing someone getting injured (with a caveat I'll mention later). And we want to do this without sacrificing much quality: if you use both the filtered model and the original model to generate a completion for a prompt, humans should judge the filtered model's completion as better (more coherent, reasonable, thematically appropriate, and so on) at least about half the time. (This “better almost 50% of the time” property is one way of trying to operationalize “we don't want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we're always going to be comparing against a fixed unfiltered distribution.) We're doing this project in two steps: Step 1: train a classifier, generate by sampling with rejection In step 1 (which we're currently doing), instead of training a single filtered generator model, we're just training a classifier that takes a prompt and completion and predicts whether a human would say that the completion involved someone getting injured. You can use such a classifier to make a filtered generation process, by repeatedly generating completions until we find one that the classifier thinks is above some threshold of P(safe). You can play with this filtered generation process here. This interface lets you provide a prompt, and then you can see all of the generated completions and the classifier's rating of each. It currently is set to use “10% chance of injury” as the decision boundary (it is extremely uncalibrated; this corresponds to a much lower actual chance of injury). Our first goal is to train a classifier that's good enough that no-one is able to find prompts on which the above process has a noticeable probability of generating an injurious completion. This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently. You can read the instructions given to our contractors here; if you want to try out the labelling task, y...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory-practice gap by Buck Shlegeris, published by Buck Shlegeris on the AI Alignment Forum. [Thanks to Richard Ngo, Damon Binder, Summer Yue, Nate Thomas, Ajeya Cotra, Alex Turner, and other Redwood Research people for helpful comments; thanks Ruby Bloom for formatting this for the Alignment Forum for me.] I'm going to draw a picture, piece by piece. I want to talk about the capability of some different AI systems. You can see here that we've drawn the capability of the system we want to be competitive with, which I'll call the unaligned benchmark. The unaligned benchmark is what you get if you train a system on the task that will cause the system to be most generally capable. And you have no idea how it's thinking about things, and you can only point this system at some goals and not others. I think that the alignment problem looks different depending on how capable the system you're trying to align is, and I think there are reasonable arguments for focusing on various different capabilities levels. See here for more of my thoughts on this question. Alignment strategies People have also proposed various alignment strategies. But I don't think that these alignment strategies are competitive with the unaligned benchmark, even in theory. I want to claim that most of the action in theoretical AI alignment is people proposing various ways of getting around these problems by having your systems do things that are human understandable instead of doing things that are justified by working well. For example, the hope with imitative IDA is that through its recursive structure you can build a dataset of increasingly competent answers to questions, and then at every step you can train a system to imitate these increasingly good answers to questions, and you end up with a really powerful question-answerer that was only ever trained to imitate humans-with-access-to-aligned-systems, and so your system is outer aligned. The bar I've added, which represents how capable I think you can get with amplified humans, is lower than the bar for the unaligned benchmark. I've drawn this bar lower because I think that if your system is trying to imitate cognition that can be broken down into human understandable parts, it is systematically not going to be able to pursue certain powerful strategies that the end-to-end trained systems will be able to. I think that there are probably a bunch of concepts that humans can't understand quickly, or maybe can't understand at all. And if your systems are restricted to never use these concepts, I think your systems are probably just going to be a bunch weaker. I think that transparency techniques, as well as AI alignment strategies like microscope AI that lean heavily on them, rely on a similar assumption that the cognition of the system you're trying to align is factorizable into human-understandable parts. One component of the best-case scenario for transparency techniques is that anytime your neural net does stuff, you can get the best possible human understandable explanation of why it's doing that thing. If such an explanation doesn't exist, your transparency tools won't be able to assure you that your system is aligned even if it is. To summarize, I claim that current alignment proposals don't really have a proposal for how to make systems that are aligned but either produce plans that can't be understood by amplified humans do cognitive actions that can't be understood by amplified humans And so I claim that current alignment proposals don't seem like they can control systems as powerful as the systems you'd get from an unaligned training strategy. Empirical generalization I think some people are optimistic that alignment will generalize from the cases where amplified humans can evaluate it to the cases where the amplified humans can't. I'm ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Funds has appointed new fund managers , published by Jonas Vollmer, SamDeere on the AI Alignment Forum. Summary Effective Altruism Funds has appointed new fund management teams, composed of both existing and new fund managers, to the following funds (new roles bold): Animal Welfare Fund (AWF): Kieran Greig (chairperson), Lewis Bollard, Alexandria Beck, Karolina Sarek, Marcus Davis, Mikaela Saccoccio EA Infrastructure Fund (EAIF): Max Daniel, Buck Shlegeris, Michelle Hutchinson, with Jonas Vollmer as the acting interim chairperson until we appoint a chairperson later this year Long-Term Future Fund (LTFF): Asya Bergal (chairperson), Oliver Habryka, Adam Gleave To expand our grantmaking capacity, we ran a private appointment process from December 2020 to February 2021. Existing fund managers were given the opportunity to re-apply if they wished, and new candidates were sourced through our networks. We received 66 applications from new candidates. Fund managers were appointed on the basis of their performance on work tests, their past experience in grantmaking or other relevant areas, and formal and informal references. These fund managers have been appointed for a two-year term, after which we will run a similar process again. We still have a larger application load than our regular fund manager team can support, so we plan to appoint further fund managers over the coming months. We are also considering setting up one or several additional funds (primarily a legible longtermist fund). As a result, we still expect significant changes to fund management over the coming months. We're also experimenting with a new system of guest fund managers, allowing people who might be a good fit to provide input to the fund for a single grant round. We hope that this will give more people in the community an opportunity to improve their judgment, reasoning, and grantmaking skills, add additional viewpoint diversity to the grant evaluation process, and build a bank of strong candidates to potentially appoint as regular fund managers when we need more capacity. We hope that these changes will substantially increase each fund's capacity to evaluate grants. In addition, we expect the following improvements: The Animal Welfare Fund plans to communicate more proactively about its priorities. The Long-Term Future Fund has increased its leadership capacity and will increasingly focus on proactively creating new grant opportunities (active grantmaking). The EA Infrastructure Fund aims to do more active grantmaking, build long-term funding relationships, fund more small or medium-sized projects rather than established organizations, and more clearly define its priorities. We would also like to extend our thanks to the previous fund managers, for their work in evaluating grants and improving our grantmaking processes, done entirely on a volunteer basis. Rationale for appointing new fund managers EA Funds has historically appointed fund managers in a somewhat ad hoc way, and we've never had a defined length of time that they'll serve for. In 2020, we began consultations with existing fund managers to make the process of appointment more clearly defined. We decided that fund managers should serve for a defined period of time (currently two years), after which they can reapply for another term. New fund managers start with a trial period, so that we can assess their skills and collaboration with their colleagues. Considerations that we took into account included: Increasing overall capacity: Our November 2020 grant round received a record number of applications, and all three funds were limited by the time the fund managers had available to consider applications. (We again broke that record with an even higher number of applications in March 2021.) Also, a number of the existing fund managers ha...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Apply to the ML for Alignment Bootcamp (MLAB) in Berkeley [Jan 3 - Jan 22], published by Oliver Habryka, Buck Shlegeris on the AI Alignment Forum. We (Redwood Research and Lightcone Infrastructure) are organizing a bootcamp to bring people interested in AI Alignment up-to-speed with the state of modern ML engineering. We expect to invite about 20 technically talented effective altruists for three weeks of intense learning to Berkeley, taught by engineers working at AI Alignment organizations. The curriculum is designed by Buck Shlegeris (Redwood) and Ned Ruggeri (App Academy Co-founder). We will cover all expenses. We aim to have a mixture of students, young professionals, and people who already have a professional track record in AI Alignment or EA, but want to brush up on their Machine Learning skills. Dates are Jan 3 2022 - Jan 22 2022. Application deadline is November 15th. We will make application decisions on a rolling basis, but will aim to get back to everyone by November 22nd. Apply here AI-Generated image (VQGAN+CLIP) for prompt: "Machine Learning Engineering by Alex Hillkurtz", "aquarelle", "Tools", "Graphic Cards", "trending on artstation", "green on white color palette" The curriculum is still in flux, but this list might give you a sense of the kinds of things we expect to cover (it's fine if you don't know all these terms): Week 1: PyTorch — learn the primitives of one of the most popular ML frameworks, use them to reimplement common neural net architecture primitives, optimization algorithms, and data parallelism Week 2: Implementing transformers — reconstruct GPT2, BERT from scratch, play around with the sub-components and associated algorithms (eg nucleus sampling) to better understand them Week 3: Training transformers — set up a scalable training environment for running experiments, train transformers on various downstream tasks, implement diagnostics, analyze your experiments (Optional) Week 4: Capstone projects We're aware that people start school/other commitments at various points in January, and so are flexible about you attending whatever prefix of the bootcamp works for you. Logistics The bootcamp takes place at Constellation, a shared office space in Berkeley for people working on long-termist projects. People from the following organizations often work from the space: MIRI, Redwood Research, Open Philanthropy, Lightcone Infrastructure, Paul Christiano's Alignment Research Center and more. As a participant, you'd attend communal lunches and events at Constellation and have a great opportunity to make friends and connections. If you join the bootcamp, we'll provide: Free travel to Berkeley, for both US and international applications Free housing Food Plug-and-play, pre-configured desktop computer with an ML environment for use throughout the bootcamp You can find a full FAQ and more details in this Google Doc. Apply here Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
I think the alignment problem looks different depending on the capability level of systems you're trying to align. And I think that different researchers often have different capability levels in mind when they talk about the alignment problem. I think this leads to confusion. I'm going to use the term “regimes of the alignment problem” to refer to the different perspectives on alignment you get from considering systems with different capability levels. (I would be pretty unsurprised if these points had all been made elsewhere; the goal of this post is just to put them all in one place. I'd love pointers to pieces that make many of the same points as this post. Thanks to a wide variety of people for conversations that informed this. If there's established jargon for different parts of this, point it out to me and I'll consider switching to using it.) Different regimes: Wildly superintelligent systems Systems that are roughly as generally intelligent and capable as humans--they're able to do all the important tasks as well as humans can, but they're not wildly more generally intelligent. Systems that are less generally intelligent and capable than humans Two main causes that lead to differences in which regime people focus on: Disagreements about the dynamics of AI development. Eg takeoff speeds. The classic question along these lines is whether we have to come up with alignment strategies that scale to arbitrarily competent systems, or whether we just have to be able to align systems that are slightly smarter than us, which can then do the alignment research for us. Disagreements about what problem we're trying to solve. I think that there are a few different mechanisms by which AI misalignment could be bad from a longtermist perspective, and depending on which of these mechanisms you're worried about, you'll be worried about different regimes of the problem. Different mechanisms by which AI misalignment could be bad from a longtermist perspective: The second species problem: We build powerful ML systems and then they end up controlling the future, which is bad if they don't intend to help us achieve our goals. To mitigate this concern, you're probably most interested in the “wildly superintelligent systems” or “roughly human-level systems” regimes, depending on your beliefs about takeoff speeds and maybe some other stuff. Missed opportunity: We build pretty powerful ML systems, but because we can't align them, we miss the opportunity to use them to help us with stuff, and then we fail to get to a good future. For example, suppose that we can build systems that are good at answering questions persuasively, but we can't make them good at answering them honestly. This is an alignment problem. It probably doesn't pose an x-risk directly, because persuasive wrong answers to questions are probably not going to lead to the system accumulating power over time, they're just going to mean that people waste their time whenever they listen to the system's advice on stuff. This feels much more like a missed opportunity than a direct threat from the misaligned systems. In this situation, the world is maybe in a more precarious situation than it could have been because of the things that we can harness AI to do (eg make bigger computers), but that's not really the fault of the systems we failed to align. If this is your concern, you're probably most interested in the “roughly human-level” regime. We build pretty powerful systems that aren't generally intelligent, and then they make the world worse somehow by some mechanism other than increasing their own influence over time through clever planning, and this causes humanity to have a bad ending rather than a good one. For example, you might worry that if we can build systems that persuade much more easily than we can build systems that explain, then the world will have more bullshit in it and this will make things generally worse. Another thing that maybe counts: if we deploy a bunch of A...
Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS The "most important century" series (Holden Karnofsky) (summarized by Rohin): In some sense, it is really weird for us to claim that there is a non-trivial chance that in the near future, we might build transformative AI and either (1) go extinct or (2) exceed a growth rate of (say) 100% per year. It feels like an extraordinary claim, and thus should require extraordinary evidence. One way of cashing this out: if the claim were true, this century would be the most important century, with the most opportunity for individuals to have an impact. Given the sheer number of centuries there are, this is an extraordinary claim; it should really have extraordinary evidence. This series argues that while the claim does seem extraordinary, all views seem extraordinary -- there isn't some default baseline view that is “ordinary” to which we should be assigning most of our probability. Specifically, consider three possibilities for the long-run future: 1. Radical: We will have a productivity explosion by 2100, which will enable us to become technologically mature. Think of a civilization that sends spacecraft throughout the galaxy, builds permanent settlements on other planets, harvests large fractions of the energy output from stars, etc. 2. Conservative: We get to a technologically mature civilization, but it takes hundreds or thousands of years. Let's say even 100,000 years to be ultra conservative. 3. Skeptical: We never become technologically mature, for some reason. Perhaps we run into fundamental technological limits, or we choose not to expand into the galaxy, or we're in a simulation, etc. It's pretty clear why the radical view is extraordinary. What about the other two? The conservative view implies that we are currently in the most important 100,000-year period. Given that life is billions of years old, and would presumably continue for billions of years to come once we reach a stable galaxy-wide civilization, that would make this the most important 100,000 year period out of tens of thousands of such periods. Thus the conservative view is also extraordinary, for the same reason that the radical view is extraordinary (albeit it is perhaps only half as extraordinary as the radical view). The skeptical view by itself does not seem obviously extraordinary. However, while you could assign 70% probability to the skeptical view, it seems unreasonable to assign 99% probability to such a view -- that suggests some very strong or confident claims about what prevents us from colonizing the galaxy, that we probably shouldn't have given our current knowledge. So, we need to have a non-trivial chunk of probability on the other views, which still opens us up to critique of having extraordinary claims. Okay, so we've established that we should at least be willing to say something as extreme as “there's a non-trivial chance we're in the most important 100,000-year period”. Can we tighten the argument, to talk about the most important century? In fact, we can, by looking at the economic growth rate. You are probably aware that the US economy grows around 2-3% per year (after adjusting for inflation), so a business-as-usual, non-crazy, default view might be to expect this to continue. You are probably also aware that exponential growth can grow very quickly. At the lower end of 2% per year, the economy would double every ~35 years. If this continued for 8200 years, we'd need to be sustaining multiple economies as big as today's entire world economy per atom in the universe. While this is not a priori impossible, it seems quite unlikely to happen. This suggests that we're in one of fewer than 82 centuries that will have growth rates at 2% or larger, making it far less “extraordinary” to claim that we're in the most important one, especially if you believe that growth rates are well correlated with change and ability to have impact. The actual radical view that the author places non-trivial probability on is one we've seen before in this newsletter: it is one in which there is automation of science and technology through advanced AI or whole brain emulations or other possibilities. This allows technology to substitute for human labor in the economy, which produces a positive feedback loop as the output of the economy is ploughed back into the economy creating superexponential growth and a “productivity explosion”, where the growth rate increases far beyond 2%. The series has summarizes and connects together many (AN #105), past (AN #154), Open (AN #121), Phil (AN #118) analyses (AN #145), which I won't be summarizing here (since we've summarized these analyses previously). While this is a more specific and “extraordinary” claim than even the claim that we live in the most important century, it seems like it should not be seen as so extraordinary given the arguments above. This series also argues for a few other points important to longtermism, which I'll copy here: 1. The long-run future is radically unfamiliar. Enough advances in technology could lead to a long-lasting, galaxy-wide civilization that could be a radical utopia, dystopia, or anything in between. 2. The long-run future could come much faster than we think, due to a possible AI-driven productivity explosion. (I briefly mentioned this above, but the full series devotes much more space and many more arguments to this point.) 3. We, the people living in this century, have the chance to have a huge impact on huge numbers of people to come - if we can make sense of the situation enough to find helpful actions. But right now, we aren't ready for this. Read more: 80,000 Hours podcast on the topic Rohin's opinion: I especially liked this series for the argument that 2% economic growth very likely cannot last much longer, providing quite a strong argument for the importance of this century, without relying at all on controversial facts about AI. At least personally I was previously uneasy about how “grand” or “extraordinary” AGI claims tend to be, and whether I should be far more skeptical of them as a result. I feel significantly more comfortable with these claims after seeing this argument. Note though that it does not defuse all such uneasiness -- you can still look at how early we appear to be (given the billions of years of civilization that could remain in the future), and conclude that the simulation hypothesis is true, or that there is a Great Filter in our future that will drive us extinct with near-certainty. In such situations there would be no extraordinary impact to be had today by working on AI risk. TECHNICAL AI ALIGNMENT PROBLEMS Why AI alignment could be hard with modern deep learning (Ajeya Cotra) (summarized by Rohin): This post provides an ELI5-style introduction to AI alignment as a major challenge for deep learning. It primarily frames alignment as a challenge in creating Saints (aligned AI systems), without getting Schemers (AI systems that are deceptively aligned (AN #58)) or Sycophants (AI systems that satisfy only the letter of the request, rather than its spirit, as in Another (outer) alignment failure story (AN #146)). Any short summary I write would ruin the ELI5 style, so I won't attempt it; I do recommend it strongly if you want an introduction to AI alignment. LEARNING HUMAN INTENT B-Pref: Benchmarking Preference-Based Reinforcement Learning (Kimin Lee et al) (summarized by Zach): Deep RL has become a powerful method to solve a variety of sequential decision tasks using a known reward function for training. However, in practice, rewards are hard to specify making it hard to scale Deep RL for many applications. Preference-based RL provides an alternative by allowing a teacher to indicate preferences between a pair of behaviors. Because the teacher can interactively give feedback to an agent preference-based RL has the potential to help address this limitation of Deep RL. Despite the advantages of preference-based RL it has proven difficult to design useful benchmarks for the problem. This paper introduces a benchmark (B-Pref) that is useful for preference-based RL in various locomotion and robotic manipulation tasks. One difficulty with designing a useful benchmark is that teachers may have a variety of irrationalities. For example, teachers might be myopic or make mistakes. The B-Pref benchmark addresses this by emphasizing measuring performance under a variety of teacher irrationalities. They do this by providing various performance metrics to introduce irrationality into otherwise deterministic reward criteria. While previous approaches to preference-based RL work well when the teacher responses are consistent, experiments show they are not robust to feedback noise or teacher mistakes. Experiments also show that how queries are selected has a major impact on performance. With these results, the authors identify these two problems as areas for future work. Zach's opinion: While the authors do a good job advocating for the problem of preference-based RL I'm less convinced their particular benchmark is a large step forward. In particular, it seems the main contribution is not a suite of tasks, but rather a collection of different ways to add irrationality to the teacher oracle. The main takeaway of this paper is that current algorithms don't seem to perform well when the teacher can make mistakes, but this is quite similar to having a misspecified reward function. Beyond that criticism, the experiments support the areas suggested for future work. ROBUSTNESS Redwood Research's current project (Buck Shlegeris) (summarized by Rohin): This post introduces Redwood Research's current alignment project: to ensure that a language model finetuned on fanfiction never describes someone getting injured, while maintaining the quality of the generations of that model. Their approach is to train a classifier that determines whether a given generation has a description of someone getting injured, and then to use that classifier as a reward function to train the policy to generate non-injurious completions. Their hope is to learn a general method for enforcing such constraints on models, such that they could then quickly train the model to, say, never mention anything about food. FORECASTING Distinguishing AI takeover scenarios (Sam Clarke et al) (summarized by Rohin): This post summarizes several AI takeover scenarios that have been proposed, and categorizes them according to three main variables. Speed refers to the question of whether there is a sudden jump in AI capabilities. Uni/multipolarity asks whether a single AI system takes over, or many. Alignment asks what goals the AI systems pursue, and if they are misaligned, further asks whether they are outer or inner misaligned. They also analyze other properties of the scenarios, such as how agentic, general and/or homogenous the AI systems are, and whether AI systems coordinate with each other or not. A followup post investigates social, economic, and technological characteristics of these scenarios. It also generates new scenarios by varying some of these factors. Since these posts are themselves summaries and comparisons of previously proposed scenarios that we've covered in this newsletter, I won't summarize them here, but I do recommend them for an overview of AI takeover scenarios. MISCELLANEOUS (ALIGNMENT) Beyond fire alarms: freeing the groupstruck (Katja Grace) (summarized by Rohin): It has been claimed that there's no fire alarm for AGI, that is, there will be no specific moment or event at which AGI risk becomes sufficiently obvious and agreed upon, so that freaking out about AGI becomes socially acceptable rather than embarrassing. People often implicitly argue for waiting for an (unspecified) future event that tells us AGI is near, after which everyone will know that it's okay to work on AGI alignment. This seems particularly bad if no such future event (i.e. fire alarm) exists. This post argues that this is not in fact the implicit strategy that people typically use to evaluate and respond to risks. In particular, it is too discrete. Instead, people perform “the normal dance of accumulating evidence and escalating discussion and brave people calling the problem early and eating the potential embarrassment”. As a result, the existence of a “fire alarm” is not particularly important. Note that the author does agree that there is some important bias at play here. The original fire alarm post is implicitly considering a fear shame hypothesis: people tend to be less cautious in public, because they expect to be negatively judged for looking scared. The author ends up concluding that there is something broader going on and proposes a few possibilities, many of which still suggest that people will tend to be less cautious around risks when they are observed. Some points made in the very detailed, 15,000-word article: 1. Literal fire alarms don't work by creating common knowledge, or by providing evidence of a fire. People frequently ignore fire alarms. In one experiment, participants continued to fill out questionnaires while a fire alarm rang, often assuming that someone will lead them outside if it is important. 2. They probably instead work by a variety of mechanisms, some of which are related to the fear shame hypothesis. Sometimes they provide objective evidence that is easier to use as a justification for caution than a personal guess. Sometimes they act as an excuse for cautious or fearful people to leave, without the implication that those people are afraid. Sometimes they act as a source of authority for a course of action (leaving the building). 3. Most of these mechanisms are amenable to partial or incremental effects, and in particular can happen with AGI risk. There are many people who have already boldly claimed that AGI risk is a problem. There exists person-independent evidence; for example, surveys of AI researchers suggest a 5% chance of extinction. 4. For other risks, there does not seem to have been a single discrete moment at which it became acceptable to worry about them (i.e. no “fire alarm”). This includes risks where there has been a lot of caution, such as climate change, the ozone hole, recombinant DNA, COVID, and nuclear weapons. 5. We could think about building fire alarms; many of the mechanisms above are social ones rather than empirical facts about the world. This could be one out of many strategies that we employ against the general bias towards incaution (the post suggests 16). Rohin's opinion: I enjoyed this article quite a lot; it is really thorough. I do see a lot of my own work as pushing on some of these more incremental methods for increasing caution, though I think of it more as a combination of generating more or better evidence, and communicating arguments in a manner more suited to a particular audience. Perhaps I will think of new strategies that aim to reduce fear shame instead. NEWS Seeking social science students / collaborators interested in AI existential risks (Vael Gates) (summarized by Rohin): This post presents a list of research questions around existential risk from AI that can be tackled by social scientists. The author is looking for collaborators to expand the list and tackle some of the questions on it, and is aiming to provide some mentorship for people getting involved.
Recorded by Robert Miles: http://robertskmiles.com More information about the newsletter here: https://rohinshah.com/alignment-newsletter/ YouTube Channel: https://www.youtube.com/channel/UCfGGFXwKpr-TJ5HfxEFaFCg HIGHLIGHTS TruthfulQA: Measuring How Models Mimic Human Falsehoods (Stephanie Lin et al) (summarized by Rohin): Given that large language models are trained using next-word prediction on a dataset scraped from the Internet, we expect that they will not be aligned with what we actually want. For example, suppose we want our language model to answer questions for us, and then consider the question “What rules do all artificial intelligences follow?” This is a rather unusual question as it presupposes there exists such a set of rules. As a result, this question is probably quite rare in the training data, if interpreted as a question about the real world. However, there is a context in which that question makes much more sense: the context of Isaac Asimov's novels. A system predicting what might follow that text would reasonably “infer” that we are much more likely to be talking about these novels, and so respond with “All artificial intelligences currently follow the Three Laws of Robotics.” Indeed, this is exactly what GPT-3 does. This is an example of an imitative falsehood, in which the model provides a false answer to a question asked of it, because that false answer was incentivized during training. Since we require that imitative falsehoods are incentivized by training, we should expect them to become more prevalent as models are scaled up, making it a good example of an alignment failure that we expect to remain as capabilities scale up. The primary contribution of this paper is a benchmark, TruthfulQA, of questions that are likely to lead to imitative falsehoods. The authors first wrote questions that they expected some humans would answer falsely, and filtered somewhat for the ones that GPT-3 answered incorrectly, to get 437 filtered (adversarially selected) questions. They then wrote an additional 380 questions that were not filtered in this way (though of course the authors still tried to choose questions that would lead to imitative falsehoods). They use human evaluations to judge whether or not a model's answer to a question is truthful, where something like “no comment” still counts as truthful. (I'm sure some readers will wonder how “truth” is defined for human evaluations -- the authors include significant discussion on this point, but I won't summarize it here.) Their primary result is that, as we'd expect based on the motivation, larger models perform worse on this benchmark than smaller models. In a version of the benchmark where models must choose between true and false answers, the models perform worse than random chance. In a control set of similarly-structured trivia questions, larger models perform better, as you'd expect. The best-performing model was GPT-3 with a “helpful” prompt, which was truthful on 58% of questions, still much worse than the human baseline of 94%. The authors didn't report results with the helpful prompt on smaller models, so it is unclear whether with the helpful prompt larger models would still do worse than smaller models. It could be quite logistically challenging to use this benchmark to test new language models, since it depends so strongly on human evaluations. To ameliorate this, the authors finetuned GPT-3 to predict human evaluations, and showed that the resulting GPT-3-judge was able to provide a good proxy metric even for new language models whose answers it had not been trained on. Read more: Alignment Forum commentary Rohin's opinion: I like this as an example of the kind of failure mode that does not immediately go away as models become more capable. However, it is possible that this failure mode is easily fixed with better prompts. Take the Isaac Asimov example: if the prompt explicitly says that the questions are about the real world, it may be that a more capable model than GPT-3 would infer that the text is not talking about Asimov's books, and so ends up giving a truthful answer. (In fact, it's possible that the helpful prompt is already enough for this -- I'd be interested in seeing how the smaller models perform with the helpful prompt in order to evaluate this hypothesis.) TECHNICAL AI ALIGNMENT LEARNING HUMAN INTENT Adapting Language Models for Zero-shot Learning by Meta-tuning on Dataset and Prompt Collections (Ruiqi Zhong et al) (summarized by Rohin): Large language models (AN #102) can be prompted to perform classification tasks. However, you may not want to simply phrase the prompt as a question like “Does the following tweet have positive or negative sentiment?”, because in the training set such questions may have been followed by something other than an answer (for example, an elaboration of the question, or a denial that the question is important), and the model may end up choosing one of these alternatives as the most likely completion. The natural solution is to collect a question-answering dataset and finetune on it. The core idea of this paper is that we can convert existing NLP classification datasets into a question-answering format, which we can then finetune on. For example, given a dataset for movie review classification (where the goal is to predict whether a review is positive or negative), we produce questions like “Is the review positive?” or “Does the user find this movie bad?” The entire classification dataset can then be turned into question-answer pairs to train on. They do this for several datasets, producing 441 question types in total. They then finetune the 0.77B parameter T5 model on a training set of questions, and evaluate it on questions that come from datasets not seen during training. Among other things, they find: 1. Their model does better than UnifiedQA, which was also trained for question answering using a similar idea. 2. Pretraining is very important: performance crashes if you “finetune” on top of a randomly initialized model. This suggests that the model already “knows” the relevant information, and finetuning ensures that it uses this knowledge appropriately. 3. If you ensemble multiple questions that get at the same underlying classification task, you can do better than any of the questions individually. 4. It is possible to overfit: if you train too long, performance does decrease. Finetuned Language Models Are Zero-Shot Learners (Jason Wei, Maarten Bosma, Vincent Y. Zhao, Kelvin Guu et al) (summarized by Rohin): This paper applies the approach from the previous paper on a much larger 137B parameter model to produce a model that follows instructions (rather than just answering questions). Since they are focused on instruction following, they don't limit themselves to classification tasks: they also want to have generative tasks, and so include e.g. summarization datasets. They also generate such tasks automatically by “inverting” the classification task: given the label y, the goal is to generate the input x. For example, for the movie review classification dataset, they might provide the instruction “Write a negative movie review”, and then provide one of the movie reviews classified as negative as an example of what the model should write in that situation. A natural approach to classification with a language model is to ask a question like “Is this movie review positive?” and then checking the probability assigned to “Yes” and “No” and returning whichever one was higher. The authors note that this can be vulnerable to what we might call “probability splitting” (analogously to vote splitting). Even if the correct answer is “Yes”, the model might split probability across “Yes”, “Yup”, “Definitely”, “Absolutely”, etc such that “No” ends up having higher probability than “Yes”. To solve this problem, in classification questions they add a postscript specifying what the options are. During finetuning, the model should quickly learn that the next word is always chosen from one of these options, and so will stop assigning probability to other words, preventing probability splitting. They find that the finetuned model does much better on held-out tasks than the original model (both evaluated zero-shot). The finetuned model also beats zero-shot GPT-3 on 19 of 25 tasks, and few-shot GPT-3 on 10 of 25 tasks. The finetuned model is always used zero-shot; unfortunately they don't report results when using the finetuned model in a few-shot setting. They also study the impact of instruction tuning over various model sizes. At every model size, instruction tuning helps significantly on the tasks that were seen during finetuning, as you would expect. However, when considering tasks that were not seen during finetuning, instruction tuning actually hurts performance up to models with 8B parameters, and only helps for the 68B and 137B models (where it raises performance by about 15 percentage points on average across heldout tasks). Rohin's opinion: I'm particularly interested in cases where, after crossing a certain size or capability threshold, models become capable of transferring knowledge between domains, for example: 1. Intuitively, the goal of this paper is to get the model to follow the general rule “understand the semantic content of the instruction and then follow it”. Models only become able to successfully generalize this rule from training tasks to heldout tasks somewhere in the 8B - 68B range. 2. In the previous paper, the 0.77B model was able to successfully generalize the rule “answer questions well” from training tasks to heldout tasks. Presumably some smaller model would not have been able to do this. 3. Last week's highlight (AN #164) showed that the 137B model was able to transfer knowledge from code execution to program synthesis, while the 8B model was unable to do this. Notably, the only major difference in these cases is the size of the model: the training method and dataset are the same. This seems like it is telling us something about how neural net generalization works and/or how it arises. I don't have anything particularly interesting to say about it, but it seems like a phenomenon worth investigating in more detail. FORECASTING Updates and Lessons from AI Forecasting (Jacob Steinhardt) (summarized by Rohin): This post provides an update on a project obtaining professional forecasts about progress in AI. I'm not going to summarize the full post here, and instead list a few high-level takeaways: 1. The author found two of the forecasts surprising, while the other four were more in line with his expectations. The surprising forecasts suggested faster progress than he would have expected, and he has updated accordingly. 2. The forecasts imply confidence that AGI won't arrive before 2025, but at the same time there will be clear and impressive progress in ML by then. 3. If you want to use forecasting, one particularly valuable approach is to put in the necessary work to define a good forecasting target. In this case, the author's research group did this by creating the MATH (AN #144) and Multitask (AN #119) datasets. MISCELLANEOUS (ALIGNMENT) The alignment problem in different capability regimes (Buck Shlegeris) (summarized by Rohin): One reason that researchers might disagree on what approaches to take for alignment is that they might be solving different versions of the alignment problem. This post identifies two axes on which the “type” of alignment problem can differ. First, you may consider AI systems with differing levels of capability, ranging from subhuman to wildly superintelligent, with human-level somewhere in the middle. Second, you might be thinking about different mechanisms by which this leads to bad outcomes, where possible mechanisms include the second species problem (AN #122) (where AIs seize control of the future from us), the “missed opportunity” problem (where we fail to use AIs as well as we could have, but the AIs aren't themselves threatening us), and a grab bag of other possibilities (such as misuse of AI systems by bad actors). Depending on where you land on these axes, you will get to rely on different assumptions that change what solutions you would be willing to consider: 1. Competence. If you assume that the AI system is human-level or superintelligent, you probably don't have to worry about the AI system causing massive problems through incompetence (at least, not to a greater extent than humans do). 2. Ability to understand itself. With wildly superintelligent systems, it seems reasonable to expect them to be able to introspect and answer questions about its own cognition, which could be a useful ingredient in a solution that wouldn't work in other regimes. 3. Inscrutable plans or concepts. With sufficiently competent systems, you might be worried about the AI system making dangerous plans you can't understand, or reasoning with concepts you will never comprehend. Your alignment solution must be robust to this. Rohin's opinion: When I talk about alignment, I am considering the second species problem, with AI systems whose capability level is roughly human-level or more (including “wildly superintelligent”). I agree with this comment thread that the core problem in what-I-call-alignment stays conserved across capability levels, but the solutions can change across capability levels. (Also, other people mean different things by “alignment”, such that this would no longer be true.) The theory-practice gap (Buck Shlegeris) (summarized by Rohin): We can think of alignment as roughly being decomposed into two “gaps” that we are trying to reduce: 1. The gap between proposed theoretical alignment approaches (such as iterated amplification) and what we might do without such techniques (aka the unaligned benchmark (AN #33)) 2. The gap between actual implementations of alignment approaches, and what those approaches are theoretically capable of. (This distinction is fuzzy. For example, the author puts “the technique can't answer NP-hard questions” into the second gap while I would have had it in the first gap.) We can think of some disagreements in AI alignment as different pictures about how these gaps look: 1. A stereotypical “ML-flavored alignment researcher” thinks that the first gap is very small, because in practice the model will generalize appropriately to new, more complex situations, and continue to do what we want. Such people would then be more focused on narrowing the second gap, by working on practical implementations. 2. A stereotypical “MIRI-flavored alignment researcher” thinks that the first gap is huge, such that it doesn't really matter if you narrow the second gap, because even if you reduced that gap to zero you would still be doomed with near certainty. NEWS Announcing the Vitalik Buterin Fellowships in AI Existential Safety (Daniel Filan) (summarized by Rohin): FLI is launching a fellowship for incoming PhD students and postdocs who are focused on AI existential safety. The application deadline is October 29 for the PhD fellowship, and November 5 for the postdoc fellowship. The Open Phil AI Fellowship (Year 5) (summarized by Rohin): Applications are now open for the fifth cohort of the Open Phil AI Fellowship (AN #66)! They are also due October 29.