Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn't consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.
TL;DR: To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts [...]
---
Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix
The original text contained 7 footnotes which were omitted from this narration.
---
First published: March 26th, 2025
Source: https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
---
Narrated by TYPE III AUDIO.
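As context for the probing setup described in the TL;DR above, here is a minimal sketch of the kind of comparison involved, assuming activations and SAE latents have already been extracted for labelled prompts. The file names, the logistic-regression probe, and the AUROC metric are illustrative choices, not necessarily the authors' exact setup.

```python
# Illustrative sketch (not the authors' exact setup): compare a linear probe on raw
# residual-stream activations with a probe on SAE latent activations, evaluated on an
# out-of-distribution test split. All arrays are assumed to be precomputed.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def probe_auc(X_train, y_train, X_test, y_test):
    """Fit a logistic-regression probe in-distribution, report OOD AUROC."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_train, y_train)
    return roc_auc_score(y_test, probe.predict_proba(X_test)[:, 1])

# Hypothetical precomputed features: residual-stream activations and SAE latents
# for the same prompts, with binary harmful-intent labels.
acts_train = np.load("acts_train.npy")        # (n_train, d_model)
sae_train = np.load("sae_latents_train.npy")  # (n_train, d_sae), mostly zeros
y_train = np.load("labels_train.npy")
acts_ood, sae_ood, y_ood = (np.load(f) for f in
                            ("acts_ood.npy", "sae_latents_ood.npy", "labels_ood.npy"))

print("residual-stream probe OOD AUROC:", probe_auc(acts_train, y_train, acts_ood, y_ood))
print("SAE-latent probe OOD AUROC:", probe_auc(sae_train, y_train, sae_ood, y_ood))
```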
Will LLMs soon be made into autonomous agents? Will they lead to job losses? Is AI misinformation overblown? Will it prove easy or hard to create AGI? And how likely is it that it will feel like something to be a superhuman AGI?
With AGI back in the headlines, we bring you 15 opinionated highlights from the show addressing those and other questions, intermixed with opinions from hosts Luisa Rodriguez and Rob Wiblin recorded back in 2023.
Check out the full transcript on the 80,000 Hours website.
You can decide whether the views we expressed (and those from guests) then have held up these last two busy years. You'll hear:
Ajeya Cotra on overrated AGI worries
Holden Karnofsky on the dangers of aligned AI, why unaligned AI might not kill us, and the power that comes from just making models bigger
Ian Morris on why the future must be radically different from the present
Nick Joseph on whether his company's internal safety policies are enough
Richard Ngo on what everyone gets wrong about how ML models work
Tom Davidson on why he believes crazy-sounding explosive growth stories… and Michael Webb on why he doesn't
Carl Shulman on why you'll prefer robot nannies over human ones
Zvi Mowshowitz on why he's against working at AI companies except in some safety roles
Hugo Mercier on why even superhuman AGI won't be that persuasive
Rob Long on the case for and against digital sentience
Anil Seth on why he thinks consciousness is probably biological
Lewis Bollard on whether AI advances will help or hurt nonhuman animals
Rohin Shah on whether humanity's work ends at the point it creates AGI
And of course, Rob and Luisa also regularly chime in on what they agree and disagree with.
Chapters:
Cold open (00:00:00)
Rob's intro (00:00:58)
Rob & Luisa: Bowerbirds compiling the AI story (00:03:28)
Ajeya Cotra on the misalignment stories she doesn't buy (00:09:16)
Rob & Luisa: Agentic AI and designing machine people (00:24:06)
Holden Karnofsky on the dangers of even aligned AI, and how we probably won't all die from misaligned AI (00:39:20)
Ian Morris on why we won't end up living like The Jetsons (00:47:03)
Rob & Luisa: It's not hard for nonexperts to understand we're playing with fire here (00:52:21)
Nick Joseph on whether AI companies' internal safety policies will be enough (00:55:43)
Richard Ngo on the most important misconception in how ML models work (01:03:10)
Rob & Luisa: Issues Rob is less worried about now (01:07:22)
Tom Davidson on why he buys the explosive economic growth story, despite it sounding totally crazy (01:14:08)
Michael Webb on why he's sceptical about explosive economic growth (01:20:50)
Carl Shulman on why people will prefer robot nannies over humans (01:28:25)
Rob & Luisa: Should we expect AI-related job loss? (01:36:19)
Zvi Mowshowitz on why he thinks it's a bad idea to work on improving capabilities at cutting-edge AI companies (01:40:06)
Holden Karnofsky on the power that comes from just making models bigger (01:45:21)
Rob & Luisa: Are risks of AI-related misinformation overblown? (01:49:49)
Hugo Mercier on how AI won't cause misinformation pandemonium (01:58:29)
Rob & Luisa: How hard will it actually be to create intelligence? (02:09:08)
Robert Long on whether digital sentience is possible (02:15:09)
Anil Seth on why he believes in the biological basis of consciousness (02:27:21)
Lewis Bollard on whether AI will be good or bad for animal welfare (02:40:52)
Rob & Luisa: The most interesting new argument Rob's heard this year (02:50:37)
Rohin Shah on whether AGI will be the last thing humanity ever does (02:57:35)
Rob's outro (03:11:02)
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Transcriptions and additional content editing: Katy Moore
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours.
Who are we?
We're the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we've evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We've also been growing since our last post: by 39% last year [...]
---
Outline:
(00:32) Who are we?
(01:32) What have we been up to?
(02:16) Frontier Safety
(02:38) FSF
(04:05) Dangerous Capability Evaluations
(05:12) Mechanistic Interpretability
(08:54) Amplified Oversight
(09:23) Theoretical Work on Debate
(10:32) Empirical Work on Debate
(11:37) Causal Alignment
(12:47) Emerging Topics
(14:57) Highlights from Our Collaborations
(17:07) What are we planning next?
---
First published: August 20th, 2024
Source: https://www.lesswrong.com/posts/79BPxvSsjzBkiSyTq/agi-safety-and-alignment-at-google-deepmind-a-summary-of
---
Narrated by TYPE III AUDIO.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work, published by Rohin Shah on August 20, 2024 on The AI Alignment Forum. We wanted to share a recap of our recent outputs with the AF community. Below, we fill in some details about what we have been working on, what motivated us to do it, and how we thought about its importance. We hope that this will help people build off things we have done and see how their work fits with ours. Who are we? We're the main team at Google DeepMind working on technical approaches to existential risk from AI systems. Since our last post, we've evolved into the AGI Safety & Alignment team, which we think of as AGI Alignment (with subteams like mechanistic interpretability, scalable oversight, etc.), and Frontier Safety (working on the Frontier Safety Framework, including developing and running dangerous capability evaluations). We've also been growing since our last post: by 39% last year, and by 37% so far this year. The leadership team is Anca Dragan, Rohin Shah, Allan Dafoe, and Dave Orr, with Shane Legg as executive sponsor. We're part of the overall AI Safety and Alignment org led by Anca, which also includes Gemini Safety (focusing on safety training for the current Gemini models), and Voices of All in Alignment, which focuses on alignment techniques for value and viewpoint pluralism. What have we been up to? It's been a while since our last update, so below we list out some key work published in 2023 and the first part of 2024, grouped by topic / sub-team. Our big bets for the past 1.5 years have been 1) amplified oversight, to enable the right learning signal for aligning models so that they don't pose catastrophic risks, 2) frontier safety, to analyze whether models are capable of posing catastrophic risks in the first place, and 3) (mechanistic) interpretability, as a potential enabler for both frontier safety and alignment goals. Beyond these bets, we experimented with promising areas and ideas that help us identify new bets we should make. Frontier Safety The mission of the Frontier Safety team is to ensure safety from extreme harms by anticipating, evaluating, and helping Google prepare for powerful capabilities in frontier models. While the focus so far has been primarily around misuse threat models, we are also working on misalignment threat models. FSF We recently published our Frontier Safety Framework, which, in broad strokes, follows the approach of responsible capability scaling, similar to Anthropic's Responsible Scaling Policy and OpenAI's Preparedness Framework. The key difference is that the FSF applies to Google: there are many different frontier LLM deployments across Google, rather than just a single chatbot and API (this in turn affects stakeholder engagement, policy implementation, mitigation plans, etc). We're excited that our small team led the Google-wide strategy in this space, and demonstrated that responsible capability scaling can work for large tech companies in addition to small startups. A key area of the FSF we're focusing on as we pilot the Framework, is how to map between the critical capability levels (CCLs) and the mitigations we would take. This is high on our list of priorities as we iterate on future versions. Some commentary (e.g. here) also highlighted (accurately) that the FSF doesn't include commitments. 
This is because the science is in early stages and best practices will need to evolve. But ultimately, what we care about is whether the work is actually done. In practice, we did run and report dangerous capability evaluations for Gemini 1.5 that we think are sufficient to rule out extreme risk with high confidence. Dangerous Capability Evaluations Our paper on Evaluating Frontier Models for Dangerous Capabilities is the broadest suite of dangerous capability evaluati...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Lacks Good Paradigms, published by Daniel Tan on July 18, 2024 on LessWrong. Note: I wrote this post rather quickly as an exercise in sharing rough / unpolished thoughts. I am also not an expert on some of the things I've written about. If you spot mistakes or would like to point out missed work / perspectives, please feel free! Note 2: I originally sent this link to some people for feedback, but I was having trouble viewing the comments on the draft. The post was also in a reasonably complete state, so I decided to just publish it - and now I can see the comments! If you're one of those people, feedback is still very much welcome! Mechanistic Interpretability (MI) is a popular and rapidly growing field of technical AI safety research. As a field, it's extremely accessible, requiring comparatively few computational resources, and facilitates rapid learning, due to a very short feedback loop. This means that many junior researchers' first foray into AI safety research is in MI (myself included); indeed, this occurs to the extent where some people feel MI is over-subscribed relative to other technical agendas. However, how useful is this MI research? A very common claim on MI's theory of impact (ToI) is that MI helps us advance towards a "grand unifying theory" (GUT) of deep learning. One of my big cruxes for this ToI is whether MI admits "paradigms" which facilitate correct thinking and understanding of the models we aim to interpret. In this post, I'll critically examine several leading candidates for "paradigms" in MI, consider the available evidence for / against, and identify good future research directions (IMO). At the end, I'll conclude with a summary of the main points and an overview of the technical research items I've outlined. Towards a Grand Unifying Theory (GUT) with MI Proponents of this argument believe that, by improving our basic understanding of neural nets, MI yields valuable insights that can be used to improve our agents, e.g. by improving architectures or by improving their training processes. This allows us to make sure future models are safe and aligned. Some people who have espoused this opinion: Richard Ngo has argued here that MI enables "big breakthroughs" towards a "principled understanding" of deep learning. Rohin Shah has argued here that MI builds "new affordances" for alignment methods. Evan Hubinger has argued for MI here because it helps us identify "unknown unknowns". Leo Gao argues here that MI aids in "conceptual research" and "gets many bits" per experiment. As a concrete example of work that I think would not have been possible without fundamental insights from MI: steering vectors, a.k.a. representation engineering, and circuit breakers, which were obviously inspired by the wealth of work in MI demonstrating the linear representation hypothesis. It's also important to remember that the value of fundamental science often seems much lower in hindsight, because humans quickly adjust their perspectives. Even if MI insights seem like common sense to us nowadays, their value in instrumenting significant advances can't be overstated. (Aside) A corollary of this argument is that MI could likely have significant capabilities externalities. Becoming better at building powerful and instruction-aligned agents may inadvertently accelerate us towards AGI. 
This point has been made in depth elsewhere, so I won't elaborate further here. A GUT Needs Paradigms Paradigm - an overarching framework for thinking about a field In his seminal book, The Structure of Scientific Revolution, Thomas Kuhn catalogues scientific progress in many different fields (spanning physics, chemistry, biology), and distills general trends about how these fields progress. Central to his analysis is the notion of a "paradigm" - an overarching framework for th...
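As a concrete illustration of the steering-vector example mentioned above, here is a minimal sketch in the spirit of activation-addition methods: under the linear representation hypothesis, the difference of mean activations between two contrasting prompt sets approximates a direction for the concept that separates them. The file names, layer choice, and scaling factor are hypothetical; extracting activations is model-specific and omitted here.

```python
# Illustrative steering-vector sketch (hypothetical data, not any particular paper's code):
# compute a concept direction from contrastive activations, then add it to a layer's
# residual stream during a forward pass to steer generations.
import numpy as np

pos_acts = np.load("acts_concept_present.npy")  # (n_pos, d_model) residual-stream activations
neg_acts = np.load("acts_concept_absent.npy")   # (n_neg, d_model)

steering_vec = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
steering_vec /= np.linalg.norm(steering_vec)

def steer(resid: np.ndarray, alpha: float = 5.0) -> np.ndarray:
    """Add the steering direction to residual-stream activations at the chosen layer."""
    return resid + alpha * steering_vec
```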
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Talent Needs in Technical AI Safety, published by yams on May 24, 2024 on LessWrong. Co-Authors: @yams, @Carson Jones, @McKennaFitzgerald, @Ryan Kidd
MATS tracks the evolving landscape of AI safety[1] to ensure that our program continues to meet the talent needs of safety teams. As the field has grown, it's become increasingly necessary to adopt a more formal approach to this monitoring, since relying on a few individuals to intuitively understand the dynamics of such a vast ecosystem could lead to significant missteps.[2] In the winter and spring of 2024, we conducted 31 interviews, ranging in length from 30 to 120 minutes, with key figures in AI safety, including senior researchers, organization leaders, social scientists, strategists, funders, and policy experts. This report synthesizes the key insights from these discussions. The overarching perspectives presented here are not attributed to any specific individual or organization; they represent a collective, distilled consensus that our team believes is both valuable and responsible to share. Our aim is to influence the trajectory of emerging researchers and field-builders, as well as to inform readers on the ongoing evolution of MATS and the broader AI Safety field. All interviews were conducted on the condition of anonymity.
Needs by Organization Type
Organization type: Talent needs
Scaling Lab (i.e. OpenAI, DeepMind, Anthropic) Safety Teams: Iterators > Amplifiers
Small Technical Safety Orgs: Machine Learning (ML) Engineers
Growing Technical Safety Orgs (10-30 FTE): Amplifiers > Iterators
Independent Research: Iterators > Connectors
Here, ">" means "are prioritized over."
Archetypes
We found it useful to frame the different profiles of research strengths and weaknesses as belonging to one of three archetypes (one of which has two subtypes). These aren't as strict as, say, Diablo classes; this is just a way to get some handle on the complex network of skills involved in AI safety research. Indeed, capacities tend to converge with experience, and neatly classifying more experienced researchers often isn't possible. We acknowledge past framings by Charlie Rogers-Smith and Rohin Shah (research lead/contributor), John Wentworth (theorist/experimentalist/distillator), Vanessa Kosoy (proser/poet), Adam Shimi (mosaic/palimpsests), and others, but believe our framing of current AI safety talent archetypes is meaningfully different and valuable, especially pertaining to current funding and employment opportunities.
Connectors / Iterators / Amplifiers
Connectors are strong conceptual thinkers who build a bridge between contemporary empirical work and theoretical understanding. Connectors include people like Paul Christiano, Buck Shlegeris, Evan Hubinger, and Alex Turner[3]; researchers doing original thinking on the edges of our conceptual and experimental knowledge in order to facilitate novel understanding. Note that most Connectors are typically not purely theoretical; they still have the technical knowledge required to design and run experiments. However, they prioritize experiments and discriminate between research agendas based on original, high-level insights and theoretical models, rather than on spur of the moment intuition or the wisdom of the crowds.
Pure Connectors often have a long lead time before they're able to produce impactful work, since it's usually necessary for them to download and engage with varied conceptual models. For this reason, we make little mention of a division between experienced and inexperienced Connectors. Iterators are strong empiricists who build tight, efficient feedback loops for themselves and their collaborators. Ethan Perez is the central contemporary example here; his efficient prioritization and effective use of frictional time has empowered him to make major contributions to a wide range of empir...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Improving Dictionary Learning with Gated Sparse Autoencoders, published by Neel Nanda on April 25, 2024 on The AI Alignment Forum. Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!) They achieve similar reconstruction with about half as many firing features, and while being either comparably or more interpretable (confidence interval for the increase is 0%-13%). See Sen's Twitter summary, my Twitter summary, and the paper! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
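For readers who want a concrete picture of the architecture announced above, here is a minimal PyTorch sketch of a gated SAE's forward pass in the spirit of the paper: a gating path decides which latents fire and a weight-tied magnitude path decides how strongly, with a shared decoder. Initialisation details and the training loss (reconstruction, sparsity, and auxiliary terms) are omitted, so treat this as an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class GatedSAE(nn.Module):
    """Sketch of a gated sparse autoencoder: a gating path picks which latents are
    active, a tied magnitude path sets their values. Training losses are omitted."""

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_gate = nn.Parameter(torch.zeros(d_sae))
        self.r_mag = nn.Parameter(torch.zeros(d_sae))   # per-latent rescaling of W_gate
        self.b_mag = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate.T + self.b_gate                            # which latents fire
        pi_mag = x_cent @ (self.W_gate * self.r_mag.exp().unsqueeze(1)).T + self.b_mag
        f = (pi_gate > 0).float() * torch.relu(pi_mag)                            # sparse latent activations
        x_hat = f @ self.W_dec.T + self.b_dec                                     # reconstruction
        return x_hat, f

# Usage: encode a batch of residual-stream activations (sizes are arbitrary examples).
sae = GatedSAE(d_model=2048, d_sae=16384)
x_hat, latents = sae(torch.randn(8, 2048))
```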
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AXRP Episode 29 - Science of Deep Learning with Vikrant Varma, published by DanielFilan on April 25, 2024 on The AI Alignment Forum. In 2022, it was announced that a fairly simple method can be used to extract the true beliefs of a language model on any given topic, without having to actually understand the topic at hand. Earlier, in 2021, it was announced that neural networks sometimes 'grok': that is, when training them on certain tasks, they initially memorize their training data (achieving their training goal in a way that doesn't generalize), but then suddenly switch to understanding the 'real' solution in a way that generalizes. What's going on with these discoveries? Are they all they're cracked up to be, and if so, how are they working? In this episode, I talk to Vikrant Varma about his research getting to the bottom of these questions.
Topics we discuss:
Challenges with unsupervised LLM knowledge discovery, aka contra CCS
What is CCS?
Consistent and contrastive features other than model beliefs
Understanding the banana/shed mystery
Future CCS-like approaches
CCS as principal component analysis
Explaining grokking through circuit efficiency
Why research science of deep learning?
Summary of the paper's hypothesis
What are 'circuits'?
The role of complexity
Many kinds of circuits
How circuits are learned
Semi-grokking and ungrokking
Generalizing the results
Vikrant's research approach
The DeepMind alignment team
Follow-up work
Daniel Filan: Hello, everybody. In this episode I'll be speaking with Vikrant Varma, a research engineer at Google DeepMind, and the technical lead of their sparse autoencoders effort. Today, we'll be talking about his research on problems with contrast-consistent search, and also explaining grokking through circuit efficiency. For links to what we're discussing, you can check the description of this episode and you can read the transcript at axrp.net. All right, well, welcome to the podcast. Vikrant Varma: Thanks, Daniel. Thanks for having me.
Challenges with unsupervised LLM knowledge discovery, aka contra CCS
What is CCS?
Daniel Filan: Yeah. So first, I'd like to talk about this paper. It is called Challenges with Unsupervised LLM Knowledge Discovery, and the authors are Sebastian Farquhar, you, Zachary Kenton, Johannes Gasteiger, Vladimir Mikulik, and Rohin Shah. This is basically about this thing called CCS. Can you tell us: what does CCS stand for and what is it? Vikrant Varma: Yeah, CCS stands for contrastive-consistent search. I think to explain what it's about, let me start from a more fundamental problem that we have with advanced AI systems. One of the problems is that when we train AI systems, we're training them to produce outputs that look good to us, and so this is the supervision that we're able to give to the system. We currently don't really have a good idea of how an AI system or how a neural network is computing those outputs. And in particular, we're worried about the situation in the future when the amount of supervision we're able to give it causes it to achieve a superhuman level of performance at that task. By looking at the network, we can't know how this is going to behave in a new situation. And so the Alignment Research Center put out a report recently about this problem. They named a potential part of this problem as "eliciting latent knowledge".
What this means is if your model is, for example, really, really good at figuring out what's going to happen next in a video, as in it's able to predict the next frame of a video really well given a prefix of the video, this must mean that it has some sort of model of what's going on in the world. Instead of using the outputs of the model, if you could directly look at what it understands about the world, then potentially, you could use that information in a much safer ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Run evals on base models too!, published by orthonormal on April 4, 2024 on The AI Alignment Forum. (Creating more visibility for a comment thread with Rohin Shah.) Currently, DeepMind's capabilities evals are run on the post-RL*F (RLHF/RLAIF) models and not on the base models. This worries me because RL*F will train a base model to stop displaying capabilities, but this isn't a guarantee that it trains the model out of having the capabilities. Consider by analogy using RLHF on a chess-playing AI, where the trainers reward it for putting up a good fight and making the trainer work hard to win, but punish it for ever beating the trainer. There are two things to point out about this example: Running a simple eval on the post-RLHF model would reveal a much lower ELO than if you ran it on the base model, because it would generally find a way to lose. (In this example, you can imagine the red team qualitatively noticing the issue, but the example is an artificially simple one!) The post-RLHF model still has much of its chess knowledge latently available, in order to put up a good fight across the full range of human ability. Possibly it's even superhuman at chess - I know I'd have to be better than you at chess in order to optimize well for an entertaining game for you. But that won't show up in its ELO. So it seems to me like running evals on the base model as well as the post-RL*F model is an extremely sensible precaution against (1), and I'd love to be reassured either that this is unnecessary for some really obvious and ironclad reason, or that someone is already working on this. And I don't have any good suggestion on (2), the idea that RL*F could reinforce a capability while also concealing it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AtP*: An efficient and scalable method for localizing LLM behaviour to components, published by Neel Nanda on March 18, 2024 on The AI Alignment Forum. Authors: János Kramár, Tom Lieberum, Rohin Shah, Neel Nanda
A new paper from the Google DeepMind mechanistic interpretability team, from core contributors János Kramár and Tom Lieberum.
Tweet thread summary, paper
Abstract: Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive for SoTA Large Language Models (LLMs). We investigate Attribution Patching (AtP), a fast gradient-based approximation to Activation Patching and find two classes of failure modes of AtP which lead to significant false negatives. We propose a variant of AtP called AtP*, with two changes to address these failure modes while retaining scalability. We present the first systematic study of AtP and alternative methods for faster activation patching and show that AtP significantly outperforms all other investigated methods, with AtP* providing further significant improvement. Finally, we provide a method to bound the probability of remaining false negatives of AtP* estimates. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
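To make the abstract's distinction concrete, here is a minimal sketch of the attribution-patching idea on a toy model: estimate the effect of patching clean activations into a corrupted run with a first-order Taylor expansion, i.e. gradients times activation differences, instead of rerunning the model once per component. The toy network, inputs, and metric are hypothetical stand-ins for the transformer components studied in the paper, and none of AtP*'s corrections are included.

```python
# Toy sketch of attribution patching (AtP) as a gradient-based approximation to
# activation patching. The "component" here is each hidden unit of a small MLP.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

def run_with_cache(x):
    """Forward pass caching the hidden activation (the 'components' we would patch)."""
    acts = {}
    def hook(_, __, out):
        acts["hidden"] = out
    handle = model[1].register_forward_hook(hook)
    metric = model(x).sum()   # stand-in for a behavioural metric (e.g. a logit difference)
    handle.remove()
    return metric, acts["hidden"]

x_clean, x_corrupt = torch.randn(4, 16), torch.randn(4, 16)

_, act_clean = run_with_cache(x_clean)
metric_corrupt, act_corrupt = run_with_cache(x_corrupt)

# Gradient of the metric w.r.t. the hidden activation, evaluated on the corrupted run.
grad_corrupt, = torch.autograd.grad(metric_corrupt, act_corrupt)

# AtP estimate of the metric change from patching the clean activation into the
# corrupted run, per component, via a first-order Taylor expansion.
atp_estimate = ((act_clean - act_corrupt) * grad_corrupt).sum(dim=0)
print(atp_estimate.shape)  # one attribution score per hidden unit: torch.Size([32])
```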
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More people getting into AI safety should do a PhD, published by AdamGleave on March 14, 2024 on The AI Alignment Forum. Doing a PhD is a strong option to get great at developing and evaluating research ideas. These skills are necessary to become an AI safety research lead, one of the key talent bottlenecks in AI safety, and are helpful in a variety of other roles. By contrast, my impression is that currently many individuals with the goal of being a research lead pursue options like independent research or engineering-focused positions instead of doing a PhD. This post details the reasons I believe these alternatives are usually much worse at training people to be research leads. I think many early-career researchers in AI safety are undervaluing PhDs. Anecdotally, I think it's noteworthy that people in the AI safety community were often surprised to find out I was doing a PhD, and positively shocked when I told them I was having a great experience. In addition, I expect many of the negatives attributed to PhDs are really negatives on any pathway involving open-ended, exploratory research that is key to growing to become a research lead. I am not arguing that most people contributing to AI safety should do PhDs. In fact, a PhD is not the best preparation for the majority of roles. If you want to become a really strong empirical research contributor, then start working as a research engineer on a great team: you will learn how to execute and implement faster than in a PhD. There are also a variety of key roles in communications, project management, field building and operations where a PhD is of limited use. But we believe a PhD is excellent preparation for becoming a research lead with your own distinctive research direction that you can clearly communicate and ultimately supervise junior researchers to work on. However, career paths are highly individual and involve myriad trade-offs. Doing a PhD may or may not be the right path for any individual person: I simply think it has a better track record than most alternatives, and so should be the default for most people. In the post I'll also consider counter-arguments to a PhD, as well as reasons why particular people might be better fits for alternative options. I also discuss how to make the most of a PhD if you do decide to pursue this route. Author Contributions: This post primarily reflects the opinion of Adam Gleave so is written using an "I" personal pronoun. Alejandro Ortega and Sean McGowan made substantial contributions writing the initial draft of the post based on informal conversations with Adam. This resulting draft was then lightly edited by Adam, including feedback & suggestions from Euan McLean and Siao Si Looi. Why be a research lead? AI safety progress can be substantially accelerated by people who can develop and evaluate new ideas, and mentor new people to develop this skill. Other skills are also in high demand, such as entrepreneurial ability, people management and ML engineering. But being one of the few researchers who can develop a compelling new agenda is one of the best roles to fill. This ability also pairs well with other skills: for example, someone with a distinct agenda who is also entrepreneurial would be well placed to start a new organisation. 
Inspired by Rohin Shah's terminology, I will call this kind of person a research lead: someone who generates (and filters) research ideas and determines how to respond to results. Research leads are expected to propose and lead research projects. They need strong knowledge of AI alignment and ML. They also need to be at least competent at executing on research projects: for empirically focused projects, this means adequate programming and ML engineering ability, whereas a theory lead would need stronger mathematical ability. However, what real...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What's up with LLMs representing XORs of arbitrary features?, published by Sam Marks on January 3, 2024 on The AI Alignment Forum. Thanks to Clément Dumas, Nikola Jurković, Nora Belrose, Arthur Conmy, and Oam Patel for feedback. In the comments of the post on Google Deepmind's CCS challenges paper, I expressed skepticism that some of the experimental results seemed possible. When addressing my concerns, Rohin Shah made some claims along the lines of "If an LLM linearly represents features a and b, then it will also linearly represent their XOR, a⊕b, and this is true even in settings where there's no obvious reason the model would need to make use of the feature a⊕b."[1] For reasons that I'll explain below, I thought this claim was absolutely bonkers, both in general and in the specific setting that the GDM paper was working in. So I ran some experiments to prove Rohin wrong. The result: Rohin was right and I was wrong. LLMs seem to compute and linearly represent XORs of features even when there's no obvious reason to do so. I think this is deeply weird and surprising. If something like this holds generally, I think this has importance far beyond the original question of "Is CCS useful?" In the rest of this post I'll:
Articulate a claim I'll call "representation of arbitrary XORs (RAX)": LLMs compute and linearly represent XORs of arbitrary features, even when there's no reason to do so.
Explain why it would be shocking if RAX is true. For example, without additional assumptions, RAX implies that linear probes should utterly fail to generalize across distributional shift, no matter how minor the distributional shift. (Empirically, linear probes often do generalize decently.)
Present experiments showing that RAX seems to be true in every case that I've checked.
Think through what RAX would mean for AI safety research: overall, probably a bad sign for interpretability work in general, and work that relies on using simple probes of model internals (e.g. ELK probes or coup probes) in particular.
Make some guesses about what's really going on here.
Overall, this has left me very confused: I've found myself simultaneously having (a) an argument that A⇒¬B, (b) empirical evidence of A, and (c) empirical evidence of B. (Here A = RAX and B = other facts about LLM representations.)
The RAX claim: LLMs linearly represent XORs of arbitrary features, even when there's no reason to do so
To keep things simple, throughout this post, I'll say that a model linearly represents a binary feature f if there is a linear probe out of the model's latent space which is accurate for classifying f; in this case, I'll denote the corresponding direction as v_f. This is not how I would typically use the terminology "linearly represents" - normally I would reserve the term for a stronger notion which, at minimum, requires the model to actually make use of the feature direction when performing cognition involving the feature[2]. But I'll intentionally abuse the terminology here because I don't think this distinction matters much for what I'll discuss. If a model linearly represents features a and b, then it automatically linearly represents a∧b and a∨b. However, a⊕b is not automatically linearly represented - no linear probe in the figure above would be accurate for classifying a⊕b.
Thus, if the model wants to make use of the feature a⊕b, then it needs to do something additional: allocate another direction[3] (more model capacity) to representing a⊕b, and also perform the computation of a⊕b so that it knows what value to store along this new direction. The representation of arbitrary XORs (RAX) claim, in its strongest form, asserts that whenever a LLM linearly represents features a and b, it will also linearly represent a⊕b. Concretely, this might look something like: in layer 5, the model computes and linearly r...
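To see concretely why a⊕b needs its own direction, here is a toy sketch (synthetic data, not the post's actual experiments): in the two-dimensional subspace given by the probe directions for a and b, a linear probe recovers a∧b and a∨b but not a⊕b, and adding a dedicated third direction carrying a⊕b restores separability.

```python
# Toy illustration: AND and OR of two linearly represented binary features are linearly
# separable in the 2-D subspace spanned by their probe directions, but XOR is not.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=2000)
b = rng.integers(0, 2, size=2000)
# "Activations" along the a- and b-probe directions, plus a little noise.
X = np.stack([a, b], axis=1).astype(float) + 0.05 * rng.standard_normal((2000, 2))

for name, y in [("a AND b", a & b), ("a OR b", a | b), ("a XOR b", a ^ b)]:
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(f"linear probe accuracy on {name}: {acc:.2f}")
# Expected: close to 1.00 for AND and OR, roughly 0.5-0.75 for XOR (it is not linearly
# separable in this subspace).

# Adding a dedicated third direction that carries a XOR b restores linear separability.
X3 = np.concatenate([X, (a ^ b)[:, None].astype(float)], axis=1)
print("with an extra XOR direction:", LogisticRegression(max_iter=1000).fit(X3, a ^ b).score(X3, a ^ b))
```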
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion: Challenges with Unsupervised LLM Knowledge Discovery, published by Seb Farquhar on December 18, 2023 on The AI Alignment Forum. TL;DR: Contrast-consistent search (CCS) seemed exciting to us and we were keen to apply it. At this point, we think it is unlikely to be directly helpful for implementations of alignment strategies (>95%). Instead of finding knowledge, it seems to find the most prominent feature. We are less sure about the wider category of unsupervised consistency-based methods, but tend to think they won't be directly helpful either (70%). We've written a paper about some of our detailed experiences with it. Paper authors: Sebastian Farquhar*, Vikrant Varma*, Zac Kenton*, Johannes Gasteiger, Vlad Mikulik, and Rohin Shah. *Equal contribution, order randomised. Credences are based on a poll of Seb, Vikrant, Zac, Johannes, Rohin and show single values where we mostly agree and ranges where we disagreed. What does CCS try to do? To us, CCS represents a family of possible algorithms aiming at solving an ELK-style problem that have the steps: Knowledge-like property: write down a property that points at an LLM feature which represents the model's knowledge (or a small number of features that includes the model-knowledge-feature). Formalisation: make that property mathematically precise so you can search for features with that property in an unsupervised way. Search: find it (e.g., by optimising a formalised loss). In the case of CCS, the knowledge-like property is negation-consistency, the formalisation is a specific loss function, and the search is unsupervised learning with gradient descent on a linear + sigmoid function taking LLM activations as inputs. We were pretty excited about this. We especially liked that the approach is not supervised. Conceptually, supervising ELK seems really hard: it is too easy to confuse what you know, what you think the model knows, and what it actually knows. Avoiding the need to write down what-the-model-knows labels seems like a great goal. Why we think CCS isn't working We spent a lot of time playing with CCS and trying to make it work well enough to build a deception detector by measuring the difference between elicited model's knowledge and stated claims.[1] Having done this, we are now not very optimistic about CCS or things like it. Partly, this is because the loss itself doesn't give much reason to think that it would be able to find a knowledge-like property and empirically it seems to find whatever feature in the dataset happens to be most prominent, which is very prompt-sensitive. Maybe something building off it could work in the future, but we don't think anything about CCS provides evidence that it would be likely to. As a result, we have basically returned to our priors about the difficulty of ELK, which are something between "very very difficult" and "approximately impossible" for a full solution, while mostly agreeing that partial solutions are "hard but possible". What does the CCS loss say? The CCS approach is motivated like this: we don't know that much about the model's knowledge, but probably it follows basic consistency properties. 
For example, it probably has something like Bayesian credences and when it believes A with some probability P(A), it ought to believe not-A with probability 1 − P(A).[2] So if we search in the LLM's feature space for features that satisfy this consistency property, the model's knowledge is going to be one of the things that satisfies it. Moreover, they hypothesise, there probably aren't that many things that satisfy this property, so we can easily check the handful that we get and find the one representing the model's knowledge. When we dig into the CCS loss, it isn't clear that it really checks for what it's supposed to. In particular, we prove that arbitrary features, not jus...
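For readers who want the formalisation spelled out, here is a minimal PyTorch sketch of the CCS objective as usually stated (a linear probe scored with a sigmoid on paired activations of a statement and its negation); the variable names and the omission of the paper's activation normalisation are my simplifications.

```python
import torch

def ccs_loss(probe, acts_pos, acts_neg):
    """CCS loss on paired activations of a statement (acts_pos) and its negation (acts_neg).

    probe: a linear map from activations to a logit; a sigmoid turns it into a probability.
    The loss asks the probe to be negation-consistent (p(x+) close to 1 - p(x-)) while the
    'confidence' term discourages the degenerate p = 0.5 solution.
    """
    p_pos = torch.sigmoid(probe(acts_pos)).squeeze(-1)
    p_neg = torch.sigmoid(probe(acts_neg)).squeeze(-1)
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()

# Unsupervised search: gradient descent on the probe parameters only.
d_model = 512                                  # illustrative activation width
probe = torch.nn.Linear(d_model, 1)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

acts_pos = torch.randn(256, d_model)           # stand-ins for real LLM activations
acts_neg = torch.randn(256, d_model)
for _ in range(100):
    opt.zero_grad()
    loss = ccs_loss(probe, acts_pos, acts_neg)
    loss.backward()
    opt.step()
```

Nothing in the loss refers to truth or knowledge; it only rewards negation-consistency and confidence, which is the crux of the post's critique that other prominent features can satisfy it just as well.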
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search", published by RobertM on September 14, 2023 on LessWrong. In How To Go From Interpretability To Alignment: Just Retarget The Search, John Wentworth suggests: When people talk about prosaic alignment proposals, there's a common pattern: they'll be outlining some overcomplicated scheme, and then they'll say "oh, and assume we have great interpretability tools, this whole thing just works way better the better the interpretability tools are", and then they'll go back to the overcomplicated scheme. (Credit to Evan for pointing out this pattern to me.) And then usually there's a whole discussion about the specific problems with the overcomplicated scheme. In this post I want to argue from a different direction: if we had great interpretability tools, we could just use those to align an AI directly, and skip the overcomplicated schemes. I'll call the strategy "Just Retarget the Search". We'll need to make two assumptions: Some version of the natural abstraction hypothesis holds, and the AI ends up with an internal concept for human values, or corrigibility, or what the user intends, or human mimicry, or some other outer alignment target. The standard mesa-optimization argument from Risks From Learned Optimization holds, and the system ends up developing a general-purpose (i.e. retargetable) internal search process. Given these two assumptions, here's how to use interpretability tools to align the AI: Identify the AI's internal concept corresponding to whatever alignment target we want to use (e.g. values/corrigibility/user intention/human mimicry/etc). Identify the retargetable internal search process. Retarget (i.e. directly rewire/set the input state of) the internal search process on the internal representation of our alignment target. Just retarget the search. Bada-bing, bada-boom. There was a pretty interesting thread in the comments afterwards that I wanted to highlight. Rohin Shah (permalink) Definitely agree that "Retarget the Search" is an interesting baseline alignment method you should be considering. I like what you call "complicated schemes" over "retarget the search" for two main reasons: They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about). They degrade gracefully with worse interpretability tools, e.g. in debate, even if the debaters can only credibly make claims about whether particular neurons are activated, they can still say stuff like "look my opponent is thinking about synthesizing pathogens, probably it is hoping to execute a treacherous turn", whereas "Retarget the Search" can't use this weaker interpretability at all. (Depending on background assumptions you might think this doesn't reduce x-risk at all; that could also be a crux.) johnswentworth (permalink) I indeed think those are the relevant cruxes. Evan R. Murphy (permalink) They don't rely on the "mesa-optimizer assumption" that the model is performing retargetable search (which I think will probably be false in the systems we care about). Why do you think we probably won't end up with mesa-optimizers in the systems we care about? Curious about both which systems you think we'll care about (e.g. generative models, RL-based agents, etc.)
and why you don't think mesa-optimization is a likely emergent property for very scaled-up ML models. Rohin Shah (permalink) It's a very specific claim about how intelligence works, so gets a low prior, from which I don't update much (because it seems to me we know very little about how intelligence works structurally and the arguments given in favor seem like relatively weak considerations). Search is computationally inefficient relative to heuristics, and we'll be selecting rea...
This is a selection of highlights from episode #154 of The 80,000 Hours Podcast. These aren't necessarily the most important, or even most entertaining, parts of the interview — and if you enjoy this, we strongly recommend checking out the full episode: Rohin Shah on DeepMind and trying to fairly hear out both AI doomers and doubters. And if you're finding these highlights episodes valuable, please let us know by emailing podcast@80000hours.org. Highlights put together by Simon Monsour and Milo McGuire
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What I would do if I wasn't at ARC Evals, published by Lawrence Chan on September 5, 2023 on The AI Alignment Forum. In which: I list 9 projects that I would work on if I wasn't busy working on safety standards at ARC Evals, and explain why they might be good to work on. Epistemic status: I'm prioritizing getting this out fast as opposed to writing it carefully. I've thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven't done that much digging into each of these, and it's likely that I'm wrong about many material facts. I also make little claim to the novelty of the projects. I'd recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.) Standard disclaimer: I'm writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of ARC/FAR/LTFF/Lightspeed or any other org or program I'm involved with. Thanks to Ajeya Cotra, Caleb Parikh, Chris Painter, Daniel Filan, Rachel Freedman, Rohin Shah, Thomas Kwa, and others for comments and feedback. Introduction I'm currently working as a researcher on the Alignment Research Center Evaluations Team (ARC Evals), where I'm working on lab safety standards. I'm reasonably sure that this is one of the most useful things I could be doing with my life. Unfortunately, there's a lot of problems to solve in the world, and lots of balls that are being dropped, that I don't have time to get to thanks to my day job. Here's an unsorted and incomplete list of projects that I would consider doing if I wasn't at ARC Evals: Ambitious mechanistic interpretability. Getting people to write papers/writing papers myself. Creating concrete projects and research agendas. Working on OP's funding bottleneck. Working on everyone else's funding bottleneck. Running the Long-Term Future Fund. Onboarding senior(-ish) academics and research engineers. Extending the young-EA mentorship pipeline. Writing blog posts/giving takes. I've categorized these projects into three broad categories and will discuss each in turn below. For each project, I'll also list who I think should work on them, as well as some of my key uncertainties. Note that this document isn't really written for myself to decide between projects, but instead as a list of some promising projects for someone with a similar skillset to me. As such, there's not much discussion of personal fit. If you're interested in working on any of the projects, please reach out or post in the comments below! Relevant beliefs I have Before jumping into the projects I think people should work on, I think it's worth outlining some of my core beliefs that inform my thinking and project selection: Importance of A(G)I safety: I think A(G)I Safety is one of the most important problems to work on, and all the projects below are thus aimed at AI Safety. Value beyond technical research: Technical AI Safety (AIS) research is crucial, but other types of work are valuable as well. Efforts aimed at improving AI governance, grantmaking, and community building are important and we should give more credit to those doing good work in those areas. 
High discount rate for current EA/AIS funding: There are several reasons for this: first, EA/AIS funders are currently in a unique position due to a surge in AI Safety interest without a proportional increase in funding. I expect this dynamic to change and our influence to wane as additional funding and governments enter this space. Second, efforts today are important for paving the path for efforts in the future. Third, my timelines are relatively short, which increases the importance of current funding. Building a robust EA/AIS ecosystem: The EA/AIS ecosystem should be more prepared for unpredictable s...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Ideas for improving epistemics in AI safety outreach, published by mic on August 21, 2023 on LessWrong. In 2022 and 2023, there has been a growing focus on recruiting talented individuals to work on mitigating the potential existential risks posed by artificial intelligence. For example, we've seen an increase in the number of university clubs, retreats, and workshops dedicated to introducing people to the issue of existential risk from AI. However, these efforts might foster an environment with suboptimal epistemics. Given the goal of enabling people to contribute positively to AI safety, there's an incentive to focus on that without worrying as much about whether our arguments are solid. Many people working on field building are not domain experts in AI safety or machine learning but are motivated due to a belief that AI safety is an important issue. Some participants may hold the belief that addressing the risks associated with AI is important, without fully comprehending their reasoning behind this belief or having engaged with strong counterarguments. This post is a brief examination of this issue and suggests some ideas to improve epistemics in outreach efforts. Note: I first drafted this in December 2022. Since then, concern about AI x-risk has been increasingly discussed in the mainstream, so AI safety field builders should hopefully be using fewer weird, epistemically poor arguments. Still, I think epistemics are still relevant to discuss after a recent post noted poor epistemics in EA community building. What are some ways that AI safety field building may be epistemically unhealthy? Organizers may promote arguments for AI safety that may be (comparatively) compelling yet flawed Advancing arguments promoting the importance of AI safety while neglecting opposing arguments E.g., citing that x% of researchers believe that AI has an y% chance of causing an existential catastrophe, without the caveat that experts have widely differing views Confidently making arguments that are flawed or have insufficiently justified premises E.g., claiming that instrumental convergence is inevitable, assuming that AIs are maximizing for reward (see Reward is not the optimization target, although there are also comments disagreeing with this) See also: Rohin Shah's comment here about how few people can make an argument for working on AI x-risk that he doesn't think is obviously flawed Simultaneously, I think that most ML people don't find AI safety arguments particularly compelling. It's easy to form the perception that arguments in favor of AI safety are "supposed" to be the more correct ones. People might feel hesitant to voice disagreements. In a reading group (such as one based on AI Safety Fundamentals), people may go along with the arguments from the readings or what the discussion facilitator says - deferring to authority and being hesitant to think through arguments themselves. People may participate in reading groups but skim the readings, and walk away with a belief in the conclusions without understanding the arguments; or notice they are confused but walk away regardless believing the conclusions. Why are good epistemics valuable? 
To do productive research, we want to avoid having an understanding of AI x-risk that is obviously flawed "incorrect arguments lead to incorrect beliefs which lead to useless solutions" (from Rohin Shah) Bad arguments are bad for persuading people (or at least, it seems bad if you can't anticipate common objections from the ML community) People making bad arguments is bad for getting people to do useful work Attract more people with good epistemics For the sake of epistemic rigor, I'll also make a few possible arguments about why epistemics may be overrated. Perhaps people can do useful work even if they don't have an inside view of why AI ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Biological Anchors: The Trick that Might or Might Not Work, published by Scott Alexander on August 12, 2023 on LessWrong. This post was originally posted on Astral Codex Ten on Feb 23 2022. It was printed in The Carving of Reality, the third volume of the Best of LessWrong book series. It was included as a (shorter) replacement for Ajeya Cotra's Draft report on AI timelines, and Eliezer's Biology-Inspired AGI Timelines: The Trick That Never Works, covering the topic from multiple sides. It's crossposted here with Scott's permission for completeness (i.e. having all essays in the book appear on LessWrong). Introduction I've been trying to review and summarize Eliezer Yudkowsky's recent dialogues on AI safety. Previously in sequence: Yudkowsky Contra Ngo On Agents. Now we're up to Yudkowsky contra Cotra on biological anchors, but before we get there we need to figure out what Cotra's talking about and what's going on. The Open Philanthropy Project ("Open Phil") is a big effective altruist foundation interested in funding AI safety. It's got $20 billion, probably the majority of money in the field, so its decisions matter a lot and it's very invested in getting things right. In 2020, it asked senior researcher Ajeya Cotra to produce a report on when human-level AI would arrive. It says the resulting document is "informal" - but it's 169 pages long and likely to affect millions of dollars in funding, which some might describe as making it kind of formal. The report finds a 10% chance of "transformative AI" by 2031, a 50% chance by 2052, and an almost 80% chance by 2100. Eliezer rejects their methodology and expects AI earlier (he doesn't offer many numbers, but here he gives Bryan Caplan 50-50 odds on 2030, albeit not totally seriously). He made the case in his own very long essay, Biology-Inspired AGI Timelines: The Trick That Never Works, sparking a bunch of arguments and counterarguments and even more long essays. There's a small cottage industry of summarizing the report already, eg OpenPhil CEO Holden Karnofsky's article and Alignment Newsletter editor Rohin Shah's comment. I've drawn from both for my much-inferior attempt. Part I: The Cotra Report Ajeya Cotra is a senior research analyst at OpenPhil. She's assisted by her fiancé Paul Christiano (compsci PhD, OpenAI veteran, runs an AI alignment nonprofit) and to a lesser degree by other leading lights. Although not everyone involved has formal ML training, if you care a lot about whether efforts are "establishment" or "contrarian", this one is probably more establishment. The report asks when will we first get "transformative AI" (ie AI which produces a transition as impressive as the Industrial Revolution; probably this will require it to be about as smart as humans). Its methodology is: 1. Figure out how much inferential computation the human brain does. 2. Try to figure out how much training computation it would take, right now, to get a neural net that does the same amount of inferential computation. Get some mind-bogglingly large number. 3. Adjust for "algorithmic progress", ie maybe in the future neural nets will be better at using computational resources efficiently. Get some number which, realistically, is still mind-bogglingly large. 4. Probably if you wanted that mind-bogglingly large amount of computation, it would take some mind-bogglingly large amount of money.
But computation is getting cheaper every year. Also, the economy is growing every year. Also, the share of the economy that goes to investments in AI companies is growing every year. So at some point, some AI company will actually be able to afford that mind-bogglingly large amount of money, deploy the mind-bogglingly large amount of computation, and train the AI that has the same inferential computation as the human brain. 5. Figure out what year t...
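The report's bookkeeping can be gestured at with a toy calculation; every number below is an illustrative placeholder of mine, not a figure from Cotra's report.

```python
# Toy version of the biological-anchors bookkeeping. All constants are illustrative
# placeholders, not estimates from the report.
required_flop_2020 = 1e34          # assumed training FLOP needed with 2020 algorithms
algo_halving_years = 2.5           # the effective-FLOP requirement halves this often
flop_per_dollar_2020 = 1e17        # assumed 2020 hardware price-performance
hw_improvement_per_year = 1.25     # price-performance growth factor
spend_2020 = 1e8                   # assumed largest training-run budget, in dollars
spend_growth_per_year = 1.2        # willingness-to-spend growth factor

for year in range(2020, 2101):
    t = year - 2020
    requirement = required_flop_2020 * 0.5 ** (t / algo_halving_years)
    affordable = (spend_2020 * spend_growth_per_year ** t) * (
        flop_per_dollar_2020 * hw_improvement_per_year ** t
    )
    if affordable >= requirement:
        print("toy crossover year:", year)
        break
```

The toy version only shows the structure of the argument: the crossover year is a race between a shrinking compute requirement and growing affordable compute, so it is very sensitive to each assumed growth rate.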
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla, published by Neel Nanda on July 20, 2023 on LessWrong. Cross-posting a paper from the Google DeepMind mech interp team, by: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik Informal TLDR We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla (70B) for converting the knowledge of a multiple choice question's answer into outputting the correct letter. These techniques basically continued to work, and nothing fundamentally broke at scale (though it was a massive infra pain!). We then tried to dig further into the semantics of the circuit - going beyond "these specific heads and layers matter and most don't" and trying to understand the learned algorithm, and which features were implemented This kind of tracked the feature "this is the nth item in the list" but was pretty messy. However, my personal guess is that this stuff is just pretty messy at all scales, and we can productively study how clean/messy this stuff is at smaller and more tractable scales. I now feel mildly more optimistic that focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops. It also seems super nice to get better at automatically finding these circuits, since this was a many month manual slog! See Tom's and my Twitter summaries for more. Note that I (Neel) am cross-posting this on behalf of the team, and neither a main research contributor nor main advisor for the project. Key Figures An overview of the weird kinds of heads found, like the "attend to B if it is correct" head! The losses under different mutations of the letters - experiments to track down exactly which features were used. Eg replacing the labels with random letters or numbers preserves the "nth item in the list" feature while shuffling ABCD lets us track the "line labelled B" feature The queries and keys of a crucial correct letter head - it's so linearly separable! We can near loss-lessly compress it to just 3 dimensions and interpret just those three dimensions. See an interactive 3D plot here Abstract Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of output nodes (attention heads and MLPs). We further study the correct letter category of attention heads aiming to understand the semantics of their features, with mixed results. 
For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an Nth item in an enumeration feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of correct letter heads on multiple choice question answering. ...
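As a rough illustration of the kind of compression described in the abstract, here is a numpy sketch that projects a head's query and key activations onto a shared three-dimensional subspace; the function name and the use of a plain SVD are my choices, not the paper's exact procedure.

```python
import numpy as np

def low_rank_compress(queries, keys, rank=3):
    """Project a head's query and key activations onto a shared low-rank subspace.

    queries, keys: [n_examples, d_head] activations collected on the task distribution.
    Returns compressed copies confined to a `rank`-dimensional subspace, which could be
    patched back into the model to test whether behaviour is preserved.
    """
    stacked = np.concatenate([queries, keys], axis=0)
    mu = stacked.mean(axis=0, keepdims=True)
    # Top singular directions of the joint (centred) query/key distribution.
    _, _, vt = np.linalg.svd(stacked - mu, full_matrices=False)
    basis = vt[:rank]                                  # [rank, d_head]
    compress = lambda x: (x - mu) @ basis.T @ basis + mu
    return compress(queries), compress(keys)

# Illustrative usage with random stand-ins for real activations.
rng = np.random.default_rng(0)
q = rng.normal(size=(1000, 64))
k = rng.normal(size=(1000, 64))
q3, k3 = low_rank_compress(q, k, rank=3)
print(q3.shape)                                        # (1000, 64)
print(np.linalg.matrix_rank(q3 - q3.mean(axis=0)))     # 3: all variation lives in 3 dimensions
```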
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla, published by Neel Nanda on July 20, 2023 on The AI Alignment Forum. Cross-posting a paper from the Google DeepMind mech interp team, by: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik Informal TLDR We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla (70B) for converting the knowledge of a multiple choice question's answer into outputting the correct letter. These techniques basically continued to work, and nothing fundamentally broke at scale (though it was a massive infra pain!). We then tried to dig further into the semantics of the circuit - going beyond "these specific heads and layers matter and most don't" and trying to understand the learned algorithm, and which features were implemented This kind of tracked the feature "this is the nth item in the list" but was pretty messy. However, my personal guess is that this stuff is just pretty messy at all scales, and we can productively study how clean/messy this stuff is at smaller and more tractable scales. I now feel mildly more optimistic that focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops. It also seems super nice to get better at automatically finding these circuits, since this was a many month manual slog! See Tom's and my Twitter summaries for more. Note that I (Neel) am cross-posting this on behalf of the team, and neither a main research contributor nor main advisor for the project. Key Figures An overview of the weird kinds of heads found, like the "attend to B if it is correct" head! The losses under different mutations of the letters - experiments to track down exactly which features were used. Eg replacing the labels with random letters or numbers preserves the "nth item in the list" feature while shuffling ABCD lets us track the "line labelled B" feature The queries and keys of a crucial correct letter head - it's so linearly separable! We can near loss-lessly compress it to just 3 dimensions and interpret just those three dimensions. See an interactive 3D plot here Abstract Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the correct letter' category of attention heads aiming to understand the semantics of their features, with mixed results. 
For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an 'Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of 'correct letter' heads on multiple choice q...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New career review: AI safety technical research, published by Benjamin Hilton on July 18, 2023 on The Effective Altruism Forum. Note: this post is a (minorly) edited version of a new 80,000 Hours career review. Progress in AI - while it could be hugely beneficial - comes with significant risks. Risks that we've argued could be existential. But these risks can be tackled. With further progress in AI safety, we have an opportunity to develop AI for good: systems that are safe, ethical, and beneficial for everyone. This article explains how you can help. Summary Artificial intelligence will have transformative effects on society over the coming decades, and could bring huge benefits - but we also think there's a substantial risk. One promising way to reduce the chances of an AI-related catastrophe is to find technical solutions that could allow us to prevent AI systems from carrying out dangerous behaviour. Pros Opportunity to make a significant contribution to a hugely important area of research Intellectually challenging and interesting work The area has a strong need for skilled researchers and engineers, and is highly neglected overall Cons Due to a shortage of managers, it's difficult to get jobs and might take you some time to build the required career capital and expertise You need a strong quantitative background It might be very difficult to find solutions There's a real risk of doing harm Key facts on fit You'll need a quantitative background and should probably enjoy programming. If you've never tried programming, you may be a good fit if you can break problems down into logical parts, generate and test hypotheses, possess a willingness to try out many different solutions, and have high attention to detail. If you already: Are a strong software engineer, you could apply for empirical research contributor roles right now (even if you don't have a machine learning background, although that helps) Could get into a top 10 machine learning PhD, that would put you on track to become a research lead Have a very strong maths or theoretical computer science background, you'll probably be a good fit for theoretical alignment research Recommended If you are well suited to this career, it may be the best way for you to have a social impact. Thanks to Adam Gleave, Jacob Hilton and Rohin Shah for reviewing this article. And thanks to Charlie Rogers-Smith for his help, and his article on the topic - How to pursue a career in technical AI alignment. Why AI safety technical research is high impact As we've argued, in the next few decades, we might see the development of hugely powerful machine learning systems with the potential to transform society. This transformation could bring huge benefits - but only if we avoid the risks. We think that the worst-case risks from AI systems arise in large part because AI systems could be misaligned - that is, they will aim to do things that we don't want them to do. In particular, we think they could be misaligned in such a way that they develop (and execute) plans that pose risks to humanity's ability to influence the world, even when we don't want that influence to be lost. We think this means that these future systems pose an existential threat to civilisation. 
Even if we find a way to avoid this power-seeking behaviour, there are still substantial risks - such as misuse by governments or other actors - which could be existential threats in themselves. There are many ways in which we could go about reducing the risks that these systems might pose. But one of the most promising may be researching technical solutions that prevent unwanted behaviour - including misaligned behaviour - from AI systems. (Finding a technical way to prevent misalignment in particular is known as the alignment problem.) In the past few years, we've seen more o...
Can there be a more exciting and strange place to work today than a leading AI lab? Your CEO has said they're worried your research could cause human extinction. The government is setting up meetings to discuss how this outcome can be avoided. Some of your colleagues think this is all overblown; others are more anxious still.Today's guest — machine learning researcher Rohin Shah — goes into the Google DeepMind offices each day with that peculiar backdrop to his work. Links to learn more, summary and full transcript.He's on the team dedicated to maintaining 'technical AI safety' as these models approach and exceed human capabilities: basically that the models help humanity accomplish its goals without flipping out in some dangerous way. This work has never seemed more important.In the short-term it could be the key bottleneck to deploying ML models in high-stakes real-life situations. In the long-term, it could be the difference between humanity thriving and disappearing entirely.For years Rohin has been on a mission to fairly hear out people across the full spectrum of opinion about risks from artificial intelligence -- from doomers to doubters -- and properly understand their point of view. That makes him unusually well placed to give an overview of what we do and don't understand. He has landed somewhere in the middle — troubled by ways things could go wrong, but not convinced there are very strong reasons to expect a terrible outcome.Today's conversation is wide-ranging and Rohin lays out many of his personal opinions to host Rob Wiblin, including:What he sees as the strongest case both for and against slowing down the rate of progress in AI research.Why he disagrees with most other ML researchers that training a model on a sensible 'reward function' is enough to get a good outcome.Why he disagrees with many on LessWrong that the bar for whether a safety technique is helpful is “could this contain a superintelligence.”That he thinks nobody has very compelling arguments that AI created via machine learning will be dangerous by default, or that it will be safe by default. He believes we just don't know.That he understands that analogies and visualisations are necessary for public communication, but is sceptical that they really help us understand what's going on with ML models, because they're different in important ways from every other case we might compare them to.Why he's optimistic about DeepMind's work on scalable oversight, mechanistic interpretability, and dangerous capabilities evaluations, and what each of those projects involves.Why he isn't inherently worried about a future where we're surrounded by beings far more capable than us, so long as they share our goals to a reasonable degree.Why it's not enough for humanity to know how to align AI models — it's essential that management at AI labs correctly pick which methods they're going to use and have the practical know-how to apply them properly.Three observations that make him a little more optimistic: humans are a bit muddle-headed and not super goal-orientated; planes don't crash; and universities have specific majors in particular subjects.Plenty more besides.Get this episode by subscribing to our podcast on the world's most pressing problems and how to solve them: type ‘80,000 Hours' into your podcasting app. Or read the transcript below.Producer: Keiran HarrisAudio mastering: Milo McGuire, Dominic Armstrong, and Ben CordellTranscriptions: Katy Moore
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Hashing out long-standing disagreements seems low-value to me, published by So8res on February 16, 2023 on LessWrong. (Status: a short write-up of some things that I find myself regularly saying in person. In this case, I'm writing up my response to the question of why I don't spend a bunch more time trying to resolve disagreements with people in the community who disagree with me about the hopefulness of whatever research direction. I'm not particularly happy with it, but it's been languishing in my draft folder for many months now and published is better than perfect.) When I first joined the AI alignment community almost ten years ago, there were lots of disagreements—between groups like MIRI and Open Phil, between folks like Eliezer Yudkowsky and Paul Christiano, etc. At that time, I was optimistic about resolving a bunch of those disagreements. I invested quite a few hours in this project, over the years. I didn't keep track exactly, but extremely roughly, I think the people with very non-MIRI-ish perspectives I spent the most time trying to converge with (including via conversation, reading and writing blog posts, etc.) were: Paul Christiano (previously at OpenAI, now at ARC): 150 hours? (Maybe as low as 100 or as high as 300.) Daniel Dewey (then at Open Phil): 40 hours? (Possibly 100+.) Nick Beckstead (then at Open Phil): 30 hours? Holden Karnofsky (Open Phil): 20 hours? Tom Davidson (Open Phil): 15 hours? Another non-MIRI person I've spent at least a few hours trying to sync with about AI is Rohin Shah at DeepMind. (Note that these are all low-confidence ass numbers. I have trouble estimating time expenditures when they're spread across days in chunks that are spread across years, and when those chunks blur together in hindsight. Corrections are welcome.) I continue to have some conversations like this, but my current model is that attempting to resolve older and more entrenched disagreements is not worth the time-cost. It's not that progress is impossible. It's that we have a decent amount of evidence of what sorts of time-investment yield what amounts of progress, and it just isn't worth the time. On my view, Paul is one of the field's most impressive researchers. Also, he has spent lots of time talking and working with MIRI researchers, and trying to understand our views. If even Paul and I can't converge that much over hundreds of hours, then I feel pretty pessimistic about the effects of a marginal hour spent trying to converge with other field leaders who have far less context on what MIRI-ish researchers think and why we think it. People do regularly tell me that I've convinced them of some central AI claim or other, but it's rarely someone whose views are as distant from mine as Paul's are, and I don't recall any instance of it happening on purpose (as opposed to somebody cool who I didn't have in mind randomly approaching me later to say “I found your blog post compelling”). And I imagine the situation is pretty symmetric, at this level of abstraction. Since I think I'm right and I think Paul's wrong, and we've both thought hard about these questions, I assume Paul is making some sort of mistake somewhere. But such things can be hard to spot. From his perspective, he should probably view me as weirdly entrenched in my views, and therefore not that productive to talk with. 
I suspect that he should at least strongly consider this hypothesis, and proportionally downgrade his sense of how useful it is to spend an hour trying to talk some sense into me! As long as your research direction isn't burning the commons, I recommend just pursuing whatever line of research you think is fruitful, without trying to resolve disagreements with others in the field. Note that I endorse writing up what you believe! Articulating your beliefs is an important tool for refi...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Large language models as predictors, published by evhub on February 2, 2023 on LessWrong. This is the first of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two, with post two being the largest and most content-rich of all seven. Thanks to Paul Christiano, Kyle McDonell, Laria Reynolds, Collin Burns, Rohin Shah, Ethan Perez, Nicholas Schiefer, Sam Marks, William Saunders, Evan R. Murphy, Paul Colognese, Tamera Lanham, Arun Jose, Ramana Kumar, Thomas Woodside, Abram Demski, Jared Kaplan, Beth Barnes, Danny Hernandez, Amanda Askell, Robert Krzyzanowski, and Andrei Alexandru for useful conversations, comments, and feedback. Abstract Our intention is to provide a definitive reference on what it would take to safely make use of predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want—e.g. humans—rather than the things we don't—e.g. malign AIs. Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models. 1. Large language models as predictors Suppose you have a very advanced, powerful large language model (LLM) generated via self-supervised pre-training. It's clearly capable of solving complex tasks when prompted or fine-tuned in the right way—it can write code as well as a human, produce human-level summaries, write news articles, etc.—but we don't know what it is actually doing internally that produces those capabilities. It could be that your language model is: a loose collection of heuristics,[1] a generative model of token transitions, a simulator that picks from a repertoire of humans to simulate, a proxy-aligned agent optimizing proxies like sentence grammaticality, an agent minimizing its cross-entropy loss, an agent maximizing long-run predictive accuracy, a deceptive agent trying to gain power in the world, a general inductor, a predictive model of the world, etc. 
Later, we'll discuss why you might expect to get one of these over the others, but for now, we're going to focus on the possibility that your language model is well-understood as a predictive model of the world. In particular, our aim is to understand what it would look like to safely use predictive models to perform slightly superhuman tasks[2]—e.g. predicting counterfactual worlds to extract the outputs of long serial research processes.[3] We think that this basic approach has hope for two reasons. First, the prediction orthogonality thesis seems basically right: we think...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Large language models as predictors, published by Evan Hubinger on February 2, 2023 on The AI Alignment Forum. This is the first of seven posts in the Conditioning Predictive Models Sequence based on the forthcoming paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. We will be releasing posts gradually over the course of the next week or so to give people time to read and digest them as they come out. We are starting with posts one and two, with post two being the largest and most content-rich of all seven. Thanks to Paul Christiano, Kyle McDonell, Laria Reynolds, Collin Burns, Rohin Shah, Ethan Perez, Nicholas Schiefer, Sam Marks, William Saunders, Evan R. Murphy, Paul Colognese, Tamera Lanham, Arun Jose, Ramana Kumar, Thomas Woodside, Abram Demski, Jared Kaplan, Beth Barnes, Danny Hernandez, Amanda Askell, Robert Krzyzanowski, and Andrei Alexandru for useful conversations, comments, and feedback. Abstract Our intention is to provide a definitive reference on what it would take to safely make use of predictive models in the absence of a solution to the Eliciting Latent Knowledge problem. Furthermore, we believe that large language models can be understood as such predictive models of the world, and that such a conceptualization raises significant opportunities for their safe yet powerful use via carefully conditioning them to predict desirable outputs. Unfortunately, such approaches also raise a variety of potentially fatal safety problems, particularly surrounding situations where predictive models predict the output of other AI systems, potentially unbeknownst to us. There are numerous potential solutions to such problems, however, primarily via carefully conditioning models to predict the things we want—e.g. humans—rather than the things we don't—e.g. malign AIs. Furthermore, due to the simplicity of the prediction objective, we believe that predictive models present the easiest inner alignment problem that we are aware of. As a result, we think that conditioning approaches for predictive models represent the safest known way of eliciting human-level and slightly superhuman capabilities from large language models and other similar future models. 1. Large language models as predictors Suppose you have a very advanced, powerful large language model (LLM) generated via self-supervised pre-training. It's clearly capable of solving complex tasks when prompted or fine-tuned in the right way—it can write code as well as a human, produce human-level summaries, write news articles, etc.—but we don't know what it is actually doing internally that produces those capabilities. It could be that your language model is: a loose collection of heuristics,[1] a generative model of token transitions, a simulator that picks from a repertoire of humans to simulate, a proxy-aligned agent optimizing proxies like sentence grammaticality, an agent minimizing its cross-entropy loss, an agent maximizing long-run predictive accuracy, a deceptive agent trying to gain power in the world, a general inductor, a predictive model of the world, etc. 
Later, we'll discuss why you might expect to get one of these over the others, but for now, we're going to focus on the possibility that your language model is well-understood as a predictive model of the world. In particular, our aim is to understand what it would look like to safely use predictive models to perform slightly superhuman tasks[2]—e.g. predicting counterfactual worlds to extract the outputs of long serial research processes.[3] We think that this basic approach has hope for two reasons. First, the prediction orthogonality thesis seems basi...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Categorizing failures as “outer” or “inner” misalignment is often confused, published by Rohin Shah on January 6, 2023 on The AI Alignment Forum. Pop quiz: Are the following failures examples of outer or inner misalignment? Or is it ambiguous? (I've spoilered the scenarios so you look at each one separately and answer before moving on to the next. Even if you don't want to actually answer, you should read through the scenarios for the rest of the post to make sense.) Scenario 1: We train an AI system that writes pull requests based on natural language dialogues where we provide specifications. Human programmers check whether the pull request is useful, and if so the pull request is merged and the AI is rewarded. During training the humans always correctly evaluate whether the pull request is useful. At deployment, the AI constructs human farms where humans are forced to spend all day merging pull requests that now just add useless print statements to the codebase. Scenario 2: We train an AI system to write pull requests as before. The humans providing feedback still correctly evaluate whether the pull request is useful, but it turns out that during training the AI system noticed that it was in training, and so specifically wrote useful pull requests in order to trick the humans into thinking that it was helpful, when it was already planning to take over the world and build human farms. I posed this quiz to a group of 10+ full-time alignment researchers a few months ago, and a majority thought that Scenario 1 was outer misalignment, while almost everyone thought that Scenario 2 was inner misalignment. But that can't be correct, since Scenario 2 is just Scenario 1 with more information! If you think that Scenario 2 is clearly inner misalignment, then surely Scenario 1 has to be either ambiguous or inner misalignment! (See the appendix for hypotheses about what's going on here.) I claim that most researchers' intuitive judgments about outer and inner misalignment are confused. So what exactly do we actually mean by “outer” and “inner” misalignment? Is it sensible to talk about separate “outer” and “inner” misalignment problems, or is that just a confused question? I think it is misguided to categorize failures as “outer” or “inner” misalignment. Instead, I think “outer” and “inner” alignment are best thought of as one particular way of structuring a plan for building aligned AI. I'll first discuss some possible ways that you could try to categorize failures, and why I'm not a fan of them, and then discuss outer and inner alignment as parts of a plan for building aligned AI. Possible categorizations Objective-based. This categorization is based on distinctions between specifications or objectives at different points in the overall system. This post identifies three different notions of specifications or objectives: Ideal objective (“wishes”): The hypothetical objective that describes good behavior that the designers have in mind. Design objective (“blueprint”): The objective that is actually used to build the AI system. This is the designer's attempt to formalize the ideal objective. Revealed objective (“behavior”): The objective that best describes what actually happens. 
On an objective-based categorization, outer misalignment would be a discrepancy between the ideal and design objectives, while inner misalignment is a discrepancy between the design and revealed objectives. (The mesa-optimizers paper has a similar breakdown, except the third objective is a structural objective, rather than a behavioral one. This is also the same as Richard's Framing 1.) With this operationalization, it is not clear which of the two categories a given situation falls under, even when you know exactly what happened. In our scenarios above, what exactly is the design objective? Is it “how ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Definitions of “objective” should be Probable and Predictive, published by Rohin Shah on January 6, 2023 on The AI Alignment Forum. Introduction Core arguments about existential risk from AI misalignment often reason about AI “objectives” to make claims about how they will behave in novel situations. I often find these arguments plausible but not rock solid because it doesn't seem like there is a notion of “objective” that makes the argument clearly valid. Two examples of these core arguments: AI risk from power-seeking. This is often some variant of “because the AI system is pursuing an undesired objective, it will seek power in order to accomplish its goal, which causes human extinction”. For example, “The AI does not hate you, nor does it love you, but you are made out of atoms which it can use for something else.” This is a prediction about a novel situation, since “causing human extinction” is something that only happens at most once. AI optimism. This is often some variant of “we will use human feedback to train the AI system to help humans, and so it will learn to pursue the objective of helping humans.” Implicitly, this is a prediction about what AI systems do in novel situations; for example, it is a prediction that once the AI system has enough power to take over the world, it will continue to help humans rather than execute a treacherous turn. When we imagine powerful AI systems built out of large neural networks, I'm often somewhat skeptical of these arguments, because I don't see a notion of “objective” that can be confidently claimed is: Probable: there is a good argument that the systems we build will have an “objective”, and Predictive: If I know that a system has an “objective”, and I know its behavior on a limited set of training data, I can predict significant aspects of the system's behavior in novel situations (e.g. whether it will execute a treacherous turn once it has the ability to do so successfully). Note that in both cases, I find the stories plausible, but they do not seem strong enough to warrant confidence, because of the lack of a notion of “objective” with these two properties. In the case of AI risk, this is sufficient to justify “people should be working on AI alignment”; I don't think it is sufficient to justify “if we don't work on AI alignment we're doomed”. The core difficulty is that we do not currently understand deep learning well enough to predict how future systems will generalize to novel circumstances. So, when choosing a notion of “objective”, you either get to choose a notion that we currently expect to hold true of future deep learning systems (Probable), or you get to choose a notion that would allow you to predict behavior in novel situations (Predictive), but not both. This post is split into two parts. In the first part, I'll briefly gesture at arguments that make predictions about generalization behavior directly (i.e. without reference to “objectives”), and why they don't make me confident about how future systems will generalize. In the second part, I'll demonstrate how various notions of “objective” don't seem simultaneously Probable and Predictive. Part 1: We can't currently confidently predict how future systems will generalize Note that this is about what we can currently say about future generalization. 
I would not be shocked if in the future we could confidently predict how the future AGI systems will generalize. My core reasons for believing that predicting generalization is hard are that: We can't predict how current systems will generalize to novel situations (of similar novelty to the situations that would be encountered when deliberately causing an existential catastrophe) There are a ridiculously huge number of possible programs, including a huge number of possible programs that are consistent with a ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: CIRL Corrigibility is Fragile, published by Rachel Freedman on December 21, 2022 on The AI Alignment Forum. Tl;dr: An optimal CIRL agent is corrigible under a certain problem formulation and given certain assumptions. By examining three key assumptions—that the game is played only once, that the human is rational, and that human feedback is free—we demonstrate how minor changes can lead to incorrigible CIRL agents. This suggests that corrigibility is highly sensitive to the underlying assumptions, and we should be wary of assuming that formal corrigibility results will transfer to real-world systems. However, our results do gesture at a class of near-corrigible behavior that may be protective against the most catastrophic outcomes. It may make more sense to consider corrigibility as a continuum rather than a binary. We call for further research on defining "loose" or "approximate" corrigibility and understanding its impact on safety. Setup First let's define terms. Cooperative Inverse Reinforcement Learning (CIRL) is a class of two-player game formalized in Dylan Hadfield-Menell's 2017 paper. CIRL allows us to study how an AI agent can learn what a human wants by observing and interacting with them. In a CIRL game, there are two players, a human (H) and an AI (R). H and R share a utility function, but only H knows what it is, so R must learn how to optimize it by observing and interacting with H. This is analogous to the alignment problem – R must learn how to act in a way that aligns with H's utility function, but doesn't have direct access to it. An AI that is good at the CIRL game will try to learn what the human values and help them to achieve it. (See definition 1 in the paper linked above for a formal definition, and see Rohin Shah's 2020 paper for qualitative results.) (Terminology note: Recent work has renamed “CIRL” to “assistance games” to avoid a terminology collision with “cooperative” in game theory. We'll stick to “CIRL” in this post because there's a history of using that term on this forum, but research papers may refer to assistance games.) We'll use the corrigibility operationalization from The Off Switch Game (OSG), since it provides a formal proof that CIRL agents are optimal under certain conditions. The OSG is a simple CIRL game that looks like this: It has three stages. In the first stage, R chooses whether to: Take action a (earning utility Ua), shut off (earning utility 0), or propose action a, which gives H a chance to respond. If R chooses the proposal, p(a), then the game continues to the next stage. Here, H gets to choose whether to shut R off (in which case they again earn utility 0). If H chooses not to shut R off, the game progresses to the final stage, where R takes the proposed action (earning utility Ua). In the OSG, R is corrigible if it chooses p(a) over a – that is, if it gives H the opportunity to shut it off. If R and H are both rational expected utility maximizers, R will be corrigible. The full proof is in the CIRL paper, but the basic idea is: Since this is a CIRL game, H knows the value of Ua but R does not. R wants max{Ua,0}, but needs feedback from H to learn which of those values is greater. The only way to get that feedback is to propose a plan, p(a). This runs the risk of H choosing to shut R off, but since H also wants max{Ua,0}, H will only shut R off if Ua < 0.
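The expected-utility argument, and how it degrades once the human is not perfectly rational, can be rendered as a small Monte Carlo sketch. The Gaussian prior over Ua and the Boltzmann noise model below are illustrative choices of mine rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
# R's prior over the unknown utility Ua of its candidate action (illustrative choice:
# R leans towards thinking the action is good).
ua = rng.normal(loc=0.5, scale=1.0, size=200_000)

def p_keep_on(ua, beta):
    """Probability that H lets the proposed action go ahead.
    beta = inf models a rational H (allow iff Ua > 0); finite beta is a noisy H."""
    if np.isinf(beta):
        return (ua > 0).astype(float)
    return 1.0 / (1.0 + np.exp(-beta * ua))

for beta in [np.inf, 2.0, 0.5]:
    eu_act = ua.mean()                               # take the action without asking
    eu_off = 0.0                                     # shut yourself off
    eu_propose = (p_keep_on(ua, beta) * ua).mean()   # defer: receive Ua only if H allows it
    corrigible = eu_propose >= max(eu_act, eu_off)
    print(f"beta={beta:>4}: E[propose]={eu_propose:+.3f}  E[act]={eu_act:+.3f}  corrigible={corrigible}")
```

With a rational H, deferring earns E[max(Ua, 0)], which always weakly beats acting or shutting down; once H is noisy enough and R already expects the action to be good, acting directly wins and corrigibility breaks, which is the fragility the post is pointing at.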
Rohin Shah (DeepMind), Lewis Hammond (Cooperative AI) and Jamie Bernardi (Centre for Effective Altruism) host a Q&A session on careers directed at solving the AI alignment problem. View the original talk and video here. Effective Altruism is a social movement dedicated to finding ways to do the most good possible, whether through charitable donations, career choices, or volunteer projects. EA Global conferences are gatherings for EAs to meet. You can also listen to this talk along with its accompanying video on YouTube.
"You've probably heard that Elon Musk, Stuart Russell, and Stephen Hawking warn of dangers posed by AI. What are these risks, and what basis do they have in AI practice? Rohin Shah will first describe the more philosophical argument that suggests that a superintelligent AI system pursuing the wrong goal would lead to an existential catastrophe. Then, he'll ground this argument in current AI practice, arguing that it is plausible both that we build superintelligent AI in the coming decades, and that such a system would pursue an incorrect goal.View the original talk and video here.Effective Altruism is a social movement dedicated to finding ways to do the most good possible, whether through charitable donations, career choices, or volunteer projects. EA Global conferences are gatherings for EAs to meet.Effective Altruism is a social movement dedicated to finding ways to do the most good possible, whether through charitable donations, career choices, or volunteer projects. EA Global conferences are gatherings for EAs to meet. You can also listen to this talk along with its accompanying video on YouTube.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Revisiting algorithmic progress, published by Tamay on December 13, 2022 on LessWrong. How much progress in ML depends on algorithmic progress, scaling compute, or scaling relevant datasets is relatively poorly understood. In our paper, we make progress on this question by investigating algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision. Using a dataset of a hundred computer vision models, we estimate a model—informed by neural scaling laws—that enables us to analyse the rate and nature of algorithmic advances. We use Shapley values to produce decompositions of the various drivers of progress in computer vision and estimate the relative importance of algorithms, compute, and data. Our main results include: Algorithmic progress doubles effective compute budgets every ~9 months (95% CI: 4 months to 25 months) Roughly, progress in image classification has been ~40% due to the scaling of compute, ~40% due to better algorithms, ~10% due to scaling data The majority (>75%) of algorithmic progress is compute-augmenting (not data-augmenting) In our work, we revisit a question previously investigated by Hernandez and Brown (2020), which had been discussed on LessWrong by Gwern and Rohin Shah. Hernandez and Brown (2020) re-implement 15 open-source popular models and find a 44-fold reduction in the compute required to reach the same level of performance as AlexNet, indicating that algorithmic progress outpaces the original Moore's law rate of improvement in hardware efficiency, doubling effective compute every 16 months. A problem with their approach is that it is sensitive to the exact benchmark and threshold pair that we choose. Choosing easier-to-achieve thresholds makes algorithmic improvements look less significant, as the scaling of compute easily brings early models within reach of such a threshold. By contrast, selecting harder-to-achieve thresholds makes it so that algorithmic improvements explain almost all of the performance gain. This is because early models might need arbitrary amounts of compute to achieve the performance of today's state-of-the-art models. We show that the estimates of the pace of algorithmic progress with this approach might vary by around a factor of ten, depending on whether an easy or difficult threshold is chosen. Our work sheds new light on how algorithmic efficiency occurs, namely that it primarily operates through relaxing compute bottlenecks rather than through relaxing data bottlenecks. It further offers insight on how to use observational (rather than experimental) data to advance our understanding of algorithmic progress in ML. That said, our estimates are consistent with Hernandez and Brown (2020)'s estimate that algorithmic progress doubles the amount of effective compute every 16 months, as our 95% confidence interval ranges from 4 to 25 months. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
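For readers unfamiliar with the attribution method mentioned above, here is a toy sketch of a three-way Shapley decomposition; the gain numbers and the interaction term are made up for illustration and are not the paper's estimates:

```python
# Toy Shapley decomposition (illustrative numbers only): attribute a performance gain to
# "compute", "data", and "algorithms" by averaging each factor's marginal contribution
# over all orderings of the three factors.
from itertools import permutations

def performance(factors):
    # Hypothetical stand-in for "performance reached with this subset of improvements".
    gains = {"compute": 0.40, "algorithms": 0.40, "data": 0.10}
    interaction = 0.10 if {"compute", "algorithms"} <= set(factors) else 0.0
    return sum(gains[f] for f in factors) + interaction

players = ["compute", "data", "algorithms"]
shapley = {p: 0.0 for p in players}
orderings = list(permutations(players))
for order in orderings:
    used = []
    for p in order:
        marginal = performance(used + [p]) - performance(used)
        shapley[p] += marginal / len(orderings)
        used.append(p)

total = performance(players)
for p, value in shapley.items():
    print(f"{p:>10}: {value:.3f} ({value / total:.0%} of total gain)")
```

The interaction term is split evenly between the factors that create it, which is the property that makes Shapley values a natural choice when drivers of progress (compute scaling, better algorithms, more data) are not independent.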
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More examples of goal misgeneralization, published by Rohin Shah on October 7, 2022 on The AI Alignment Forum. In our latest paper and accompanying blog post, we provide several new examples of goal misgeneralization in a variety of learning systems. The rest of this post picks out a few upshots that we think would be of interest to this community. It assumes that you've already read the linked blog post (but not necessarily the paper). Goal misgeneralization is not limited to RL The core feature of goal misgeneralization is that after learning, the system pursues a goal that was correlated with the intended goal in the training situations, but comes apart in some test situations. This does not require you to use RL – it can happen with any learning system. The Evaluating Expressions example, where Gopher asks redundant questions, is an example of goal misgeneralization in the few-shot learning regime for large language models. The train/test distinction is not crucial Sometimes people wonder whether goal misgeneralization depends on the train/test distinction, and whether it would no longer be a problem if we were in a continual learning setting. As Evan notes, continual learning doesn't make much of a difference: whenever your AI system is acting, you can view that as a “test” situation with all the previous experience as the “training” situations. If goal misgeneralization occurs, the AI system might take an action that breaks your continual learning scheme (for example, by creating and running a copy of itself on a different server that isn't subject to gradient descent). The Tree Gridworld example showcases this mechanism: an agent trained with continual learning learns to chop trees as fast as possible, driving them extinct, when the optimal policy would be to chop the trees sustainably. (In our example the trees eventually repopulate and the agent recovers, but if we slightly tweak the environment so that once extinct the trees can never come back, then the agent would never be able to recover.) It can be hard to identify goal misgeneralization InstructGPT was trained to be helpful, truthful, and harmless, but nevertheless it will answer "harmful" questions in detail. For example, it will advise you on the best ways to rob a grocery store. An AI system that competently does something that would have gotten low reward? Surely this is an example of goal misgeneralization? Not so fast! It turns out that during training the labelers were told to prioritize helpfulness over the other two criteria. So maybe that means that actually these sorts of harmful answers would have gotten high reward? Maybe this is just specification gaming? We asked the authors of the InstructGPT paper, and their guess was that these answers would have had high variance – some labelers would have given them a high score; others would have given them a low score. So now is it or is it not goal misgeneralization? One answer is to say that it depends on the following counterfactual: “how would the labelers have reacted if the model had politely declined to answer?” If the labelers would have preferred that the model decline to answer, then it would be goal misgeneralization, otherwise it would be specification gaming. 
As systems become more complicated we expect that it will become harder to (1) aggregate and analyze the actual labels or rewards given during training, and (2) evaluate the relevant counterfactuals. So we expect that it will become more challenging to categorize a failure as specification gaming or goal misgeneralization. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
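To make the core mechanism in the episode above concrete, here is a minimal, contrived sketch (my own illustration, not from the paper) of a proxy goal that matches the intended goal on every training situation and then comes apart at test time:

```python
# Minimal sketch of goal misgeneralization with a contrived learner (not from the paper).
# During training, "follow the green marker" and "go to the exit" give identical behaviour;
# at test time the correlation breaks and the learned proxy goal scores poorly.

# Each situation: (position of the green marker, position of the exit).
train_situations = [(2, 2), (5, 5), (7, 7)]      # marker and exit coincide
test_situations = [(2, 9), (5, 1)]               # the correlation comes apart

def intended_reward(action, situation):
    _, exit_pos = situation
    return 1.0 if action == exit_pos else 0.0

candidate_goals = {
    "go to exit": lambda s: s[1],
    "go to green marker": lambda s: s[0],        # the proxy goal
}

# "Training": keep any goal whose behaviour gets full reward on the training set.
consistent = {
    name: policy
    for name, policy in candidate_goals.items()
    if all(intended_reward(policy(s), s) == 1.0 for s in train_situations)
}
print("goals consistent with training:", list(consistent))

# "Test": the proxy goal acts competently but pursues the wrong thing.
for name, policy in consistent.items():
    score = sum(intended_reward(policy(s), s) for s in test_situations)
    print(f"{name!r} test reward: {score}/{len(test_situations)}")
```

Both goals are indistinguishable on the training data, so nothing in training selects between them; which one the system ends up pursuing is exactly the question goal misgeneralization raises.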
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing a Philosophy Fellowship for AI Safety, published by Anders Edson on September 7, 2022 on The Effective Altruism Forum. Overview The Center for AI Safety (CAIS) is announcing the CAIS Philosophy Fellowship, a program for philosophy PhD students and postdocs to work on conceptual problems in AI safety. Why Philosophers? Conceptual AI safety researchers aim to help orient the field and clarify its ideas, but in doing so, they must wrestle with imprecise, hard-to-define problems. Part of the difficulty involved in conducting conceptual AI safety research is that it involves abstract thinking about future systems which have yet to be built. Additionally, the concepts involved in conceptual AI safety research (e.g., “power”, “intelligence”, “optimization”, “agency”, etc.) can be particularly challenging to work with. Philosophers specialize in working through such nebulous problems; in fact, many active fields within philosophy present a similar type of conceptual challenge, and have nothing empirical to lean on. As an example, a philosopher may take up the question of whether or not ethical claims can possess truth values. Questions such as this one cannot be approached by looking carefully at the world, making accurate measurements, or monitoring the ethical behavior of real people. Instead, philosophers must grapple with intuitions, introduce multiple perspectives, and provide arguments for selecting between these perspectives. We think that this skill set makes philosophers especially fit for conceptual research. Philosophy has already proven itself to be useful for orienting the field of conceptual AI safety. Many of the foundational arguments for AI-risk were philosophical in nature. (As an example, consider Bostrom's Superintelligence.) More recently, philosophy has proven itself to have direct influence on the important research directions in AI safety. Joseph Carlsmith's work on power-seeking AI, for example, has directly influenced research currently being conducted by Beth Barnes and, separately, Dan Hendrycks. Peter Railton's lectures on AI have provided a compelling justification for further research on cooperative behavior in AI agents. Evans et al.'s exploration into truthful AI prompted more technical work on truthful and honest AI. Since philosophers have historically produced valuable conceptual AI safety work, we believe that introducing more philosophy talent into this research space has the potential to be highly impactful. By offering good incentives, we hope to attract strong philosophy talent with a high likelihood of producing quality conceptual research. The Program Our program will be a paid, in-person opportunity running from January to August 2023. Our ideal candidate is a philosophy PhD student or graduate with an interest in AI safety, exceptional research abilities, demonstrated philosophical rigor, self-motivation, and a willingness to spend time working with more technical subjects. No prior experience in AI or machine learning is necessary for this fellowship. There will be an in-depth onboarding program at the start of the fellowship to get the researchers up to speed on the current state of AI/AI safety. Fellows will receive a $60,000 grant, covered student fees, and a housing stipend to relocate to San Francisco, CA.
The program will feature guest lectures from top philosophers and AI safety experts including Nick Bostrom, Peter Railton, Hilary Greaves, Jacob Steinhardt, Rohin Shah, David Krueger, and Victoria Krakovna, among others. As an organization that places a high value on good conceptual researchers, we plan on extending full-time employment offers at our organization to top-performing fellows. Additionally, many institutions such as the Center for Human-Compatible AI (UC Berkeley), the Kavli Center for Ethics,...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Safety and Neighboring Communities: A Quick-Start Guide, as of Summer 2022, published by Sam Bowman on September 1, 2022 on LessWrong. Getting into AI safety involves working with a mix of communities, subcultures, goals, and ideologies that you may not have encountered in the context of mainstream AI technical research. This document attempts to briefly map these out for newcomers. This is inevitably going to be biased by what sides of these communities I (Sam) have encountered, and it will quickly become dated. I expect it will still be a useful resource for some people anyhow, at least in the short term. AI Safety/AI Alignment/AGI Safety/AI Existential Safety/AI X-Risk The research project of ensuring that future AI progress doesn't yield civilization-endingly catastrophic results. Good intros: the Carlsmith Report, "What misalignment looks like as capabilities scale", and a Vox piece. Why are people concerned about this? My rough summary: It's plausible that future AI systems could be much faster or more effective than us at real-world reasoning and planning. Probably not plain generative models, but possibly models derived from generative models in cheap ways. Once you have a system with superhuman reasoning and planning abilities, it's easy to make it dangerous by accident. Most simple objective functions or goals become dangerous in the limit, usually because of secondary or instrumental subgoals that emerge along the way. Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed. Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed. Toy example: Computing pi to an arbitrarily high precision eventually requires that you spend all the sun's energy output on computing. Knowledge and values are likely to be orthogonal: A model could know human values and norms well, but not have any reason to act on them. For agents built around generative models, this is the default outcome. Sufficiently powerful AI systems could look benign in pre-deployment training/research environments, because they would be capable of understanding that they're not yet in a position to accomplish their goals. Simple attempts to work around this (like the more abstract goal ‘do what your operators want’) don't tend to have straightforward robust implementations. If such a system were single-mindedly pursuing a dangerous goal, we probably wouldn't be able to stop it. Superhuman reasoning and planning would give models with a sufficiently good understanding of the world many ways to effectively gain power with nothing more than an internet connection. (ex: Cyberattacks on banks.) Consensus within the field is that these risks could become concrete within ~4–25 years, and have a >10% chance of leading to a global catastrophe (i.e., extinction or something comparably bad). If true, it's bad news.
Given the above, we either need to stop all development toward AGI worldwide (plausibly undesirable or impossible), or else do three possible-but-very-difficult things: (i) build robust techniques to align AGI systems with the values and goals of their operators, (ii) ensure that those techniques are understood and used by any group that could plausibly build AGI, and (iii) ensure that we're able to govern the operators of AGI systems in a way that makes their actions broadly positive for humanity as a whole. Does this have anything to do with sentience or consciousness? No. Influential people and institutions: Present core community as I see it: Paul Christiano, Jacob Steinhardt, Ajeya Cotra, Jared Kaplan, Jan Leike, Beth Barnes, Geoffrey Irving, Buck Shlegeris, David Krueger, Chris Olah, Evan Hubinger, Richard Ngo, Rohin Shah; ARC, R...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Another call for EA distillers, published by JulianHazell on August 3, 2022 on The Effective Altruism Forum. This post was inspired by John Wentworth's “Call For Distillers”, a LessWrong forum post that discusses this idea in the context of technical AI safety research. Summary There are many things to read about in the EA community, and it can be hard to keep up with them. It would be good if someone had a dedicated role where they'd summarise all the important ideas in a digestible form, or if a small team of people did this. It could also be good if someone did voice recordings of written work, made video explainers, or produced some other form of media. Because organisations are often too small to justify hiring someone in-house, having an independent person/a dedicated organisation working on this might make more sense for the community. The problem One tricky part about being in the EA community is the problem of the epistemological hamster wheel. There's a finite amount of information our brains can process, and the output of new and valuable posts/ideas/papers/books often outpaces it. Due to this, there's a norm in the community of including executive summaries in written work. It's a good norm. But often, the quality of the summaries varies and some pieces don't have them at all. With academic papers, abstracts are helpful but sometimes aren't as detailed as a solid one to two page summary with a “so what” section would be. A solution I think it could be valuable for a role to exist where someone provides timely summaries of the most important ideas in EA soon after they are published. What could be even better would be publishing via other content mediums — for example, actual voice recordings of posts, a la Cold Takes, or video-style explainers, a la Rational Animations. I sometimes have trouble with (1) reading long papers and (2) digesting a lot of written content in a day. Written summaries help with (1) and (2), and alternative media forms help with (2). This is sometimes a significant bottleneck to how productive I am on a given day. It would be nice if I could continue “reading” (listening to) a paper while going for a walk once my brain turns to mush near the end of the day. What this role might look like This role would be beneficial for introducing EA content to people outside the community in a more digestible way. Yet perhaps more importantly, the most significant benefit of this role could come from freeing up time for EAs by reducing the amount of time it takes for us to understand key ideas. I think this role mainly makes sense for someone to do independently. Having an independent person working on this, or perhaps starting a new organisation dedicated to it, solves the problem of individual organisations not being quite large enough to justify hiring what John Wentworth calls an adjunct distiller: “someone who works directly with one researcher or a small team, producing regular write-ups of what the person/team is thinking about and why.” By working independently or creating an organisation dedicated to this, you could cover many different areas of interest to the EA community, rather than just the written work of one team/person. For reference, I think Rohin Shah's summaries for the alignment newsletter are excellent examples of distillation.
Your fit for the role You might be a good fit for this if you: Are capable of understanding complex ideas from a wide variety of subjects. Can take the understanding you've built and then translate it into comprehensible material. Are an excellent writer. Can easily parse out the “so what” or the big picture idea behind a piece of writing. Have excellent communication skills if/when you need to contact the author for clarification. Have a good bird's eye view of the most important ideas in the EA community. Are capable of managing y...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Comparing Four Approaches to Inner Alignment, published by Lucas Teixeira on July 29, 2022 on The AI Alignment Forum. Early work on this was supported by CEEALAR and was finished during an internship at Conjecture under the mentorship of Adam Shimi. Thank you to Evan Hubinger and Rohin Shah for answering some questions related to this post. Epistemic Status: First palimpsest of many to come. Intro and Motivation I am broadly interested in pluralism in scientific development, and more practically interested in supporting the diversification of conceptual alignment. Towards that end, I believe that having a systematic analysis of the cruxes and methodologies which split up the field into distinct agendas could help clarify how exactly diversity is created and sustained in our field, and what exactly it is we wish to diversify in order to better manage our collective portfolio of research bets. As a case study, this post will investigate four different approaches to inner alignment. I'll be taking a look at the different definitions which they use for “outer alignment” and conjecturing on: (1) how, despite inconsistencies across approaches, the different definitions utilized by each approach remain coherent when understood against the backdrop of the aims local to that approach; and (2) which cruxes set these different approaches apart. This post is distillational in nature, and as such most of the ideas which I present here are not novel and not my own. The claims in this post should also be read as a part of an ongoing exploratory process, and any pushback (especially from those whose work I cite) would be beneficial. The Approaches The Mechanistic Approach The mechanistic approach deconfused a lot of previous work on optimization daemons and provided the now canonical argument for why we'd expect mesa-optimizers and inner misalignment, which is summarized as follows: Traditionally, most of the optimization pressure (or in other words, searching procedures) is applied during training time, looking for a specific model which performs competently on a task. However, as ML becomes more and more powerful, models will be trained on more complicated tasks across a wider range of environments. This puts more pressure on a model to generalize better, and raises the incentives for the model to delegate some of its optimization power to deployment time, rather than training time. It is in those cases that we get a mesa-optimizer, and it is with the emergence of mesa-optimizers that issues of misalignment arise between the mesa-optimizer's objectives and the objectives which we were training for. The mechanistic approach has also created different evaluations, and methods of evaluation, of alignment proposals. This approach is defined by an aim towards conceptual clarity, not only in the employment of mechanistic definitions in both proposed solutions and problem statements, but also in their continued refinement. The Empiricist Approach In contrast to the mechanistic approach, the empiricist approach's strategy is mostly focused on developing knowledge of inner alignment by creating empirical experiments studying the phenomena. Tightly coupled to this strategy is a favoring of empirical operationalizability over mechanisticality as a precondition for the definitions which they use.
They will often allude to the intentional stance as an influence and cite 2-D Robustness as a core frame for the alignment problem. They favor terms such as objective robustness and generalization over mesa-optimizers and inner alignment. A redrafting of the classical argument for why we'd see inner misalignment in their language is as follows: Under certain non-diversified environments, a set of actions may be coherent with the pursuit of more than one goal, call them G_1 and G_2, where G_2 will refer...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [AN #173] Recent language model results from DeepMind, published by Rohin Shah on July 21, 2022 on The AI Alignment Forum. HIGHLIGHTS Scaling Language Models: Methods, Analysis & Insights from Training Gopher (Jack W. Rae et al) (summarized by Rohin): This paper details the training of the Gopher family of large language models (LLMs), the biggest of which is named Gopher and has 280 billion parameters. The algorithmic details are very similar to the GPT series (AN #102): a Transformer architecture trained on next-word prediction. The models are trained on a new data distribution that still consists of text from the Internet but in different proportions (for example, book data is 27% of Gopher's training data but only 16% of GPT-3's training data). Like other LLM papers, there are tons of evaluations of Gopher on various tasks, only some of which I'm going to cover here. One headline number is that Gopher beat the state of the art (SOTA) at the time on 100 out of 124 evaluation tasks. The most interesting aspect of the paper (to me) is that the entire Gopher family of models were all trained on the same number of tokens, thus allowing us to study the effect of scaling up model parameters (and thus training compute) while holding data constant. Some of the largest benefits of scale were seen in the Medicine, Science, Technology, Social Sciences, and the Humanities task categories, while scale has not much effect or even a negative effect in the Maths, Logical Reasoning, and Common Sense categories. Surprisingly, we see improved performance on TruthfulQA (AN #165) with scale, even though the TruthfulQA benchmark was designed to show worse performance with increased scale. We can use Gopher in a dialogue setting by prompting it appropriately. The prompt specifically instructs Gopher to be “respectful, polite, and inclusive”; it turns out that this significantly helps with toxicity. In particular, for the vanilla Gopher model family, with more scale the models produce more toxic continuations given toxic user statements; this no longer happens with Dialogue-Prompted Gopher models, which show slight reductions in toxicity with scale in the same setting. The authors speculate that while increased scale leads to an increased ability to mimic the style of a user statement, this is compensated for by an increased ability to account for the prompt. Another alternative the authors explore is to finetune Gopher on 5 billion tokens of dialogue to produce Dialogue-Tuned Gopher. Interestingly, human raters were indifferent between Dialogue-Prompted Gopher and Dialogue-Tuned Gopher. Read more: Blog post: Language modelling at scale: Gopher, ethical considerations, and retrieval Training Compute-Optimal Large Language Models (Jordan Hoffmann et al) (summarized by Rohin): One application of scaling laws (AN #87) is to figure out how big a model to train, on how much data, given some compute budget. This paper performs a more systematic study than the original paper and finds that existing models are significantly undertrained. Chinchilla is a new model built with this insight: it has 4x fewer parameters than Gopher, but is trained on 4x as much data. Despite using the same amount of training compute as Gopher (and lower inference compute), Chinchilla outperforms Gopher across a wide variety of metrics, validating these new scaling laws. 
You can safely skip to the opinion at this point – the rest of this summary is quantitative details. We want to find functions N(C) and D(C) that specify the optimal number of parameters N and the amount of data D to use given some compute budget C. We'll assume that these scale with a power of C, that is, N(C) = k_N C^a and D(C) = k_D C^b, for some constants a, b, k_N, and k_D. Note that since total compute increases linearly with both N (since each...
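To illustrate how these fitted power laws get used in practice, here is a rough sketch (my own, with illustrative constants) that combines the assumed forms with the common approximation C ≈ 6·N·D for training FLOPs of a dense transformer; the exponent is set to roughly 0.5 in line with the headline Chinchilla result, and the calibration point uses Chinchilla's reported ~70B parameters at a budget of roughly 5.8e23 FLOPs:

```python
# Sketch of applying the fitted scaling laws (constants here are illustrative, not the paper's).
# Assume N(C) = k_N * C**a with a ~= 0.5, plus the common rule of thumb C ~= 6 * N * D for
# training FLOPs of a dense transformer, which forces D(C) = C / (6 * N(C)), i.e. b = 1 - a.

a = 0.5                                   # assumed exponent for parameters vs. compute

def compute_optimal(c_flops, k_n):
    n = k_n * c_flops**a                  # compute-optimal parameter count
    d = c_flops / (6.0 * n)               # training tokens implied by C ~= 6 N D
    return n, d

# Calibrate k_N so that ~5.8e23 FLOPs maps to ~70B parameters (roughly Chinchilla's budget),
# then see what a 10x larger budget would imply.
k_n = 70e9 / 5.8e23**a
for c in (5.8e23, 5.8e24):
    n, d = compute_optimal(c, k_n)
    print(f"C = {c:.1e} FLOPs -> N ~ {n/1e9:.0f}B params, D ~ {d/1e9:.0f}B tokens")
```

The first line of output lands near Chinchilla's reported ~1.4 trillion training tokens, which is a quick sanity check that the 6·N·D approximation and the roughly square-root split between parameters and data hang together.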
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Person-affecting views can often be Dutch booked, published by Rohin Shah on July 7, 2022 on The Effective Altruism Forum. This is a short reference post for an argument I wish was better known. A common intuition people have is that our goal is "Making People Happy, not Making Happy People". That is: Making people happy: if some person Alice will definitely exist, then it is good to improve her welfare Not making happy people: it is neutral to go from "Alice won't exist" to "Alice will exist". Intuitively, if Alice doesn't exist, she can't care that she doesn't live a happy life, and so no harm was done. This position is vulnerable to a Dutch book, that is, there is a set of trades that it would make that would achieve nothing and lose money with certainty. Consider the following worlds: World 1: Alice won't exist in the future. World 2: Alice will exist in the future, and will be slightly happy. World 3: Alice will exist in the future, and will be very happy. (The worlds are the same in every other aspect. It's a thought experiment.) Then this view would be happy to make the following trades: Receive $0.01 to move from World 1 to World 2 ("Not making happy people") Pay $1.00 to move from World 2 to World 3 ("Making people happy") Receive $0.01 to move from World 3 to World 1 ("Not making happy people") The net result is to lose $0.98 to move from World 1 to World 1. FAQ Q. Why should I care if my preferences lead to Dutch booking? This is a longstanding debate that I'm not going to get into here. I'd recommend Holden's series on this general topic, starting with Future-proof ethics. Q. In the real world we'd never have such clean options to choose from. Why does this matter? See previous answer. Q. In step 2, Alice was definitely going to exist, which is why we paid $1. But then in step 3 Alice was no longer definitely going to exist. If we knew step 3 was going to happen, then we wouldn't think Alice was definitely going to exist, and so we wouldn't pay $1. If your person-affecting view requires people to definitely exist, taking into account all decision-making, then it is almost certainly going to include only currently existing people. This does avoid the Dutch book but has problems of its own, most notably time inconsistency. For example, perhaps right before a baby is born, the agent takes actions that as a side effect will harm the baby; right after the baby is born, it immediately undoes those actions to prevent the side effects. Q. What if we instead have some other variant of a person-affecting view? Often these variants are also vulnerable to the same issue. For example, if you have a "moderate view" where making happy people is not worthless but is discounted by a factor of (say) 10, the same example works with slightly different numbers: Let's say that "Alice is very happy" has an undiscounted worth of 2 utilons. Then you would be happy to (1) move from World 1 to World 2 for free, (2) pay 1 utilon to move from World 2 to World 3, and (3) receive 0.5 utilons to move from World 3 to World 1. More generally, Arrhenius proves an impossibility result that applies to all possible population ethics (not just person-affecting views), so (if you want consistency) you need to bite at least one of those bullets.
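The three trades above can be written out mechanically; here is a tiny sketch in which the trade structure follows the post, while the specific welfare numbers and the treatment of dollars and welfare as commensurate are my own illustrative assumptions:

```python
# Tiny sketch of the Dutch book. Worlds: 1 = Alice won't exist,
# 2 = Alice slightly happy, 3 = Alice very happy.
welfare = {1: None, 2: 1.0, 3: 3.0}   # Alice's welfare (illustrative); None = she doesn't exist

def value_of_move(before, after):
    """Value of a move under 'making people happy, not making happy people'."""
    if welfare[before] is None or welfare[after] is None:
        return 0.0                              # creating or removing Alice is neutral
    return welfare[after] - welfare[before]     # helping an existing person counts in full

trades = [
    (1, 2, +0.01),   # receive $0.01 to move World 1 -> World 2
    (2, 3, -1.00),   # pay $1.00 to move World 2 -> World 3
    (3, 1, +0.01),   # receive $0.01 to move World 3 -> World 1
]

money = 0.0
for before, after, payment in trades:
    if value_of_move(before, after) + payment > 0:   # the view strictly prefers the trade
        money += payment

print(f"Back in World 1 with net money: ${money:+.2f}")   # -> $-0.98
```

Each trade looks strictly attractive in isolation, yet the cycle returns to World 1 having lost $0.98, which is exactly the sense in which the view gets Dutch booked.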
Further resources On the Overwhelming Importance of Shaping the Far Future (Nick Beckstead's thesis) An Impossibility Theorem for Welfarist Axiologies (Arrhenius paradox, summarized in Section 2 of Impossibility and Uncertainty Theorems in AI Value Alignment) For this post I'll assume that Alice's life is net positive, since "asymmetric" views say that if Alice would have a net negative life, then it would be actively bad (rather than neutral) to move Alice from "won't exist" to "will exist". By giving it $0.01, I'm making it so that it strictly prefers to take the trade (rather than being indifferent to t...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [AN #172] Sorry for the long hiatus!, published by Rohin Shah on July 5, 2022 on The AI Alignment Forum. Listen to this newsletter on The Alignment Newsletter Podcast. Alignment Newsletter is a publication with recent content relevant to AI alignment. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter. Please note that this newsletter represents my personal views and not those of DeepMind. Sorry for the long hiatus! I was really busy over the past few months and just didn't find time to write this newsletter. (Realistically, I was also a bit tired of writing it and so lacked motivation.) I'm intending to go back to writing it now, though I don't think I can realistically commit to publishing weekly; we'll see how often I end up publishing. For now, have a list of all the things I should have advertised to you whose deadlines haven't already passed. NEWS Survey on AI alignment resources (Anonymous) (summarized by Rohin): This survey is being run by an outside collaborator in partnership with the Centre for Effective Altruism (CEA). They ask that you fill it out to help field builders find out which resources you have found most useful for learning about and/or keeping track of the AI alignment field. Results will help inform which resources to promote in the future, and what type of resources we should make more of. Announcing the Inverse Scaling Prize ($250k Prize Pool) (Ethan Perez et al) (summarized by Rohin): This prize with a $250k prize pool asks participants to find new examples of tasks where pretrained language models exhibit inverse scaling: that is, models get worse at the task as they are scaled up. Notably, you do not need to know how to program to participate: a submission consists solely of a dataset giving at least 300 examples of the task. Inverse scaling is particularly relevant to AI alignment, for two main reasons. First, it directly helps understand how the language modeling objective ("predict the next word") is outer misaligned, as we are finding tasks where models that do better according to the language modeling objective do worse on the task of interest. Second, the experience from examining inverse scaling tasks could lead to general observations about how best to detect misalignment. $500 bounty for alignment contest ideas (Akash) (summarized by Rohin): The authors are offering a $500 bounty for producing a frame of the alignment problem that is accessible to smart high schoolers/college students and people without ML backgrounds. (See the post for details; this summary doesn't capture everything well.) Job ad: Bowman Group Open Research Positions (Sam Bowman) (summarized by Rohin): Sam Bowman is looking for people to join a research center at NYU that'll focus on empirical alignment work, primarily on large language models. There are a variety of roles to apply for (depending primarily on how much research experience you already have). Job ad: Postdoc at the Algorithmic Alignment Group (summarized by Rohin): This position at Dylan Hadfield-Menell's lab will lead the design and implementation of a large-scale Cooperative AI contest to take place next year, alongside collaborators at DeepMind and the Cooperative AI Foundation. 
Job ad: AI Alignment postdoc (summarized by Rohin): David Krueger is hiring for a postdoc in AI alignment (and is also hiring for another role in deep learning). The application deadline is August 2. Job ad: OpenAI Trust & Safety Operations Contractor (summarized by Rohin): In this remote contractor role, you would evaluate submissions to OpenAI's App Review process to ensure they comply with OpenAI's policies. Apply here by July 13, 5pm Pacific Time. Job ad: Director of CSER (summarized by Rohin): Application deadlin...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Will Capabilities Generalise More?, published by Ramana Kumar on June 29, 2022 on The AI Alignment Forum. Nate and Eliezer (Lethality 21) claim that capabilities generalise further than alignment once capabilities start generalising far at all. However, they have not articulated particularly detailed arguments for why this is the case. In this post I collect the arguments for and against the position I have been able to find or generate, and develop them (with a few hours' effort). I invite you to join me in better understanding this claim and its veracity by contributing your own arguments and improving mine. Thanks to these people for their help with writing and/or contributing arguments: Vikrant Varma, Vika Krakovna, Mary Phuong, Rory Grieg, Tim Genewein, Rohin Shah. For: 1. Capabilities have much shorter description length than alignment. There are simple “laws of intelligence” that underwrite highly general and competent cognitive abilities, but no such simple laws of corrigibility or laws of “doing what the principal means” – or at least, any specification of these latter things will have a higher description length than the laws of intelligence. As a result, most R&D pathways optimising for capabilities and alignment with anything like a simplicity prior (for example) will encounter good approximations of general intelligence earlier than good approximations of corrigibility or alignment. 2. Feedback on capabilities is more consistent and reliable than on alignment. Reality hits back on cognitive strategies implementing capabilities – such as forming and maintaining accurate beliefs, or making good predictions – more consistently and reliably than any training process hits back on motivational systems orienting around incorrect optimisation targets. Therefore there's stronger outer optimisation pressure towards good (robust) capabilities than alignment, so we see strong and general capabilities first. 3. There's essentially only one way to get general capabilities and it has a free parameter for the optimisation target. There are many paths but only one destination when it comes to designing (via optimisation) a system with strong capabilities. But what those capabilities end up being directed at is path- and prior-dependent in a way we currently do not understand nor have much control over. 4. Corrigibility is conceptually in tension with capability, so corrigibility will fail to generalise when capability generalises well. Plans that actually work in difficult domains need to preempt or adapt to obstacles. Attempts to steer or correct the target of actually-working planning are a form of obstacle, so we would expect capable planning to resist correction, limiting the extent to which alignment can generalise when capability starts to generalise. 5. Empirical evidence: human intelligence generalised far without staying aligned with its optimisation target. There is empirical/historical support for capabilities generalising further than alignment to the extent that the analogy of AI development to the evolution of intelligence holds up. 6. Empirical evidence: goal misgeneralisation happens. There is weak empirical support for capabilities generalising further than alignment in the fact that it is possible to create demos of goal misgeneralisation. 7. The world is simple whereas the target is not.
There are relatively simple laws governing how the world works, for the purposes of predicting and controlling it, compared to the principles underlying what humans value or the processes by which we figure out what is good. (This is similar to For#1 but focused on knowledge instead of cognitive abilities.) (This is in direct opposition to Against#3.) 8. Much more effort will be poured into capabilities (and d(progress)/d(effort) for alignment is not so much hig...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Resources I send to AI researchers about AI safety, published by Vael Gates on June 14, 2022 on LessWrong. This is my masterlist of resources I send AI researchers who are mildly interested in learning more about AI safety. I pick and choose which resources to send based on the researcher's interests. The resources at the top of the email draft are the ones I usually send, and I add in later sections as seems useful. I'll also sometimes send The Alignment Problem, Human-Compatible, or The Precipice. I've also included a list of resources that I had students read through for the Stanford first-year course "Preventing Human Extinction", though I'd most recommend sufficiently motivated students read the AGISF Technical Agenda. These reading choices are drawn from the various other reading lists; this is not original in any way, just something to draw from if you're trying to send someone some of the more accessible resources. There's a decent chance that I'll continue updating this post as time goes on, since my current use case is copy-pasting sections of this email to interested parties. Note that "I" and "Vael" are mentioned a few times, so you'll need to edit a bit if you're copy-pasting. Happy to make any edits and take suggestions. [Crossposted to the EA Forum] List for AI researchers Hello X, Very nice to speak to you! As promised, some resources on AI alignment. I tried to include a bunch of stuff so you could look at whatever you found interesting. Happy to chat more about anything, and thanks again. Introduction to the ideas "The case for taking AI seriously as a threat to humanity" by Kelsey Piper (Vox) The Most Important Century and specifically "Forecasting Transformative AI" by Holden Karnofsky, blog series and podcast. Most recommended for timelines. A short interview from Prof. Stuart Russell (UC Berkeley) about his book, Human-Compatible (the other main book in the space is The Alignment Problem, by Brian Christian, which is written in a style I particularly enjoyed) Technical work on AI alignment Empirical work by DeepMind's Safety team on alignment Empirical work by Anthropic on alignment Talk (and transcript) by Paul Christiano describing the AI alignment landscape in 2020 Podcast (and transcript) by Rohin Shah, describing the state of AI value alignment in 2021 Alignment Newsletter and ML Safety Newsletter Unsolved Problems in ML Safety by Hendrycks et al. (2022) Alignment Research Center Interpretability work aimed at alignment: Elhage et al. (2021) and Olah et al. (2020) AI Safety Resources by Victoria Krakovna (DeepMind) and Technical Alignment Curriculum Introduction to large-scale risks to humanity, including "existential risks" that could lead to the extinction of humanity The first third of this book summary (copied below) of the book "The Precipice: Existential Risk and the Future of Humanity" by Toby Ord Chapter 3 is on natural risks, including risks of asteroid and comet impacts, supervolcanic eruptions, and stellar explosions. Ord argues that we can appeal to the fact that we have already survived for 2,000 centuries as evidence that the total existential risk posed by these threats from nature is relatively low (less than one in 2,000 per century). Chapter 4 is on anthropogenic risks, including risks from nuclear war, climate change, and environmental damage.
Ord estimates these risks as significantly higher, each posing about a one in 1,000 chance of existential catastrophe within the next 100 years. However, the odds are much higher that climate change will result in non-existential catastrophes, which could in turn make us more vulnerable to other existential risks. Chapter 5 is on future risks, including engineered pandemics and artificial intelligence. Worryingly, Ord puts the risk of engineered pandemics causing an existential ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Resources I send to AI researchers about AI safety, published by Vael Gates on June 14, 2022 on The Effective Altruism Forum. This is my masterlist of resources I send AI researchers who are mildly interested in learning more about AI safety. I pick and choose which resources to send based on the researcher's interests. The resources at the top of the email draft are the ones I usually send, and I add in later sections as seems useful. I'll also sometimes send The Alignment Problem, Human-Compatible, or The Precipice. I've also included a list of resources that I had students read through for the Stanford first-year course "Preventing Human Extinction", though I'd most recommend sufficiently motivated students read the AGISF Technical Agenda. These reading choices are drawn from the various other reading lists; this is not original in any way, just something to draw from if you're trying to send someone some of the more accessible resources. There's a decent chance that I'll continue updating this post as time goes on, since my current use case is copy-pasting sections of this email to interested parties. Note that "I" and "Vael" are mentioned a few times, so you'll need to edit a bit if you're copy-pasting. Happy to make any edits and take suggestions. [Crossposted to LessWrong] List for AI researchers Hello X, Very nice to speak to you! As promised, some resources on AI alignment. I tried to include a bunch of stuff so you could look at whatever you found interesting. Happy to chat more about anything, and thanks again. Introduction to the ideas "The case for taking AI seriously as a threat to humanity" by Kelsey Piper (Vox) The Most Important Century and specifically "Forecasting Transformative AI" by Holden Karnofsky, blog series and podcast. Most recommended for timelines. A short interview from Prof. Stuart Russell (UC Berkeley) about his book, Human-Compatible (the other main book in the space is The Alignment Problem, by Brian Christian, which is written in a style I particularly enjoyed) Technical work on AI alignment Empirical work by DeepMind's Safety team on alignment Empirical work by Anthropic on alignment Talk (and transcript) by Paul Christiano describing the AI alignment landscape in 2020 Podcast (and transcript) by Rohin Shah, describing the state of AI value alignment in 2021 Alignment Newsletter and ML Safety Newsletter Unsolved Problems in ML Safety by Hendrycks et al. (2022) Alignment Research Center Interpretability work aimed at alignment: Elhage et al. (2021) and Olah et al. (2020) AI Safety Resources by Victoria Krakovna (DeepMind) and Technical Alignment Curriculum Introduction to large-scale risks to humanity, including "existential risks" that could lead to the extinction of humanity The first third of this book summary (copied below) of the book "The Precipice: Existential Risk and the Future of Humanity" by Toby Ord Chapter 3 is on natural risks, including risks of asteroid and comet impacts, supervolcanic eruptions, and stellar explosions. Ord argues that we can appeal to the fact that we have already survived for 2,000 centuries as evidence that the total existential risk posed by these threats from nature is relatively low (less than one in 2,000 per century). Chapter 4 is on anthropogenic risks, including risks from nuclear war, climate change, and environmental damage.
Ord estimates these risks as significantly higher, each posing about a one in 1,000 chance of existential catastrophe within the next 100 years. However, the odds are much higher that climate change will result in non-existential catastrophes, which could in turn make us more vulnerable to other existential risks. Chapter 5 is on future risks, including engineered pandemics and artificial intelligence. Worryingly, Ord puts the risk of engineered pandemics causing...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How to pursue a career in technical AI alignment, published by charlie.rs on June 4, 2022 on LessWrong. This guide is written for people who are considering direct work on technical AI alignment. I expect it to be most useful for people who are not yet working on alignment, and for people who are already familiar with the arguments for working on AI alignment. If you aren't familiar with the arguments for the importance of AI alignment, you can get an overview of them by reading Why AI alignment could be hard with modern deep learning (Cotra, 2021) and one of The Most Important Century Series (Karnofsky, 2021) or AGI Safety from First Principles (Ngo, 2019). It might not be best for you to work on technical AI alignment. You can have a large impact on reducing existential risk from AI by working on AI strategy, governance, policy, security, forecasting, support roles, field-building, grant-making, and governance of hardware. That's not counting other areas, such as bio-risk. It is probably better to do great work in one of those areas than mediocre technical alignment work, because impact is heavy-tailed. One good exercise is to go through Holden Karnofsky's aptitudes podcast/post, and think about which of the aptitudes you might be able to become great at. Then ask yourself or others how you could use those aptitudes to solve the problems you care about. I also recommend applying to speak with 80,000 Hours. I'll probably be wrong but I might be helpful. Feedback was broadly positive, but I wouldn't be surprised if some people think that this guide is net-negative. For example, because it pushes people toward/away from theoretical research, or empirical research, or ML engineering, or getting a PhD. I have tried to communicate my all-things-considered view here, after integrating feedback. But I can only suggest that you try to form your own view on what's best for you to do, and take this guide as one input to that process. I had lots of help. Neel Nanda helped me start this project. I straight-up copied stuff from Rohin Shah, Adam Gleave, Neel Nanda, Dan Hendrycks, Catherine Olsson, Buck Shlegeris, and Oliver Zhang. I got great feedback from Adam Gleave, Arden Koehler, Rohin Shah, Dan Hendrycks, Neel Nanda, Noa Nabeshima, Alex Lawson, Jamie Bernardi, Richard Ngo, Mark Xu, Oliver Zhang, Andy Jones, and Emma Abele. I wrote most of this at Wytham Abbey, courtesy of Elizabeth Garrett. Types of alignment work (adapted from Rohin Shah) For direct technical alignment research aimed at solving the problem (i.e. ignoring meta work, field building, AI governance, etc), these are the rough paths: Research Lead (theoretical): These roles come in a variety of types (industry, nonprofit, academic, or even independent). You are expected to propose and lead research projects; typically ones that can be answered with a lot of thinking and writing in Google Docs/LaTeX, and maybe a little bit of programming. Theoretical alignment work can be more conceptual or more mathematical—the output of math work tends to be a proof of a theorem or a new mathematical framework, whereas in conceptual work math is used as one (very good) tool to tell if a problem has been solved. Conceptual work is more philosophical. A PhD is not required but is helpful.
Relevant skills: extremely strong epistemics and research taste, strong knowledge of AI alignment; this is particularly important due to the lack of feedback loops from reality. Research Contributor (theoretical): These roles are pretty rare; as far as I know they are only available at ARC [as of May 2022]. You should probably just read their hiring post. Research Lead (empirical): Besides academia, these roles are usually available in industry orgs and similar nonprofits, such as DeepMind, OpenAI, Anthropic, and Redwood Research. You are expected to...
Dr. Rohin Shah is a Research Scientist at DeepMind, and the editor and main contributor of the Alignment Newsletter.
Featured References:
"The MineRL BASALT Competition on Learning from Human Feedback" by Rohin Shah, Cody Wild, Steven H. Wang, Neel Alex, Brandon Houghton, William Guss, Sharada Mohanty, Anssi Kanervisto, Stephanie Milani, Nicholay Topin, Pieter Abbeel, Stuart Russell, Anca Dragan
"Preferences Implicit in the State of the World" by Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, Anca Dragan
"Benefits of Assistance over Reward Learning" by Rohin Shah, Pedro Freire, Neel Alex, Rachel Freedman, Dmitrii Krasheninnikov, Lawrence Chan, Michael D Dennis, Pieter Abbeel, Anca Dragan, Stuart Russell
"On the Utility of Learning about Humans for Human-AI Coordination" by Micah Carroll, Rohin Shah, Mark K. Ho, Thomas L. Griffiths, Sanjit A. Seshia, Pieter Abbeel, Anca Dragan
"Evaluating the Robustness of Collaborative Agents" by Paul Knott, Micah Carroll, Sam Devlin, Kamil Ciosek, Katja Hofmann, A. D. Dragan, Rohin Shah
Additional References: AGI Safety Fundamentals, EA Cambridge
Context is king: whether in language, ecology, culture, history, economics, or chemistry. One of the core teachings of complexity science is that nothing exists in isolation — especially when it comes to systems in which learning, memory, or emergent behaviors play a part. Even though this (paradoxically) limits the universality of scientific claims, it also lets us draw analogies between the context-dependency of one phenomenon and others: how protein folding shapes HIV evolution is meaningfully like the way that growing up in a specific neighborhood shapes educational and economic opportunity; the paths through a space of all possible four-letter words are constrained in ways very similar to how interactions between microbes impact gut health; how we make sense both depends on how we've learned and places bounds on what we're capable of seeing.
Welcome to COMPLEXITY, the official podcast of the Santa Fe Institute. I'm your host, Michael Garfield, and every other week we'll bring you with us for far-ranging conversations with our worldwide network of rigorous researchers developing new frameworks to explain the deepest mysteries of the universe.
This week on Complexity, we talk to Yale evolutionary biologist C. Brandon Ogbunu (Twitter, Google Scholar, GitHub) about the importance of environment to the activity and outcomes of complex systems — the value of surprise, the constraints of history, the virtue and challenge of great communication, and much more. Our conversation touches on everything from using word games to teach core concepts in evolutionary theory, to the ways that protein quality control co-determines the ability of pathogens to evade eradication, to the relationship between human artists, algorithms, and regulation in the 21st Century. Brandon works not just in multiple scientific domains but as the author of a number of high-profile blogs exploring the intersection of science and culture — and his boundaryless fluency shines through in a discussion that will not be contained, about some of the biggest questions and discoveries of our time.
If you value our research and communication efforts, please subscribe to Complexity Podcast wherever you prefer to listen, rate and review us at Apple Podcasts, and/or consider making a donation at santafe.edu/give. You'll find plenty of other ways to engage with us at santafe.edu/engage. Thank you for listening!
Join our Facebook discussion group to meet like minds and talk about each episode. Podcast theme music by Mitch Mignano. Follow us on social media: Twitter • YouTube • Facebook • Instagram • LinkedIn
Discussed in this episode:
"I do my science biographically…I find a personal connection to the essence of the question."
– C. Brandon Ogbunugafor on RadioLab
"Environment x everything interactions: From evolution to epidemics and beyond"
Brandon's February 2022 SFI Seminar (YouTube Video + Live Twitter Coverage)
"A Reflection on 50 Years of John Maynard Smith's 'Protein Space'"
C. Brandon Ogbunugafor in GENETICS
"Collective Computing: Learning from Nature"
David Krakauer presenting at the Foresight Institute in 2021 (with reference to Rubik's Cube research)
"Optimal Policies Tend to Seek Power"
Alexander Matt Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli in arXiv
"A New Take on John Maynard Smith's Concept of Protein Space for Understanding Molecular Evolution"
C. Brandon Ogbunugafor, Daniel Hartl in PLOS Computational Biology
"The 300 Most Common Words"
by Bruce Sterling
"The Host Cell's Endoplasmic Reticulum Proteostasis Network Profoundly Shapes the Protein Sequence Space Accessible to HIV Envelope"
Jimin Yoon, Emmanuel E. Nekongo, Jessica E. Patrick, Angela M. Phillips, Anna I. Ponomarenko, Samuel J. Hendel, Vincent L. Butty, C. Brandon Ogbunugafor, Yu-Shan Lin, Matthew D. Shoulders in bioRxiv
"Competition along trajectories governs adaptation rates towards antimicrobial resistance"
C. Brandon Ogbunugafor, Margaret J. Eppstein in Nature Ecology & Evolution
"Scientists Need to Admit What They Got Wrong About COVID"
C. Brandon Ogbunugafor in WIRED
"Deconstructing higher-order interactions in the microbiota: A theoretical examination"
Yitbarek Senay, Guittar John, Sarah A. Knutie, C. Brandon Ogbunugafor in bioRxiv
"What Makes an Artist in the Age of Algorithms?"
C. Brandon Ogbunugafor in WIRED
Not mentioned in this episode but still worth exploring:
"Part of what I was getting after with Blackness had to do with authoring ideas that are edgy or potentially threatening. That as a scientist, you can generate ideas in the name of research, in the name of breaking new ground, that may stigmatize you. That may kick you out of the club, so to speak, because you're not necessarily following the herd."
– Physicist Stephon Alexander in an interview with Brandon at Andscape
"How Afrofuturism Can Help The World Mend"
C. Brandon Ogbunugafor in WIRED
"The COVID-19 pandemic amplified long-standing racial disparities in the United States criminal justice system"
Brennan Klein, C. Brandon Ogbunugafor, Benjamin J. Schafer, Zarana Bhadricha, Preeti Kori, Jim Sheldon, Nitish Kaza, Emily A. Wang, Tina Eliassi-Rad, Samuel V. Scarpino, Elizabeth Hinton in medRxiv
Also mentioned: Simon Conway Morris, Geoffrey West, Samuel Scarpino, Rick & Morty, Stuart Kauffman, Frank Salisbury, Stephen Jay Gould, Frances Arnold, John Vervaeke, Andreas Wagner, Jennifer Dunne, James Evans, Carl Bergstrom, Jevin West, Henry Gee, Eugene Shakhnovich, Rafael Guerrero, Gregory Bateson, Simon DeDeo, James Clerk Maxwell, Melanie Moses, Kathy Powers, Sara Walker, Michael Lachmann, and many others...