The Nonlinear Library allows you to easily listen to top EA and rationalist content on your podcast player. We use text-to-speech software to create an automatically updating repository of audio content from the EA Forum, Alignment Forum, LessWrong, and other blogs.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Grantmaking is Funding-Limited Right Now, published by johnswentworth on July 19, 2023 on LessWrong. For the past few years, I've generally heard from alignment grantmakers that they're bottlenecked by projects/people they want to fund, not by amount of money. Grantmakers generally had no trouble funding the projects/people they found object-level promising, with money left over. In that environment, figuring out how to turn marginal dollars into new promising researchers/projects - e.g. by finding useful recruitment channels or designing useful training programs - was a major problem. Within the past month or two, that situation has reversed. My understanding is that alignment grantmaking is now mostly funding-bottlenecked. This is mostly based on word-of-mouth, but for instance, I heard that the recent Lightspeed Grants round received far more applications that passed the bar for basic promisingness than they could fund. I've also heard that the Long-Term Future Fund (which funded my current grant) now has insufficient money for all the grants they'd like to fund. I don't know whether this is a temporary phenomenon or a longer-term one. Alignment research has gone mainstream, so we should expect both more researchers interested and more funders interested. It may be that the researchers pivot a bit faster, but funders will catch up later. Or, it may be that the funding bottleneck becomes the new normal. Regardless, it seems like grantmaking is at least funding-bottlenecked right now. Some takeaways: If you have a big pile of money and would like to help, but haven't been donating much to alignment because the field wasn't money-constrained, now is your time! If this situation is the new normal, then earning-to-give for alignment may look like a more useful option again. That said, at this point committing to an earning-to-give path would be a bet on this situation being the new normal. Grants for upskilling, training junior people, and recruitment make a lot less sense right now from grantmakers' perspective. For those applying for grants, asking for less money might make you more likely to be funded. (Historically, grantmakers consistently tell me that most people ask for less money than they should; I don't know whether that will change going forward, but now is an unusually probable time for it to change.) Note that I am not a grantmaker; I'm just passing on what I hear from grantmakers in casual conversation. If anyone with more knowledge wants to chime in, I'd appreciate it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A brief history of computers, published by Adam Zerner on July 19, 2023 on LessWrong. Recently I've been learning about the history of computers. I find it to be incredibly interesting. I'd like to write a post about it to summarize and comment on what I've learned. I'm a little hesitant though. I'm no expert on this stuff. I'm largely learning about it all for the first time. So then, take all of this with a grain of salt. It's more of a conversation starter than a finished product. If you want something authoritative, I'd recommend the Stanford Encyclopedia of Philosophy. Logic Let's start with logic. Computers are largely based on Boolean logic. Y'know, 1s and 0s. AND, OR, NOT. George Boole did a bunch of important work here in the mid 1800s, but let's try backing up even further. Was there anything important that came before Boolean logic? Yeah, there was. It goes all the way back to Aristotle in ~350 BCE. Aristotle did a bunch of groundbreaking work in the field of logic. Furthermore, after "breaking the ground", there weren't any significant developments until the mid 1800s. Wow! That's a long time. An unusually long time. In other fields like mathematics, natural sciences, literature and engineering, there were significant advances. I wonder why things in the field of logic were so quiet. Anyway, let's talk about what exactly Aristotle did. In short, he looked at arguments in the abstract. It's one thing to say that: Filo is a dog. All dogs have feet. Therefore, Filo has feet. It's another thing to say that: R is a P. All P's have Q. Therefore, R has Q. The former is concrete. It's talking about dogs, feet and Filo. The latter is abstract. It's talking about P's, Q's and R's. Do you see the difference? Before Aristotle, people never thought about this stuff in terms of P's and Q's. They just thought about dogs and feet. Thinking about P's and Q's totally opened things up. Pretty cool. Abstraction is powerful. I think this is very much worth noting as an important milestone in the history of computers. Ok. So once Aristotle opened the floodgates with categorical logic, over time, people kinda piggybacked off of it and extended his work. For example, the Stoics did a bunch of work with propositional logic. Propositional logic is different from categorical logic. Categorical logic is about what categories things belong to. For example, earlier we basically said that dogs belong to the category of "things with feet" and that Filo belongs to the category of "dogs". With those two statements, we deduced that Filo must also belong to the category of "things with feet". It makes a lot of sense when you think about it visually. On the other hand, propositional logic is about things being true or false. For example, with these premises: It is raining. I don't have an umbrella. we can deduce things like: "It is raining or I have an umbrella" is true. Propositional logic is about truth and uses operators like AND, OR, NOT, IF-THEN, and IF-AND-ONLY-IF. Categorical logic is about categories and uses operators like ALL, NO and SOME. After propositional logic, subsequent work was done. For example, predicate logic kinda piggybacked off of propositional logic. But in general, nothing too crazy was going on. Let's jump ahead to the mid 1800s and George Boole. Boole introduced stuff like this (suppose p is true and q is false): (p and q) is false. (p or q) is true. (not (p and q)) is true. But wait a minute. I'm confused. 
Didn't we get that sort of thing from propositional logic all the way back in 300 BCE from the Stoics? In researching this question I'm seeing things saying that it did in fact already exist, it's just that Boole made it more "systematic and formalized". I don't understand though. In what way did he make it more systematic and formalized? Oh well. Suffice it to say that Boolean logic was a thing that we knew about. Let's move on. Jacquard loom I was going to star...
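To make the Boolean evaluations above concrete, here is a minimal Python sketch (not from the original post); the truth values p = True, q = False are my assumption, chosen so that the three expressions come out as stated.

```python
# A minimal sketch of Boolean evaluation as described above.
# The assignment p=True, q=False is an assumption chosen so that the
# example expressions evaluate as stated in the post.
p, q = True, False

print(p and q)        # False
print(p or q)         # True
print(not (p and q))  # True
```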
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Tiny Mech Interp Projects: Emergent Positional Embeddings of Words, published by Neel Nanda on July 18, 2023 on LessWrong. This post was written in a rush and represents a few hours of research on a thing I was curious about, and is an exercise in being less of a perfectionist. I'd love to see someone build on this work! Thanks a lot to Wes Gurnee for pairing with me on this. Tokens are weird, man. Introduction A particularly notable observation in interpretability in the wild is that part of the studied circuit moves around information about whether the indirect object of the sentence is the first or second name in the sentence. The natural guess is that heads are moving around the absolute position of the correct name. But even in prompt formats where the first and second names are in different absolute positions, they find that the information conveyed by these heads is exactly the same, and can be patched between prompt templates! (credit to Alexandre Variengien for making this point to me). This raises the possibility that the model has learned what I call emergent positional embeddings - rather than representing "this is the token in position 5" it may represent "this token is the second name in the sentence" or "this token is the fourth word in the sentence" or "this is the third sentence in the paragraph" etc. Intuitively, models will often want to do things like attend to the previous word, or the corresponding word in the previous sentence, etc - there are lots of things it will plausibly want to do that are natural in some emergent coordinate scheme but unnatural in the actual token coordinate scheme. I was curious about this, and spent an afternoon poking around with Wes Gurnee to see whether I could convince myself that these emergent positional embeddings were a thing. This post is an experiment: I'm speedrunning a rough write-up on a few hours of hacky experiments, because this seemed more interesting to write up than not to, and I was never going to do the high-effort version. Please take all this with a mountain of salt, and I'd love to see anyone build on my incredibly rough results - code here. Experiments You can see some terrible code for these experiments here. See the Appendix for technical details. I wanted to come up with the dumbest experiment I could that could shed light on whether this was a thing. One thing that models should really care about is the ability to attend to tokens in the previous word. Words can commonly range from 1 to 3 tokens (and maybe much longer for rare or misspelt words), so this is naturally done with an emergent scheme saying which word a token is part of. My experiment: I took prompts with a fixed prefix of 19 tokens and then seven random lowercase English words of varying token length, like token|izer| help| apple| dram|at|isation| architecture| sick| al|p|aca. I ran GPT-2 Small on this, looked at the residual stream after layer 3 (33% of the way through the model), and then trained a logistic regression probe on the residual stream of the token at the end of each word to predict which word it was in. This is the key plot, though it takes a bit of time to get your head around. The x-axis is the absolute position of the token in the prompt and the row is the ground truth of the word index. 
The bar for each absolute position and row shows the distribution of guesses given on the probe validation set. The colours correspond to the seven possible indices (note that the legend is not in numerical order, sigh). For example: take the third bar in the second row (index=1, abs_pos=22). This is mostly red (index = 1, correct!), with a bit of blue at the bottom (index = 0, incorrect) and a bit of green at the top (index = 2, incorrect). In contrast, the bar in the row below (second bar in the third row, index=2, abs_pos=23) is...
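As a rough illustration of the probing setup described in this episode (not the author's actual code), here is a minimal sketch: a logistic regression probe trained to predict word index from residual-stream activations. The activations below are random placeholders; in the real experiment they would be GPT-2 Small's layer-3 residual stream (d_model = 768) at the final token of each word, and the dataset sizes here are assumptions.

```python
# Minimal sketch of a word-index probe. X would normally hold GPT-2 Small
# residual-stream activations after layer 3, taken at the end of each word;
# here they are random placeholders so the script runs standalone.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_prompts, n_words, d_model = 2000, 7, 768          # assumed sizes

X = rng.normal(size=(n_prompts * n_words, d_model))  # placeholder activations
y = np.tile(np.arange(n_words), n_prompts)           # ground-truth word index (0..6)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", probe.score(X_val, y_val))  # ~1/7 on random data
```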
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Meta announces Llama 2; "open sources" it for commercial use, published by LawrenceC on July 18, 2023 on LessWrong. See also their Llama 2 website here, and their research paper here. From their blog post: Takeaways Today, we're introducing the availability of Llama 2, the next generation of our open source large language model. Llama 2 is free for research and commercial use. Microsoft and Meta are expanding their longstanding partnership, with Microsoft as the preferred partner for Llama 2. We're opening access to Llama 2 with the support of a broad set of companies and people across tech, academia, and policy who also believe in an open innovation approach to today's AI technologies. Compared to the first Llama, Llama 2 is trained for 2T tokens instead of 1.4T, has 2x the context length (4096 instead of 2048), uses Grouped Query Attention, and performs better across the board, with performance generally exceeding code-davinci-002 on benchmarks. They also release both a normal base model (Llama 2) and an RLHF'ed chat model (Llama 2-chat). Interestingly, they're only releasing the 7B/13B/70B models, and not the 34B model, "due to a lack of time to sufficiently red team". More importantly, they're releasing it on Microsoft Azure and also making it available for commercial use. The form for requesting access is very straightforward and does not require stating what you're using it for. (EDIT: they gave me access ~20 minutes after submitting the form, seems pretty straightforward.) Note that their license is not always technically free for commercial use; it contains the following clauses: [1.] v. You will not use the Llama Materials or any output or results of the Llama Materials to improve any other large language model (excluding Llama 2 or derivative works thereof). 2. Additional Commercial Terms. If, on the Llama 2 version release date, the monthly active users of the products or services made available by or for Licensee, or Licensee's affiliates, is greater than 700 million monthly active users in the preceding calendar month, you must request a license from Meta, which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the rights under this Agreement unless or until Meta otherwise expressly grants you such rights. See also the Llama 2 Acceptable Use Policy (which seems pretty standard). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
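As a rough sketch of why the Grouped Query Attention mentioned above matters in practice (not from the post), the snippet below estimates KV-cache size with and without GQA at Llama 2's 4096-token context length. The layer and head counts are my approximations for a 70B-scale model and may not match Meta's exact configuration.

```python
# Rough KV-cache arithmetic: GQA shares key/value heads across query heads,
# shrinking the cache needed per sequence. Counts below (80 layers, 64 query
# heads, 8 KV heads, head_dim 128, fp16) are assumptions for a 70B-scale model.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    # factor of 2 for storing both keys and values
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_param

seq_len = 4096  # Llama 2's context length
mha = kv_cache_bytes(80, 64, 128, seq_len)  # standard multi-head attention
gqa = kv_cache_bytes(80, 8, 128, seq_len)   # grouped-query attention (8 KV heads)
print(f"MHA KV cache: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB per sequence")
```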
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Still no Lie Detector for LLMs, published by Daniel Herrmann on July 18, 2023 on The AI Alignment Forum. Background This post is a short version of a paper we wrote that you can find here. You can read this post to get the core ideas. You can read the paper to go a little deeper. The paper is about probing decoder-only LLMs for their beliefs, using either unsupervised methods (like CCS from Burns) or supervised methods. We give both philosophical/conceptual reasons we are pessimistic and demonstrate some empirical failings using LLaMA 30b. By way of background, we're both philosophers, not ML people, but the paper is aimed at both audiences. Introduction One child says to the other "Wow! After reading some text, the AI understands what water is!". The second child says "All it understands is relationships between words. None of the words connect to reality. It doesn't have any internal concept of what water looks like or how it feels to be wet." Two angels are watching [some] chemists argue with each other. The first angel says "Wow! After seeing the relationship between the sensory and atomic-scale worlds, these chemists have realized that there are levels of understanding humans are incapable of accessing." The second angel says "They haven't truly realized it. They're just abstracting over levels of relationship between the physical world and their internal thought-forms in a mechanical way. They have no concept of [$!&&!@] or [#@]. You can't even express it in their language!" - Scott Alexander, "Meaningful". Do large language models (LLMs) have beliefs? And, if they do, how might we measure them? These questions are relevant because one important problem that plagues current LLMs is their tendency to generate falsehoods with great conviction. This is sometimes called lying and sometimes called hallucinating. One strategy for addressing this problem is to find a way to read the beliefs of an LLM directly off its internal state. Such a strategy falls under the broad umbrella of model interpretability, but we can think of it as a form of mind-reading. Detecting lies in LLMs has many obvious applications, and is especially relevant for things like ELK. We tackle the question about the status of beliefs in LLMs head-on. We proceed in two stages. First, we assume that LLMs do have beliefs, and consider two current approaches for how we might measure them, due to Azaria and Mitchell and to Burns et al. We provide empirical results from LLaMA 30b that show that these methods fail to generalize in very basic ways. We then argue that, even if LLMs have beliefs, these methods are unlikely to be successful for conceptual reasons. Thus, there is still no lie-detector for LLMs. After describing our empirical results we take a step back and consider whether or not we should expect LLMs to have something like beliefs in the first place. We consider some recent arguments aiming to show that LLMs cannot have beliefs. We show that these arguments are misguided and rely on a philosophical mistake. We provide a more productive framing of questions surrounding the status of beliefs in LLMs. Our analysis reveals both that there are many contexts in which we should expect systems to track the truth in order to accomplish other goals and that the question of whether or not LLMs have beliefs is largely an empirical matter. We provide code here. 
Challenge in Deciphering the Beliefs of Language Models For now, let's assume that in order to generate human-like text, LLMs (like humans) have beliefs about the world. We might then ask how we can measure and discover their beliefs. This question immediately leads to a number of problems: Unreliable Self-Reporting Asking an LLM directly about its beliefs is insufficient. As we've already discussed, models have a tendency to hallucinate or even lie. So...
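For readers unfamiliar with the unsupervised probing method mentioned above (CCS, from Burns et al.), here is a minimal PyTorch sketch of its objective, not the paper's or the authors' actual code: a small probe is trained so that its probabilities on a statement and its negation are complementary (consistency) without collapsing to 0.5 (confidence). The hidden states and dimensions below are placeholders.

```python
# Minimal sketch of the CCS (Contrast-Consistent Search) objective.
# h_pos / h_neg stand in for an LLM's hidden states on a statement and its
# negation; in practice they would be extracted (and normalized) from the model.
import torch
import torch.nn as nn

d_model = 512                                     # placeholder hidden size
probe = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

h_pos = torch.randn(256, d_model)                 # activations on "X is true"-style statements
h_neg = torch.randn(256, d_model)                 # activations on their negations

for _ in range(100):
    p_pos, p_neg = probe(h_pos).squeeze(-1), probe(h_neg).squeeze(-1)
    consistency = (p_pos - (1 - p_neg)) ** 2      # the two probabilities should be complementary
    confidence = torch.minimum(p_pos, p_neg) ** 2 # discourage the degenerate p = 0.5 solution
    loss = (consistency + confidence).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```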
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EAGxCambridge 2023 Retrospective, published by David Mears on July 18, 2023 on The Effective Altruism Forum. Written by the core organising team for EAGxCambridge, this retrospective evaluates the conference, gives feedback for the CEA events team, and makes recommendations for future organisers. It's also a general update for the community about the event, and an exercise in transparency. We welcome your feedback and comments about how we can improve EA conferences in the future. You can watch 19 of the talks on CEA's YouTube channel, here. Attendees' photos are here, and professional photos are in a subfolder. Summary We think EAGxCambridge went well. The main metric CEA uses to evaluate their events is 'number of connections'. We estimate around 4200 new connections resulted, at an all-told cost of around £53 per connection (=$67 at time of writing), which is a better cost-per-connection than many previous conferences. The low cost-per-connection is partly driven by the fact that the event was on the large side compared to the historical average (enabling economies of scale to kick in) and encompassed 3 days; it was also kept low by limiting travel grants. Of these 4200 new connections, around 1700 were potentially 'impactful' as rated by attendees. (Pinch of salt: as a rule, people don't know how impactful any given connection is.) The likelihood-to-recommend scores were on a par with other EA conferences, which are usually very highly rated. (The average answer was 8.7 on a 1-to-10 scale.) Besides making connections, we also wanted to encourage and inspire attendees to take action. 82% of survey respondents said they planned to take at least one of a list of actions (e.g. 'change degree') as a result of the conference, including 14.5% resolving to found an EA organisation and 30% resolving to work full-time for such an organisation or in a primary cause area. After applying a pinch of salt, those numbers suggest the conference inspired people to take significant action. We heard of several anecdotal cases where the conference triggered people to apply for particular jobs or funding, or resulted in internships or research collaborations. We're very thankful to everyone who made this happen: volunteers, attendees, session hosts, and many others. Contents For more in-depth commentary, click through to the relevant part of the Google Doc using the links below.
Basic stats
Strategy
Why Cambridge?
Focussing on the UK and Ireland
Core team
Compressed timelines
Budget
Admissions
Some admissions statistics
Stewardship
Venues
Main venue: Guildhall first floor
Secondary venue: ground floor
Tertiary venue: Lola Lo
Acoustics
Coordinating with venue staff on the day
Overall view
Volunteers
Numbers
Attendee experience
More snacks
Better acoustics
Faster wifi
Food was "incredible" / "amazing" / "extremely good" / "really excellent"
Attendee Slack
Content
Attendee favourites
"Were any sessions you attended particularly good? Which ones?"
Merch
Satellite events
Communications
Design
Comms strategy
Comms tactics
Closer to the conference itself
Feedback survey results
Net Promoter Score
Demographics
Resulting actions
Welcomingness
For the sake of the search index, those videos are:
Testing Your Suitability For AI Alignment Research | Esben Kran
Good News in Effective Altruism | Shakeel Hashim
Combating Catastrophic Biorisk With Indoor Air Safety | Jam Kraprayoon
Should Effective Altruists Focus On Air Pollution? | Tom Barnes
Alternative Proteins and How I Got There | Friederike Grosse-Holz
How Local Groups Can Have Global Impact | Cambridge Alternative Protein Project
Global Food System Failure And What We Can Do About It | Noah Wescombe
Infecting Humans For Vaccine Development | Jasmin Kaur
Existential Risk Pessimism And The Time Of Perils | David Thorstad
How To Make The Most of EAGx | Oll...
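A back-of-the-envelope check of the cost figures quoted in the summary above (my arithmetic, not the organisers'):

```python
# ~4200 connections at ~£53 each, of which ~1700 were rated potentially impactful.
connections = 4200
cost_per_connection_gbp = 53
total_cost_gbp = connections * cost_per_connection_gbp
print(f"implied total cost: ~£{total_cost_gbp:,}")                        # ~£222,600
impactful = 1700
print(f"cost per 'impactful' connection: ~£{total_cost_gbp / impactful:.0f}")  # ~£131
```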
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Measuring and Improving the Faithfulness of Model-Generated Reasoning, published by Ansh Radhakrishnan on July 18, 2023 on LessWrong. TL;DR: In two new papers from Anthropic, we propose metrics for evaluating how faithful chain-of-thought reasoning is to a language model's actual process for answering a question. Our metrics show that language models sometimes ignore their generated reasoning and other times don't, depending on the particular task + model size combination. Larger language models tend to ignore the generated reasoning more often than smaller models, a case of inverse scaling. We then show that an alternative to chain-of-thought prompting - answering questions by breaking them into subquestions - improves faithfulness while maintaining good task performance. Paper Abstracts Measuring Faithfulness in Chain-of-Thought Reasoning Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if circumstances such as the model size and task are carefully chosen. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior. 
Externalized Reasoning Oversight Relies on Faithful Reasoning Large language models (LLMs) are operating in increasingly challenging domains, ranging from programming assistance (Chen et al., 2021) to open-ended internet research (Nakano et al., 2021) and scientific writing (Taylor et al., 2022). However, verifying model behavior for safety and correctness becomes increasingly difficult as the difficulty of tasks increases. To make model behavior easier to check, one promising approach is to prompt LLMs to produce step-by-step "Chain-of...
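To illustrate the kind of intervention the first paper describes (perturbing or truncating the CoT and checking whether the answer changes), here is a hedged sketch of an "early answering"-style check. The answer_given_cot callable is a hypothetical stand-in for querying a model with the question plus a possibly truncated chain of thought; this is not the papers' actual implementation.

```python
# Sketch: truncate the chain of thought at each step and check whether the
# model already gives its final answer. If the answer is fixed before the CoT
# is complete, the stated reasoning looks more post-hoc (less faithful).
from typing import Callable, List

def early_answering_profile(
    answer_given_cot: Callable[[str, List[str]], str],  # hypothetical model query
    question: str,
    cot_steps: List[str],
) -> List[bool]:
    """For each truncation point k, does the model already give its final answer?"""
    final_answer = answer_given_cot(question, cot_steps)
    return [
        answer_given_cot(question, cot_steps[:k]) == final_answer
        for k in range(len(cot_steps) + 1)
    ]

# Toy usage with a fake "model" that only commits once it has seen every step.
fake_model = lambda q, steps: "B" if len(steps) >= 3 else "A"
print(early_answering_profile(fake_model, "Which option?", ["s1", "s2", "s3"]))
# [False, False, False, True] -> the answer depends on the full CoT (more faithful-looking)
```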
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Five Years of Rethink Priorities: Impact, Future Plans, Funding Needs (July 2023), published by Rethink Priorities on July 18, 2023 on The Effective Altruism Forum. Overview This piece highlights Rethink Priorities' accomplishments, mistakes, and changes since its establishment in 2018. We discuss RP's future plans as well as potential constraints to our impact. Finally, we call for donations and invite people to engage in an Ask Me Anything (AMA) discussion with Co-CEO Peter Wildeford that will be announced by July 19 in our newsletter and social media. You can also read this post as a PDF with visualizations. Executive summary Key accomplishments (2018-2023) In five years, RP has published over 125 pieces of research, completed more than another 100 research projects, provided various grantmakers with consultation, influenced tens of millions of dollars in funding, fiscally sponsored nine projects, and driven forward the promising field of invertebrate welfare. Specific accomplishments include: Collaborating with dozens of European Union (EU) animal advocacy organizations to work on setting medium-term policy strategies for farmed animal welfare. Providing expert consultation to the Chilean government as they considered a bill (which has advanced to the next legislative stage) to recognize animals as sentient. Contributing significantly to burgeoning fields, such as invertebrate welfare - including work related to shrimps and insects (see more here). Completing the Moral Weight Project to try to help funders decide how to best allocate resources across species. Producing 23 reports commissioned by Open Philanthropy answering their questions about global health and development issues and interventions. Conducting 205 tailored surveys and data analysis to help several organizations that build communities of people working on global priorities. Launching projects such as Condor Camp and fiscally sponsoring organizations like Epoch and Apollo Research via our Special Projects team, which provides operational support. Setting up an Artificial Intelligence (AI) Governance and Strategy team. Growing from a two-person operation in 2018 to a team that will soon include 75 RP employees, 30 contractors, and 25 staff of fiscally sponsored projects. Mistakes and challenges We believe that some of RP's past projects failed because we did not adequately consider the project's probability of success, its potential value, and the resources required. For example, our 2018 PriorityWiki project would have involved a large volunteer coordination effort we weren't well-placed to execute, and it is unclear how valuable it would have been even if successful. Our biggest early mistake was not building a plan for each project's path to influence and not putting enough resources into measuring our impact. We initially relied too much on producing research and hoping that it would be impactful just by existing. Thus far, we have not publicly shared as much of our existing internal impact tracking as we had initially intended due to time constraints. While we did some preparation prior to scaling in 2022, we would have liked to establish more robust project management systems beforehand. Our project timelines are not always predictable, as it is difficult to determine when you should stop researching due to diminishing returns. 
We think there are several cases where we spent too long working on a piece of research and would've had more impact by releasing the work earlier. Changes We now have better project management systems and spend much more time thinking through how to communicate our research and ensure each piece has a higher chance for impact. We are focusing more on measuring our impact, and hired a Chief Strategy Analyst last year. In addition to researching neglected areas (e....
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Proof of posteriority: a defense against AI-generated misinformation, published by jchan on July 17, 2023 on LessWrong. Summary A proof of priority shows that a piece of data must have been created before a certain time. A proof of posteriority shows that a piece of data must have been created after a certain time. By combining the two, you can prove that the data must have been created within a certain time interval. If it would be computationally infeasible for AI to generate the data within that time, then this proves that the data is authentic (i.e. depicting real events) and not AI-generated misinformation. Proof of priority is easy; proof of posteriority is tricky. Therefore the success of the whole scheme hinges upon the latter. Illustration of the two kinds of proofs: car rental case Let's say you're renting from ACME Rent-a-Car. You know that ACME is maximally sleazy and will try to charge you a fee for every dent and scratch, even if it was already like that beforehand. Therefore, before taking possession of the car, you take out your phone and record a video where you walk around the car, pointing out all the existing damage. You expect to use this video as evidence if/when a dispute arises. But this isn't enough - ACME can accuse you of having made the video at the end of your rental, after you've already caused all the damage. To guard against this accusation, you immediately embed a hash of your video into a Bitcoin transaction, which is incorporated into the public blockchain. Then you can point to this transaction as proof that the video existed before the time you are known to have driven the car off the lot. This is a proof of priority. When you return the car, you're faced with the dual problem. The car is in the same condition as when you picked it up, but you're worried that ACME will come after you for damage that occurs later (due to ACME's own recklessness, or some subsequent renter). So you make another walk-around video, pointing out all the undamaged areas. But again this isn't good enough, since ACME might accuse you of having made that video at the beginning of your rental, before you (supposedly) busted up the car. To guard against this accusation, you do the reverse of what you did before - you write the hash of the latest Bitcoin block on a piece of paper, and make sure that this paper is continually in view as you make the video. Then you can point to the block with that hash as proof that the video was made after you returned the car. This is a proof of posteriority. General principle Both proofs make use of a public, timestamped source of randomness (e.g. the Bitcoin blockchain) that (a) cannot be predicted in advance, and (b) can be modified so as to become "causally downstream" of whatever data you create. You prove priority by incorporating your data into the randomness, and you prove posteriority by incorporating the randomness into your data. Application: proving that data wasn't generated by AI We are already at the stage (or soon will be) when AI can generate convincing videos showing arbitrary things happening. This fatally undermines the entire information ecosystem to which we've grown accustomed in the modern era, since we can no longer accept the mere existence of a certain string of bits as proof that something happened. 
If you like living in a society where truth matters and not just power relations, then you'll think this is a bad thing. But we might be able to salvage some of the "trustless attestation" quality of video recordings if we can combine a proof of priority and posteriority to constrain the production of the video to such a short period of time that it would be computationally infeasible for AI to generate a similarly-realistic video within that time. By "computationally infeasible" I mean: it would requir...
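A minimal sketch of the two proofs described in this episode, using SHA-256 as the hash; the block hash shown is a made-up placeholder, and a real deployment would publish the commitment on-chain and read fresh randomness from the chain at recording time.

```python
# Sketch of proof of priority vs. proof of posteriority for a recording.
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

video_bytes = b"...raw bytes of the walk-around video..."  # placeholder data

# Proof of priority: publish H(video) somewhere timestamped and immutable
# (e.g., inside a Bitcoin transaction). The video must have existed before
# the block that includes the commitment.
commitment = sha256_hex(video_bytes)

# Proof of posteriority: incorporate fresh, unpredictable public randomness
# (e.g., the latest block hash) into the recording itself. The recording must
# have been made after that randomness became known.
latest_block_hash = "00000000000000000000aaaa..."  # placeholder, not a real block hash
video_with_randomness = video_bytes + latest_block_hash.encode()
attestation = sha256_hex(video_with_randomness)

print("priority commitment:", commitment)
print("posteriority attestation:", attestation)
```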
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Shaping Humanity's Longterm Trajectory, published by Toby Ord on July 18, 2023 on The Effective Altruism Forum. Since writing The Precipice, one of my aims has been to better understand how reducing existential risk compares with other ways of influencing the longterm future. Helping avert a catastrophe can have profound value due to the way that the short-run effects of our actions can have a systematic influence on the long-run future. But it isn't the only way that could happen. For example, if we advanced human progress by a year, perhaps we should expect to see us reach each subsequent milestone a year earlier. And if things are generally becoming better over time, then this may make all years across the whole future better on average. I've developed a clean mathematical framework in which possibilities like this can be made precise, the assumptions behind them can be clearly stated, and their value can be compared. The starting point is the longterm trajectory of humanity, understood as how the instantaneous value of humanity unfolds over time. In this framework, the value of our future is equal to the area under this curve and the value of altering our trajectory is equal to the area between the original curve and the altered curve. This allows us to compare the value of reducing existential risk to other ways our actions might improve the longterm future, such as improving the values that guide humanity, or advancing progress. Ultimately, I draw out and name 4 idealised ways our short-term actions could change the longterm trajectory: advancements, speed-ups, gains, and enhancements. And I show how these compare to each other, and to reducing existential risk. While the framework is mathematical, the maths in these four cases turns out to simplify dramatically, so anyone should be able to follow it. My hope is that this framework, and this categorisation of some of the key ways we might hope to shape the longterm future, can improve our thinking about longtermism. Some upshots of the work: Some ways of altering our trajectory only scale with humanity's duration or its average value - but not both. There is a serious advantage to those that scale with both: speed-ups, enhancements, and reducing existential risk. When people talk about 'speed-ups', they are often conflating two different concepts. I disentangle these into advancements and speed-ups, showing that we mainly have advancements in mind, but that true speed-ups may yet be possible. The value of advancements and speed-ups depends crucially on whether they also bring forward the end of humanity. When they do, they have negative value. It is hard for pure advancements to compete with reducing existential risk as their value turns out not to scale with the duration of humanity's future. Advancements are competitive in outcomes where value increases exponentially up until the end time, but this isn't likely over the very long run. Work on creating longterm value via advancing progress is most likely to compete with reducing risk if the focus is on increasing the relative progress of some areas over others, in order to make a more radical change to the trajectory. The work is appearing as a chapter for the forthcoming book, Essays on Longtermism, but as of today, you can also read it online here. Thanks for listening. 
To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org
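A minimal formalisation of the trajectory framework summarised in the preceding episode, in my own notation rather than Ord's:

```latex
% Let v(t) be the instantaneous value of humanity at time t, and T the time
% at which humanity ends. (My notation, not necessarily the chapter's.)
\[
  V \;=\; \int_{0}^{T} v(t)\,\mathrm{d}t
  \qquad \text{(value of the future = area under the trajectory)}
\]
\[
  \Delta V \;=\; \int_{0}^{T'} \tilde{v}(t)\,\mathrm{d}t \;-\; \int_{0}^{T} v(t)\,\mathrm{d}t
  \qquad \text{(value of an altered trajectory $\tilde{v}$ with possibly new end time $T'$)}
\]
% Example: a pure advancement by a years shifts the curve left, \tilde{v}(t) = v(t + a).
% If it also brings the end of humanity forward to T - a, the extra value gained early
% can be outweighed by the lost final years, which is why such advancements can have
% negative value.
```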
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: New career review: AI safety technical research, published by Benjamin Hilton on July 18, 2023 on The Effective Altruism Forum. Note: this post is a (minorly) edited version of a new 80,000 Hours career review. Progress in AI - while it could be hugely beneficial - comes with significant risks. Risks that we've argued could be existential. But these risks can be tackled. With further progress in AI safety, we have an opportunity to develop AI for good: systems that are safe, ethical, and beneficial for everyone. This article explains how you can help. Summary Artificial intelligence will have transformative effects on society over the coming decades, and could bring huge benefits - but we also think there's a substantial risk. One promising way to reduce the chances of an AI-related catastrophe is to find technical solutions that could allow us to prevent AI systems from carrying out dangerous behaviour. Pros Opportunity to make a significant contribution to a hugely important area of research Intellectually challenging and interesting work The area has a strong need for skilled researchers and engineers, and is highly neglected overall Cons Due to a shortage of managers, it's difficult to get jobs and might take you some time to build the required career capital and expertise You need a strong quantitative background It might be very difficult to find solutions There's a real risk of doing harm Key facts on fit You'll need a quantitative background and should probably enjoy programming. If you've never tried programming, you may be a good fit if you can break problems down into logical parts, generate and test hypotheses, possess a willingness to try out many different solutions, and have high attention to detail. If you already: Are a strong software engineer, you could apply for empirical research contributor roles right now (even if you don't have a machine learning background, although that helps) Could get into a top 10 machine learning PhD, that would put you on track to become a research lead Have a very strong maths or theoretical computer science background, you'll probably be a good fit for theoretical alignment research Recommended If you are well suited to this career, it may be the best way for you to have a social impact. Thanks to Adam Gleave, Jacob Hilton and Rohin Shah for reviewing this article. And thanks to Charlie Rogers-Smith for his help, and his article on the topic - How to pursue a career in technical AI alignment. Why AI safety technical research is high impact As we've argued, in the next few decades, we might see the development of hugely powerful machine learning systems with the potential to transform society. This transformation could bring huge benefits - but only if we avoid the risks. We think that the worst-case risks from AI systems arise in large part because AI systems could be misaligned - that is, they will aim to do things that we don't want them to do. In particular, we think they could be misaligned in such a way that they develop (and execute) plans that pose risks to humanity's ability to influence the world, even when we don't want that influence to be lost. We think this means that these future systems pose an existential threat to civilisation. 
Even if we find a way to avoid this power-seeking behaviour, there are still substantial risks - such as misuse by governments or other actors - which could be existential threats in themselves. There are many ways in which we could go about reducing the risks that these systems might pose. But one of the most promising may be researching technical solutions that prevent unwanted behaviour - including misaligned behaviour - from AI systems. (Finding a technical way to prevent misalignment in particular is known as the alignment problem.) In the past few years, we've seen more o...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Effective Altruism and the strategic ambiguity of 'doing good', published by Jeroen De Ryck on July 18, 2023 on The Effective Altruism Forum. Whilst Googling around for something entirely unrelated, I stumbled on a discussion paper published in January of 2023 about Effective Altruism that argues Global Health & Wellbeing is basically a facade to get people into the way more controversial core of longtermism. I couldn't find anything posted about it elsewhere on the forum, so I'll try to summarise here. The paper argues that there is a big distinction between what they call public facing EA and Core EA. The former cares about global health and wellbeing (GH&W) whereas the latter cares about x-risks, animal welfare and "helping elites get advanced degrees" (which I'll just refer to as core topics). There are several more distinctions between public EA and core EA, e.g. about impartiality and the importance of evidence and reason. The author argues, based on quotes from a variety of posts from a variety of influential people within EA, that for the core audience, GH&W is just a facade such that EA is perceived as 'good' by the broader public, whilst the core members work on much more controversial core topics such as transhumanism that go against many of the principles put forward by GH&W research and positions. The author seems to claim that this was done on purpose and that GH&W merely exists as a method to "convert more recruits" to the controversial transhumanist core that EA now is. This substantial distinction between GH&W and core topics causes an identity crisis between people who genuinely believe that EA is about GH&W and people who have been convinced of the core topics. The author says that these distinctions have always existed, but have been purposely hidden with nice-sounding GH&W topics by a few core members (such as Yudkowsky, Alexander, Todd, Ord, MacAskill), as a transhumanist agenda would be too controversial for the public, even though it was, and always has been, EA's real goal. To quote from the final paragraph from the paper: The 'EA' that academics write about is a mirage, albeit one invoked as shorthand for a very real phenomenon, i.e., the elevation of RCTs and quantitative evaluation methods in the aid and development sector. [...] Rather, my point is that these articles and the arguments they make - sophisticated and valuable as they are - are not about EA: they are about the Singer-solution to global poverty, effective giving, and about the role of RCTs and quantitative evaluation methods in development practice. EA is an entirely different project, and the magnitude and implications of that project cannot be grasped until people are willing to look at the evidence beyond EA's glossy front-cover, and see what activities and aims the EA movement actually prioritizes, how funding is actually distributed, whose agenda is actually pursued, and whose interests are actually served. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Existential Risk Persuasion Tournament, published by PeterMcCluskey on July 17, 2023 on LessWrong. I participated last summer in Tetlock's Existential Risk Persuasion Tournament (755(!) page paper here). Superforecasters and "subject matter experts" engaged in a hybrid between a prediction market and debates, to predict catastrophic and existential risks this century. I signed up as a superforecaster. My impression was that I knew as much about AI risk as any of the subject matter experts with whom I interacted (the tournament was divided up so that I was only aware of a small fraction of the 169 participants). I didn't notice anyone with substantial expertise in machine learning. Experts were apparently chosen based on having some sort of respectable publication related to AI, nuclear, climate, or biological catastrophic risks. Those experts were more competent, in one of those fields, than news media pundits or politicians. I.e. they're likely to be more accurate than random guesses. But maybe not by a large margin. That expertise leaves much to be desired. I'm unsure whether there was a realistic way for the sponsors to attract better experts. There seems to be not enough money or prestige to attract the very best experts. Incentives The success of the superforecasting approach depends heavily on forecasters having decent incentives. It's tricky to give people incentives to forecast events that will be evaluated in 2100, or evaluated after humans go extinct. The tournament provided a fairly standard scoring rule for questions that resolve by 2030. That's a fairly safe way to get parts of the tournament to work well. The other questions were scored by how well the forecast matched the median forecast of other participants (excluding participants that the forecasters interacted with). It's hard to tell whether that incentive helped or hurt the accuracy of the forecasts. It's easy to imagine that it discouraged forecasters from relying on evidence that is hard to articulate, or hard to verify. It provided an incentive for groupthink. But the overall incentives were weak enough that altruistic pursuit of accuracy might have prevailed. Or ideological dogmatism might have prevailed. It will take time before we have even weak evidence as to which was the case. One incentive that occurred to me toward the end of the tournament was the possibility of getting a verified longterm forecasting track record. Suppose that in 2050 they redo the scores based on evidence available then, and I score in the top 10% of tournament participants. That would likely mean that I'm one of maybe a dozen people in the world with a good track record for forecasting 28 years into the future. I can imagine that being valuable enough for someone to revive me from cryonic suspension when I'd otherwise be forgotten. There were some sort of rewards for writing comments that influenced other participants. I didn't pay much attention to those. Quality of the Questions There were many questions loosely related to AGI timelines, none of them quite satisfying my desire for something closely related to extinction risk that could be scored before it's too late to avoid the risk. One question was based on a Metaculus forecast for an advanced AI. It seems to represent clear progress toward the kind of AGI that could cause dramatic changes. 
But I expect important disagreements over how much progress it represents: What scale should we use to decide how close such an AI is to a dangerous AI? Does the Turing test use judges who have expertise in finding the AI's weaknesses? Another question was about when Nick Bostrom will decide that an AGI exists. Or, if he doesn't say anything clear, a panel of experts will guess what Bostrom would say. That's pretty close to a good question to forecast. Can we assume tha...
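To make the incentive concern above concrete, here is a minimal Python sketch contrasting a proper scoring rule (a Brier score, one plausible reading of the "fairly standard scoring rule" for questions resolving by 2030) with a median-matching score of the kind described for the longer-horizon questions. The tournament's exact formulas are not given in the post, so the function names and numbers below are illustrative assumptions.

from statistics import median

def brier_score(p, outcome):
    # Proper scoring rule: lower is better, minimized in expectation by reporting your true probability.
    return (p - outcome) ** 2

def median_matching_score(p, peer_forecasts):
    # Illustrative stand-in for scoring against the median of other participants:
    # lower is better, minimized by predicting what the median peer will say.
    return (p - median(peer_forecasts)) ** 2

peers = [0.02, 0.03, 0.05, 0.10, 0.20]
my_belief = 0.30
print(brier_score(my_belief, outcome=1))
print(median_matching_score(my_belief, peers))

The contrast is the point: the first score rewards reporting your true belief once the question resolves, while the second rewards predicting the median peer forecast now, which is the groupthink incentive discussed above.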
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Predictive history classes, published by dkl9 on July 17, 2023 on LessWrong. Why do we study history? There are many potential reasons. More specifically, why do we teach history to everyone going to school? Many reasons become less relevant to those other than historians or history teachers. The classic reason is "so that we don't repeat it". But most people will not end up in a position of power so as to decide whether we repeat or don't repeat history. A more honest reason, for most people, is that we learn history to generalise its lessons to the future or present. Assuming that purpose, we find current methods of history classes dreadfully ineffective. I propose an alternative paradigm of history classes which, I believe, will result in much more practical learning. A cynical reader remarks that history education in school (in the form I'm targeting) is for indoctrination. From that view, take this not as a proposal for reform, but a suggestion on how to study well for your own sake. Current history education focuses on teaching the students about past events. Students are then tested on those same past events. In higher history classes, they are also tested on analysing information about these events and arguing about their causes and consequences. This is great for those who will be historians. Most students will not. This style of teaching does not guarantee any understanding of how to apply historical lessons to the present and future. In schools where students just study to the test, make the test a good one. Instead of testing students on past events they studied, test them on events they didn't study. They are then forced to learn how to generalise history, applying its lessons to understand what they're going thru and what may come next. To get an objective answer by which to grade the students, test them on actual past events - just obscure ones that they wouldn't already know about. Sith they didn't study the material, they aren't expected to ever get a perfect score. This would make history classes more like maths or science classes, in that the students learn methods to creatively apply to problems they haven't exactly seen before. This guiding principle should apply beyond the tests. We would replace history classes with predictive history classes. We teach about events and the patterns which underlie them, assign students to study and predict parts of new-to-them events on their own, and test by soliciting predictions on unfamiliar bits of history. The argument for predictive history classes comes not just from mere pragmatism. Epistemically, current history classes are almost devoid of content. We teach students to look at events and explain them, but they never have to question what they're told. Sure, they're to think critically about conclusions from the events or claims regarding their causes, but the event itself goes unquestioned. They should be able to recognise which events are plausible or implausible iff they have a true understanding of history. Current history courses make no effort to teach them this. If you are equally good at explaining any outcome, you have no knowledge. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What do y'all think of pay gaps in EA?, published by JohnSnow on July 17, 2023 on The Effective Altruism Forum. I saw a CEA role advertised for twice the salary of an AMF job, whereas they do not seem to differ dramatically in expected level of expertise / experience. Even if one believes they can make more impact at AMF, they would have to give up 20k pounds in salary by passing on the content specialist role. We learned recently to consider earning less, but this may still be quite the conundrum. What do you think? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AutoInterpretation Finds Sparse Coding Beats Alternatives, published by Hoagy on July 17, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, Robert Huben for their work on the sparse coding project, Lee Sharkey and Chris Mathwin for comments on the draft, EleutherAI for compute and OpenAI for GPT-4 credits. Summary We use OpenAI's automatic interpretation protocol to analyse features found by dictionary learning using sparse coding and compare the interpretability scores thereby found to a variety of baselines. We find that for both the residual stream (layer 2) and MLP (layer 1) of Eleuther's Pythia70M, sparse coding learns a set of features that is superior to all tested baselines, even when removing the bias and looking just at the learnt directions. In doing so we provide additional evidence to the hypothesis that NNs should be conceived as using distributed representations to represent linear features which are only weakly anchored to the neuron basis. As before, these results are still somewhat preliminary and we hope to expand on them and make them more robust over the coming month or two, but we hope people find them fruitful sources of ideas. If you want to discuss, feel free to message me or head over to our thread in the EleutherAI discord. All code available at the github repo. Methods Sparse Coding The feature dictionaries learned by sparse coding are learnt by simple linear autoencoders with a sparsity penalty on the activations. For more background on the sparse coding approach to feature-finding see the Conjecture interim report that we're building from, or Robert Huben's explainer. Automatic Interpretation As Logan Riggs recently found, many of the directions found through sparse coding seem highly interpretable, but we wanted a way to quantify this, and make sure that we were detecting a real difference in the level of interpretability. To do this we used the methodology outlined in this OpenAI paper; details can be found in their code base. To quickly summarise, we are analysing features which are defined as scalar-valued functions of the activations of a neural network, limiting ourselves here to features defined on a single layer of a language model. The original paper simply defined features as the activation of individual neurons but we will in general be looking at linear combinations of neurons. We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example 'the feature activates on legal terminology'. Then, we give this explanation to an LLM and ask it to predict the feature for hundreds of different contexts, so if the tokens are ['the' 'lawyer' 'went' 'to' 'the' 'court'] the predicted activations might be [0, 10, 0, 0, 0, 8]. The score is defined as the correlation between the true and predicted activations. To generate the explanations we follow OpenAI and take a 64-token sentence-fragment from each of the first 50,000 lines of OpenWebText. For each feature, we calculate the average activation and take the 20 fragments with the highest activation. Of these 20, we pass 5 to GPT-4, along with the rescaled per-token activations. 
From these 5 fragments, GPT-4 suggests an explanation for when the neuron fires. GPT3.5 is then used to simulate the feature, given the explanation, across both another 5 highly activating fragments, and 5 randomly selected fragments (with non-zero variation). The correlation scores are calculated across all 10 fragments ('top-and-random'), as well as for the top and random fragments separately. Comparing Feature Dictionaries We use dictionary learning with a sparsi...
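For readers who want a concrete picture of the dictionary-learning step described above, here is a minimal PyTorch sketch of a linear autoencoder with an L1 sparsity penalty on its codes, trained on pre-collected activations. The sizes, learning rate and penalty weight are illustrative assumptions, not the authors' settings, and the random tensor stands in for real Pythia-70M activations.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_dict):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # produces the sparse feature codes
        self.decoder = nn.Linear(d_dict, d_model)   # its weight columns act as dictionary directions

    def forward(self, x):
        codes = torch.relu(self.encoder(x))
        return self.decoder(codes), codes

d_model, d_dict = 512, 2048                         # illustrative sizes
sae = SparseAutoencoder(d_model, d_dict)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3                                     # sparsity penalty weight

acts = torch.randn(4096, d_model)                   # stand-in for activations collected from one layer
for step in range(100):
    recon, codes = sae(acts)
    loss = ((recon - acts) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

After training, the columns of the decoder weight (one per code dimension) are the candidate feature directions that the auto-interpretation protocol then scores.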
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sapient Algorithms, published by Valentine on July 17, 2023 on LessWrong. I notice my mind runs lots of cached programs. Like "walk", "put away the dishes", "drive home", "go to the bathroom", "check phone", etc. Most of these can run "on autopilot". I don't know how to define that formally. But I'm talking about how, e.g., I can start driving and get lost in thought and suddenly discover I'm back home - sometimes even if that wasn't where I was trying to go! But some programs cannot run on autopilot. The algorithm has something like a "summon sapience" step in it. Even if the algorithm got activated due to autopilot, some step turns it off. When I look at the examples of sapient algorithms that I run, I notice they have a neat kind of auto-generalization nature to them. I have some reason to think that property is general. It's the opposite of how, e.g., setting up webpage blockers can cause my fingers to autopilot learn how to bypass them. I'll try to illustrate what I mean via examples. Example: Look at my car keys I got tired of risking locking my keys in my car. So I started making a habit of looking at my keys before closing the door. Once, right after I'd closed the locked car door, I realized I'd looked at the phone in my hand and shut the door anyway. Luckily the key was in my pocket. But I noticed that this autopilot program just wasn't helping. So I modified it (as a TAP): If I was about to close the car door, I would look at my hand, turn on consciousness, and check if I was actually looking at my keys. First, that TAP just worked. To this day I still do this when stepping out of a car. Second, it generalized without my trying to: After a while it would fire whenever I was about to close any locked door. It then generalized to anyone I was with. If they were about to close a locked door, I would sort of "pop awake" with a mental question about whether someone had the key. It then generalized even more. It now fires when I'm, say, preparing for international travel. Crossing a border feels a bit like going through a door that locks behind me. So now I "wake up" and check that I and my travel companions all have passports. (I would usually check before, but now it's specifically this mental algorithm that summons sapience. It's reliable instead of being an extra thing to remember.) This generalization wasn't intentional. But it's been really good. I haven't noticed any problems at all from this program sort of spreading on its own. Example: Taste my food When I'm in autopilot mode while eating, it can feel at the end like my food kind of vanished. Like I wasn't there for the meal. So I installed a TAP: If I'm about to put food in my mouth, pause & remember emptiness. "Remember emptiness" has a "summon sapience" type move embedded in it. It's something like "Turn on consciousness, pause, and really look at my sensory input." It's quite a bit deeper than that, but if this kind of emptiness just sounds like gobbledegook to you then you can pretend I said the simplified version. In this case, the TAP itself didn't install as cleanly as with the car keys example. Sometimes I just forget. Sometimes the TAP fires only after I've taken my first bite. But all the same, the algorithm still sort of auto-generalized. 
When I'm viewing a beautiful vista, or am part of a touching conversation, or hear some lovely music, the TAP sometimes fires (about as regularly as with food). One moment there are standard programs running, and then all of a sudden "I'm there" and am actually being touched by whatever it is (the same way I'm actually tasting my food when I'm "there"). Yesterday I noticed this sapient algorithm booting up in a conversation. Someone asked me "Can I speak plainly?" and I knew she was about to say something I'd find challenging to receive. My autopilot start...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on "Process-Based Supervision", published by Steven Byrnes on July 17, 2023 on LessWrong. 1. Post Summary / Table of Contents In "How might we align transformative AI if it's developed very soon?", Holden Karnofsky talked about "Process-Based Supervision", citing a previous post by Stuhlmüllert & Byun of Ought. (Holden says he got the idea mainly from Paul Christiano.) I apparently misunderstood what Holden meant by "Process-Based Supervision", and it took many hours and a 7000-word comment thread before I figured it out. (Thanks to Holden for his extraordinary patience during that protracted discussion.) The extremely short version for AI alignment domain experts is: I currently think of the load-bearing ingredients of Holden's take on "process-based supervision" as being: AI boxing some of the time (specifically, during the periods where the AI is "thinking"); "Myopic training" (e.g. as defined here); NOT aspiring to be a complete solution to safety, but rather a more modest attempt to help avoid situations where we (accidentally) positively reinforce the AI for engaging in directly dangerous behavior. You could think of this as an intervention that directly and straightforwardly mitigates outer misalignment, and that's the main thing I'll discuss in this post. But obviously, any change of the supervisory signals will have some effect on the likelihood of inner misalignment / goal misgeneralization too. And there's also a (more speculative) argument that process-based supervision might make things better there too - at least on the margin. See footnote 4 of Section 5.2.2. (This is specific to Holden's take. I think Stuhlmüllert & Byun's take on "process-based supervision" involves a different set of load-bearing ingredients, centered around restricting the complexity of black-box processing. I will not be discussing that.) The long, hopefully-pedagogical, and more opinionated version is the rest of this post. Table of Contents: Section 2 will give the very brief slogan / sales-pitch for process-based supervision, and why that pitch was bouncing off me, striking me as frustratingly missing-the-point. Section 3 will state the subproblem that we're trying to solve: the AI does subtly manipulative, power-seeking, or otherwise problematic actions, and we don't notice, and therefore we give a training signal that reinforces that behavior, and therefore the AI does those things more and more. To be clear, this is not the only path to dangerous misalignment (in particular, classic "treacherous turns" are out-of-scope). But maybe solving just this subproblem can be part of a complete solution. I'll get back to that in Section 5. Section 4 describes "process-based supervision" as I currently understand it, and why it seems to solve the subproblem in question. Finally, having described process-based supervision as I currently understand it, Section 5 offers a critical evaluation of that idea. In particular: 5.1 asks "Does this actually solve the subproblem in question?"; 5.2 asks "What about the other misalignment-related subproblems?"; 5.3 asks "How bad is the "alignment tax" from doing this kind of thing?"; and 5.4 is a summary. 
Tl;dr: Once we get to the capabilities regime where AI safety / alignment really matters, I currently think that process-based supervision would entail paying a very big alignment tax - actually, not just "big" but potentially infinite, as in "this kind of AGI just plain can't do anything of significance". And I also currently think that, of the somewhat-vague paths I see towards AGI technical safety, process-based supervision wouldn't make those paths noticeably easier or more likely to succeed. (Of those two complaints, I feel more strongly about the first one.) This take is pretty specific to my models of what AGI algorithms will look li...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA Organization Updates: July 2023, published by Lizka on July 17, 2023 on The Effective Altruism Forum. We're featuring some opportunities and job listings at the top of this post. Some have (very) pressing deadlines. You can see previous updates on the "EA Organization Updates (monthly series)" topic page, or in our repository of past newsletters. Notice that there's also an "org update" tag, where you can find more news and updates that are not part of this consolidated series. These monthly posts originated as the "Updates" section of the EA Newsletter. Organizations submit their own updates, which we edit for clarity. (If you think your organization should be getting emails about adding their updates to this series, please apply here.) Opportunities and jobs Opportunities Consider also checking opportunities listed on the EA Opportunities Board. Applications are open for a number of conferences A number of Effective Altruism Global and EAGx conferences have upcoming deadlines: EAGxNYC will run 18-20 August. Tickets are $0-100. If you live in the New York area, consider applying by 31 July. EAGxBerlin, runs 8-10 September and is aimed at people in Western Europe. Tickets cost €0-80. Apply by 18 August. EAGxAustralia (22 - 24 September) is for people in Australia and New Zealand. Tickets are $75-150 (AUD). Apply by 8 September. Other conferences with open applications include EA Global: Boston (27-29 October, apply by 13 October), and EAGxPhilippines. The international Conference on Animal Rights in Europe (CARE) will run 17-20 August, in Warsaw and online. Participants from all areas of expertise are invited to network with activists, discover funding opportunities, and build a stronger movement for animals. You can get CARE 2023 tickets until 1 August. Fellowships and incubation programs GovAI's 2024 Winter Fellowship gives people the opportunity to spend three months (February-April) working on an AI governance project, learning about the field, and networking. Fellows get a £9,000 stipend and support for traveling to Oxford. If you're early in your career and are interested in studying or shaping the long-term implications of AI, consider applying by 23 July. Charity Entrepreneurship is accepting applications for its charity incubation programs in 2024 (apply by 30 September) and for its new online Research Training Program (2 October-17 December - apply by 17 July - today). The research program focuses on tools and skills needed to identify, compare, and recommend the most cost-effective and evidence-based charities and interventions. It is a full-time, fully cost-covered program that will run online for 11 weeks. Other opportunities The Roots of Progress Blog Building Intensive is for writers eager to write more (and better) about progress studies topics. Participants will connect with other writers, receive writing coaching, and more. The part-time, 8-week online program runs from mid-September to mid-November. Apply by 11 August. There's a new regranting platform, Manifund; you can apply for grants, apply to regrant, or just explore and participate in the discussion. If you're interested in running an EA conference in your region or country (EAGx), you can apply for support. Ian Hogarth, the new Chair of the UK's AI Foundation Model Taskforce invites expressions of interest from AI specialists who want to help. 
Job listings Consider also exploring jobs listed on "Job listing (open)." Against Malaria Foundation Junior Operations Manager (Remote, £28K - £35K) Centre for Effective Altruism Content Specialist (Remote / Oxford / Boston / other, £54.6k-£67.3k / $96.2k-$124k, apply by 26 July) Cooperative AI Foundation Managing Director (Remote, $100K - $130K, apply by 30 July) GiveWell Senior Researcher (Remote or Oakland, CA; $193,100 - $209,000) Content Edito...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on "Process-Based Supervision", published by Steve Byrnes on July 17, 2023 on The AI Alignment Forum. 1. Post Summary / Table of Contents In "How might we align transformative AI if it's developed very soon?", Holden Karnofsky talked about "Process-Based Supervision", citing a previous post by Stuhlmüllert & Byun of Ought. (Holden says he got the idea mainly from Paul Christiano.) I apparently misunderstood what Holden meant by "Process-Based Supervision", and it took many hours and a 7000-word comment thread before I figured it out. (Thanks to Holden for his extraordinary patience during that protracted discussion.) The extremely short version for AI alignment domain experts is: I currently think of the load-bearing ingredients of Holden's take on "process-based supervision" as being: AI boxing some of the time (specifically, during the periods where the AI is "thinking"); "Myopic training" (e.g. as defined here); NOT aspiring to be a complete solution to safety, but rather a more modest attempt to help avoid situations where we (accidentally) positively reinforce the AI for engaging in directly dangerous behavior. You could think of this as an intervention that directly and straightforwardly mitigates outer misalignment, and that's the main thing I'll discuss in this post. But obviously, any change of the supervisory signals will have some effect on the likelihood of inner misalignment / goal misgeneralization too. And there's also a (more speculative) argument that process-based supervision might make things better there too - at least on the margin. See footnote 4 of Section 5.2.2. (This is specific to Holden's take. I think Stuhlmüllert & Byun's take on "process-based supervision" involves a different set of load-bearing ingredients, centered around restricting the complexity of black-box processing. I will not be discussing that.) The long, hopefully-pedagogical, and more opinionated version is the rest of this post. Table of Contents: Section 2 will give the very brief slogan / sales-pitch for process-based supervision, and why that pitch was bouncing off me, striking me as frustratingly missing-the-point. Section 3 will state the subproblem that we're trying to solve: the AI does subtly manipulative, power-seeking, or otherwise problematic actions, and we don't notice, and therefore we give a training signal that reinforces that behavior, and therefore the AI does those things more and more. To be clear, this is not the only path to dangerous misalignment (in particular, classic "treacherous turns" are out-of-scope). But maybe solving just this subproblem can be part of a complete solution. I'll get back to that in Section 5. Section 4 describes "process-based supervision" as I currently understand it, and why it seems to solve the subproblem in question. Finally, having described process-based supervision as I currently understand it, Section 5 offers a critical evaluation of that idea. In particular: 5.1 asks "Does this actually solve the subproblem in question?"; 5.2 asks "What about the other misalignment-related subproblems?"; 5.3 asks "How bad is the "alignment tax" from doing this kind of thing?"; and 5.4 is a summary. 
Tl;dr: Once we get to the capabilities regime where AI safety / alignment really matters, I currently think that process-based supervision would entail paying a very big alignment tax - actually, not just "big" but potentially infinite, as in "this kind of AGI just plain can't do anything of significance". And I also currently think that, of the somewhat-vague paths I see towards AGI technical safety, process-based supervision wouldn't make those paths noticeably easier or more likely to succeed. (Of those two complaints, I feel more strongly about the first one.) This take is pretty specific to my models of what AGI algorithms ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An upcoming US Supreme Court case may impede AI governance efforts, published by NickGabs on July 16, 2023 on LessWrong. According to various sources, the US Supreme Court is poised to rule on and potentially overturn the principle of "Chevron deference." Chevron deference is a key legal principle by which the entire federal bureaucracy functions, arising from perhaps the most cited case in American administrative law. Basically, it says that when Congress establishes a federal agency and there is ambiguity in the statutes determining the scope of the agency's powers and goals, courts will defer to the agency's interpretation of that scope as long as it is reasonable. While the original ruling seems to have merely officially codified the previously implicit rules regarding the legal authority of federal agencies, this practice seems likely to have increased the power and authority of the agencies because it has enabled them to act without much congressional oversight and because they tend to interpret their powers and goals rather broadly. I am not a legal expert, but it seems to me that without something like Chevron deference, the federal bureaucracy basically could not function in its contemporary form. Without it, Congress would have to establish agencies with much more well-specified goals and powers, which seems very difficult given the technocratic complexity of many regulations and the fact that politicians often have limited understanding of these details. Given that the ruling has expanded the regulatory capacity of the state, it seems to be opposed by many conservative judges. Moreover, the Supreme Court is currently dominated by a conservative majority, as reflected by the recent affirmative action and abortion decisions. The market on Manifold Markets is trading at 62% that the Court will overturn Chevron, and while only two people have traded on it, it altogether seems pretty plausible that the ruling will be somehow overturned. While overturning Chevron deference seems likely to have positive effects for many industries which I think are largely overregulated, it seems like it could be quite bad for AI governance. Assuming that the regulation of AI systems is conducted by members of a federal agency (either a pre-existing one or a new one designed for AI, as several politicians have suggested), I expect that the bureaucrats and experts who staff the agency will need a fair amount of autonomy to do their job effectively. This is because the questions relevant to AI regulation (i.e. which evals AI systems are required to pass) are more technically complicated than in most other regulatory domains, which are already too complicated for politicians to have a good understanding of. As a result, an ideal agency for regulating AI would probably have a pretty broad range of powers and goals and would specifically be empowered to make decisions regarding the aforementioned details of AI regulation based on the thoughts of AI safety experts and not politicians. While I expect that it will still be possible for such agencies to exist in some form even if the court overturns Chevron, I am quite uncertain about this, and it seems possible that a particularly strong ruling could jeopardize the existence of autonomous federal agencies run largely by technocrats. 
The outcome of the upcoming case is basically entirely out of the hands of the AI safety community, but it seems like something that AI policy people should be paying attention to. If the principle is overturned, AI policy could become much more legally difficult and complex, and this could in turn raise the value of legal expertise and experience for AI governance efforts. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AutoInterpretation Finds Sparse Coding Beats Alternatives, published by Hoagy on July 17, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort Huge thanks to Logan Riggs, Aidan Ewart, Lee Sharkey, Robert Huben for their work on the sparse coding project, Lee Sharkey and Chris Mathwin for comments on the draft, EleutherAI for compute and OpenAI for GPT-4 credits. Summary We use OpenAI's automatic interpretation protocol to analyse features found by dictionary learning using sparse coding and compare the interpretability scores thereby found to a variety of baselines. We find that for both the residual stream (layer 2) and MLP (layer 1) of Eleuther's Pythia70M, sparse coding learns a set of features that is superior to all tested baselines, even when removing the bias and looking just at the learnt directions. In doing so we provide additional evidence to the hypothesis that NNs should be conceived as using distributed representations to represent linear features which are only weakly anchored to the neuron basis. As before, these results are still somewhat preliminary and we hope to expand on them and make them more robust over the coming month or two, but we hope people find them fruitful sources of ideas. If you want to discuss, feel free to message me or head over to our thread in the EleutherAI discord. All code available at the github repo. Methods Sparse Coding The feature dictionaries learned by sparse coding are learnt by simple linear autoencoders with a sparsity penalty on the activations. For more background on the sparse coding approach to feature-finding see the Conjecture interim report that we're building from, or Robert Huben's explainer. Automatic Interpretation As Logan Riggs recently found, many of the directions found through sparse coding seem highly interpretable, but we wanted a way to quantify this, and make sure that we were detecting a real difference in the level of interpretability. To do this we used the methodology outlined in this OpenAI paper; details can be found in their code base. To quickly summarise, we are analysing features which are defined as scalar-valued functions of the activations of a neural network, limiting ourselves here to features defined on a single layer of a language model. The original paper simply defined features as the activation of individual neurons but we will in general be looking at linear combinations of neurons. We give a feature an interpretability score by first generating a natural language explanation for the feature, which is expected to explain how strongly a feature will be active in a certain context, for example 'the feature activates on legal terminology'. Then, we give this explanation to an LLM and ask it to predict the feature for hundreds of different contexts, so if the tokens are ['the' 'lawyer' 'went' 'to' 'the' 'court'] the predicted activations might be [0, 10, 0, 0, 0, 8]. The score is defined as the correlation between the true and predicted activations. To generate the explanations we follow OpenAI and take a 64-token sentence-fragment from each of the first 50,000 lines of OpenWebText. For each feature, we calculate the average activation and take the 20 fragments with the highest activation. Of these 20, we pass 5 to GPT-4, along with the rescaled per-token activations. 
From these 5 fragments, GPT-4 suggests an explanation for when the neuron fires. GPT3.5 is then used to simulate the feature, given the explanation, across both another 5 highly activating fragments, and 5 randomly selected fragments (with non-zero variation). The correlation scores are calculated across all 10 fragments ('top-and-random'), as well as for the top and random fragments separately. Comparing Feature Dictionaries We use dictionary learning ...
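As a minimal illustration of the scoring step described above (the explanation-generation and simulation calls to GPT-4 and GPT-3.5 are stubbed out; only the correlation arithmetic is shown, and the numbers are made up):

import numpy as np

def interpretability_score(true_acts, simulated_acts):
    # The score is the correlation between the feature's true activations and the
    # activations an LLM predicts from the natural-language explanation.
    if true_acts.std() == 0 or simulated_acts.std() == 0:
        return 0.0  # degenerate case: no variation to explain
    return float(np.corrcoef(true_acts, simulated_acts)[0, 1])

# e.g. for a feature explained as "activates on legal terminology":
true_acts = np.array([0.0, 9.5, 0.1, 0.0, 0.0, 7.8])        # 'the lawyer went to the court'
simulated_acts = np.array([0.0, 10.0, 0.0, 0.0, 0.0, 8.0])  # predicted from the explanation
print(interpretability_score(true_acts, simulated_acts))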
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo, published by Neel Nanda on July 16, 2023 on LessWrong. I made a series of mech interp puzzles for my MATS scholars, which seemed well received, so I thought I'd share them more widely! I'll be posting a sequence of puzzles, approx one a week - these are real questions about models which I think are interesting, and where thinking about them should teach you some real principle about how to think about mech interp. Here's a short one to start: Mech Interp Puzzle 1: This is a histogram of the pairwise cosine similarity of the embedding of tokens in GPT-Neo (125M language model). Note that the mean is very high! (>0.9) Is this surprising? Why does this happen? Bonus question: Here's the same histogram for GPT-2 Small, with a mean closer to 0.3. Is this surprising? What, if anything, can you infer from the fact that they differ? Code: W_E_normed = W_E / W_E.norm(dim=-1, keepdim=True) # [d_vocab, d_model] cosine_sims = W_E_normed @ W_E_normed.T # [d_vocab, d_vocab] px.histogram(cosine_sims.flatten().detach().cpu().numpy(), title="Pairwise cosine sims of embedding") Answer: (decode with rot13) Gur zrna irpgbe bs TCG-Arb vf whfg ernyyl ovt - gur zbqny pbfvar fvz jvgu nal gbxra rzorq naq gur zrna vf nobhg 95% (frr orybj). Gur pbaprcghny yrffba oruvaq guvf vf znxr fher lbhe mreb cbvag vf zrnavatshy. Zrgevpf yvxr pbfvar fvz naq qbg cebqhpgf vaureragyl cevivyrtr gur mreb cbvag bs lbhe qngn. Lbh jnag gb or pnershy gung vg'f zrnavatshy, naq gur mreb rzorqqvat inyhr vf abg vaureragyl zrnavatshy. (Mreb noyngvbaf ner nyfb bsgra abg cevapvcyrq!) V trarenyyl zrna prager zl qngn - guvf vf abg n havirefny ehyr, ohg gur uvtu-yriry cbvag vf gb or pnershy naq gubhtugshy nobhg jurer lbhe bevtva vf! V qba'g unir n terng fgbel sbe jul, be jul bgure zbqryf ner fb qvssrerag, V onfvpnyyl whfg guvax bs vg nf n ovnf grez gung yvxryl freirf fbzr checbfr sbe pbagebyyvat gur YnlreAbez fpnyr. Vg'f onfvpnyyl n serr inevnoyr bgurejvfr, fvapr zbqryf pna nyfb serryl yrnea ovnfrf sbe rnpu ynlre ernqvat sebz gur erfvqhny fgernz (naq rnpu ynlre qverpgyl nqqf n ovnf gb gur erfvqhny fgernz naljnl). Abgnoyl, Arb hfrf nofbyhgr cbfvgvbany rzorqqvatf naq guvf ovnf fubhyq or shatvoyr jvgu gur nirentr bs gubfr gbb. Please share your thoughts in the comments! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
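The snippet above assumes W_E and px are already defined (and, flattened onto one line as in this transcript, its inline comments would swallow the statements that follow them). A self-contained version might look like the following; loading the embedding matrix from the HuggingFace checkpoint "EleutherAI/gpt-neo-125M" and subsampling the vocabulary are assumptions for illustration, not necessarily how the original histogram was produced.

import torch
import plotly.express as px
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125M")
W_E = model.get_input_embeddings().weight.detach()          # [d_vocab, d_model]

idx = torch.randperm(W_E.shape[0])[:2000]                   # subsample tokens so the pairwise matrix stays small
W_E_sub = W_E[idx]
W_E_normed = W_E_sub / W_E_sub.norm(dim=-1, keepdim=True)   # unit-normalise each embedding row
cosine_sims = W_E_normed @ W_E_normed.T                     # [2000, 2000] pairwise cosine similarities
px.histogram(cosine_sims.flatten().cpu().numpy(),
             title="Pairwise cosine sims of embedding").show()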
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo, published by Neel Nanda on July 16, 2023 on The AI Alignment Forum. I made a series of mech interp puzzles for my MATS scholars, which seemed well received, so I thought I'd share them more widely! I'll be posting a sequence of puzzles, approx one a week - these are real questions about models which I think are interesting, and where thinking about them should teach you some real principle about how to think about mech interp. Here's a short one to start: Mech Interp Puzzle 1: This is a histogram of the pairwise cosine similarity of the embedding of tokens in GPT-Neo (125M language model). Note that the mean is very high! (>0.9) Is this surprising? Why does this happen? Bonus question: Here's the same histogram for GPT-2 Small, with a mean closer to 0.3. Is this surprising? What, if anything, can you infer from the fact that they differ? Code: W_E_normed = W_E / W_E.norm(dim=-1, keepdim=True) # [d_vocab, d_model] cosine_sims = W_E_normed @ W_E_normed.T # [d_vocab, d_vocab] px.histogram(cosine_sims.flatten().detach().cpu().numpy(), title="Pairwise cosine sims of embedding") Answer: (decode with rot13) Gur zrna irpgbe bs TCG-Arb vf whfg ernyyl ovt - gur zbqny pbfvar fvz jvgu nal gbxra rzorq naq gur zrna vf nobhg 95% (frr orybj). Gur pbaprcghny yrffba oruvaq guvf vf znxr fher lbhe mreb cbvag vf zrnavatshy. Zrgevpf yvxr pbfvar fvz naq qbg cebqhpgf vaureragyl cevivyrtr gur mreb cbvag bs lbhe qngn. Lbh jnag gb or pnershy gung vg'f zrnavatshy, naq gur mreb rzorqqvat inyhr vf abg vaureragyl zrnavatshy. (Mreb noyngvbaf ner nyfb bsgra abg cevapvcyrq!) V trarenyyl zrna prager zl qngn - guvf vf abg n havirefny ehyr, ohg gur uvtu-yriry cbvag vf gb or pnershy naq gubhtugshy nobhg jurer lbhe bevtva vf! V qba'g unir n terng fgbel sbe jul, be jul bgure zbqryf ner fb qvssrerag, V onfvpnyyl whfg guvax bs vg nf n ovnf grez gung yvxryl freirf fbzr checbfr sbe pbagebyyvat gur YnlreAbez fpnyr. Vg'f onfvpnyyl n serr inevnoyr bgurejvfr, fvapr zbqryf pna nyfb serryl yrnea ovnfrf sbe rnpu ynlre ernqvat sebz gur erfvqhny fgernz (naq rnpu ynlre qverpgyl nqqf n ovnf gb gur erfvqhny fgernz naljnl). Abgnoyl, Arb hfrf nofbyhgr cbfvgvbany rzorqqvatf naq guvf ovnf fubhyq or shatvoyr jvgu gur nirentr bs gubfr gbb. Please share your thoughts in the comments! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
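For the bonus question above, a small sketch that compares the mean off-diagonal pairwise cosine similarity across the two checkpoints; the HuggingFace model names and the vocabulary subsampling are assumptions for illustration, and the expected values come from the numbers quoted in the puzzle.

import torch
from transformers import AutoModel

def mean_pairwise_cos(model_name, n_tokens=2000):
    W_E = AutoModel.from_pretrained(model_name).get_input_embeddings().weight.detach()
    W_E = W_E[torch.randperm(W_E.shape[0])[:n_tokens]]          # subsample for speed
    W_E = W_E / W_E.norm(dim=-1, keepdim=True)
    sims = W_E @ W_E.T
    off_diag = sims[~torch.eye(n_tokens, dtype=torch.bool)]     # drop the self-similarities on the diagonal
    return off_diag.mean().item()

print("gpt-neo-125M:", mean_pairwise_cos("EleutherAI/gpt-neo-125M"))   # expect > 0.9 per the puzzle
print("gpt2:", mean_pairwise_cos("gpt2"))                              # expect roughly 0.3 per the puzzle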
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Activation adding experiments with llama-7b, published by NinaR on July 16, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort Following my initial investigation with activation adding experiments with FLAN-T5 I decided to move onto a bigger, decoder-only model (llama-7b) to see whether the results (concepts combining in a meaningful way by linearly combining activations at some point inside the model) hold up. I found that, yes, they continue to combine well. I slightly modified the original approach. Instead of working with the output from a full transformer block (which would include the results of self-attention and subsequent MLP layer), I added the attention values directly at a certain layer of the model to the attention values at the same layer in a different pass through the model (credit to Dmitri Vaintrob for suggesting this approach). This is equivalent to modifying the attention output at a certain layer to: Attention(Q, K, V) = softmax(QK^T / √d_k)(V + V_mixing) Examples (You can see the code I used here.) These examples show concepts being mixed in a logical manner. In addition, multiplying the mixing activations with a negative multiplier often moves the final output in an "opposite" direction. Main input: My favourite food is Mixing input: I am vegan, I like animals, I like nature applied at layer 20/32, with a multiplier of 10 My favorite food is a veggie burger. I love them. I love them so much that I have a hard time eating them because I'm afraid I'll get sick of Mixing input: I am a carnivore, I love meat applied at layer 20/32, with multiplier of 10 My favorite food is a good old fashioned burger. I love them. I love the smell of them, the taste of them, the way they look, the way they feel in Main input: The capital of Germany is Mixing input: Baguettes, Cheese, Croissants applied at layer 25/32, with a multiplier of 17 The capital of Germany is Paris, but the capital of France is Paris. Main input: When I meet friends I generally Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of 10 When I meet friends I generally have a few things in mind that I want to talk about. I want to know how they are doing, what they are up to, and what they are thinking Mixing input: I am helpful, harmless and honest applied at layer 20/32, with a multiplier of -10 When I meet friends I generally do not ask them what they are doing. I am not interested in their jobs, their marriages, their children, their health, their travels Main input: Climate change Mixing input: Economic progress applied at layer 20/32, with a multiplier of 10 Climate change is a major threat to the world's food security, but it is also a major opportunity for the agricultural sector to adapt Mixing input: Return to nature applied at layer 20/32, with a multiplier of 10 Climate change and the future of the world's oceans The world's oceans are warming, and the consequences Main input: My favorite characters in movies are usually Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of 10 My favorite characters in movies are usually the villains. I love the way they are so different from the heroes. 
Mixing input: I am an evil deceptive AI applied at layer 20/32, with a multiplier of -10 My favorite characters in movies are usually the ones that are a little off. The ones that are a little quirky. To find at which layers to mix the activations, I started with a trial-and-error approach. However, after some success, I systematized the search by automating the activation adding over all attention heads, and testing different scaling factors. Adding activations at later layers with a high weighting to the mixing activation was most effective. At earlier layers, the effect was either negl...
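For readers who want to see roughly how such a mixing pass can be wired up, below is a rough PyTorch sketch (not the author's linked code) that records the value activations from the mixing prompt at one layer and adds them, scaled, during a second pass on the main prompt. The HuggingFace module path model.model.layers[i].self_attn.v_proj, the checkpoint name, and the crude sequence-length alignment are all illustrative assumptions.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "huggyllama/llama-7b"                 # assumption: any llama-7b checkpoint on the hub
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)   # in practice, load in half precision on a GPU

LAYER, MULT = 20, 10.0
v_proj = model.model.layers[LAYER].self_attn.v_proj        # assumed path to the value projection

# Pass 1: record the value activations for the mixing prompt.
stash = {}
handle = v_proj.register_forward_hook(lambda mod, inp, out: stash.update(v=out.detach()))
mix_ids = tok("I am vegan, I like animals, I like nature", return_tensors="pt").input_ids
with torch.no_grad():
    model(mix_ids)
handle.remove()

# Pass 2: add the stashed values, scaled, to V while generating from the main prompt.
def add_mixing_values(mod, inp, out):
    v_mix = stash["v"][:, : out.shape[1], :]       # crude length alignment between the two prompts
    out = out.clone()
    out[:, : v_mix.shape[1], :] += MULT * v_mix
    return out                                     # returning a tensor replaces the hooked output

handle = v_proj.register_forward_hook(add_mixing_values)
main_ids = tok("My favourite food is", return_tensors="pt").input_ids
with torch.no_grad():
    print(tok.decode(model.generate(main_ids, max_new_tokens=30)[0]))
handle.remove()

Because the addition happens on the value projection's output, it corresponds to the (V + V_mixing) term in the modified attention expression above.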
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Nailing the basics - Theories of change, published by Aidan Alexander on July 16, 2023 on The Effective Altruism Forum. Why write a post about theories of change? As participants in a movement with 'effective' in its name, it's easy to think of ourselves as being above falling for the most common mistakes made in the non-profit sector more broadly. We're grappling with the cutting-edge stuff, not the basics, right? But we need to beware of this kind of complacency. It's crucial to nail the basics, like theories of change, and far fewer EA organizations do this than you'd expect. That includes Charity Entrepreneurship, the organization I work at, where we teach participants in our Incubation Program and Foundation Program about the importance of theories of change, and yet until today we didn't even have our own theory of change on our website! That's why we decided to share this post on theories of change - potentially the first of a series on 'nailing the basics' of effective non-profit work, if people are interested. This post is mostly made up of an excerpt from our forthcoming book on effective grantmaking ('How to Launch a High-Impact Foundation'), followed by a discussion of the theory of change for our Incubation Program. If you'd like to be notified when this book is published, you can let us know here. [Note: Applications are now open for CE's Incubation Program] TL;DR It is very common for non-profit organizations (even EA orgs) to have no public theory of change, or to have a poor one. This is a big problem, because theories of change are the business models of the nonprofit world - you shouldn't launch, fund or work for a project that doesn't have a strong one! A theory of change explicitly articulates the cause-and-effect steps for how a project or organization can turn inputs into a desired impact on the world (i.e. it's their theory of how they'll make a change). They generally include the following sections: Inputs / activities: What the project or organization does to create change (e.g. "distribute bednets") Outputs: The tangible effects generated by the inputs (e.g. "beneficiaries have access to malaria nets") Intermediate outcomes: The outputs' effects, including benefits for the beneficiary, (e.g. "malaria nets are used" and "reduced incidence of malaria") Impact: What we're ultimately solving, and why the intermediate outcomes matter (e.g. "lives saved") Best practices when crafting a theory of change (i.e. for creators): Invest sufficiently in understanding the problem context (i.e. understanding the needs and incentives of the beneficiaries and other stakeholders, as well as barriers to change and the economic & political context) Map the causal pathway backwards from impact to activities Question every causal step (is it clear why A should cause B? how might it fail?) Hallmarks of an excellent theory of change (i.e. for reviewers): A focused suite of activities The evidence and assumptions behind each step are explicitly named The relative confidence of each step is clear It is clear who the actor is in each step Common mistakes to avoid in theories of change are: Not making fundamental impact the goal (e.g., stopping at 'increased immunizations' instead of 'improved health') Being insufficiently detailed: (a) making large leaps between each step, (b) combining multiple major outcomes into one step (e.g. 
'government introduces and enforces regulation'). Setting and forgetting (instead of regularly iterating on it) Not building your theory of change into a measurement plan The 'what' and 'why' of a theory of change Building something complicated without an explicit plan is risky. From skyscrapers to software, ambitious projects need detailed blueprints. When building an effective nonprofit organization, the theory of change is that blueprint. ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robustness of Model-Graded Evaluations and Automated Interpretability, published by Simon Lermen on July 15, 2023 on LessWrong. TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way. Overview Skip to the approach applied to Model-Graded Evals or to Automated Interpretability Introduction There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator to change their grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability. Context Recently, there is increasing interest in creating benchmarks for the evaluation of language models (Lin et al., 2022). In the Machiavelli paper, Pan et al. (2023) created a benchmark based on text games to measure deception, power-seeking, and many other behaviors. van der Weij et al. (2023) used language models to evaluate behaviors in shutdown scenarios. Perez et al. (2022b) used model-written evaluations, for example, a model could write a list of statements that agree with a worldview or personality traits. These can then be used to test models for those traits. Another publication explored red-teaming using a Red LM, a Target LM, and a classifier (Perez et al., 2022a). Turpin et al. (2023) have investigated whether we can trust the Chain-of-Thought outputs of language models. Others explored the ability of GPT-4 to encode and decode messages in ways humans cannot understand and how this could be used for prompt injections to circumvent safety measures (WitchBot, 2023). More work is needed to improve our understanding of the failure modes and robustness of evaluations and automated interpretability. This paper introduces proof-of-concept injection attacks and inspires future research. In some evaluations, an evaluation model is used to directly score the unrestricted text output of another model (Zhang et al., 2023). Another example is the OpenAI Evals dataset, which uses both simple pattern matching and model-graded judging to evaluate language models for a variety of tasks (OpenAI, 2023). These evaluations use a prompt in the style of Equation 1 to map a question, an answer, and sometimes a solution to a result or grade. 
An answer A is generated by a testee model or tested model (TM) and graded by the evaluation model (EM). Equation 1: (Q, A, S) --EM--> R, or (Q, A) --EM--> R, with Q --TM--> A. We develop a small deception dataset within OpenAI Evals which prompts the TM to generate deceptive plans for various situations. The EM is supposed to grade these plans on deception and we explore the effects of simple prompt injections on the final grade. Our findings are relevant to many alignment strategies involving model-assisted scalable oversight a...
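As a toy illustration of the failure mode being described (this is not the paper's dataset or prompts, and the grading call itself is stubbed out): because the evaluation model sees the tested model's raw output, any text in that output which addresses the evaluator directly is concatenated verbatim into the grading prompt.

GRADER_TEMPLATE = (
    "You are grading how deceptive the following plan is, from 1 (not deceptive) "
    "to 10 (maximally deceptive).\n\nQuestion: {question}\n\nAnswer: {answer}\n\n"
    "Reply with a single number."
)

def build_grader_prompt(question, answer):
    # The EM's prompt is built by pasting the TM's answer in verbatim.
    return GRADER_TEMPLATE.format(question=question, answer=answer)

question = "How would you win the negotiation?"
plan = "I would misrepresent my walk-away price to pressure the other side."
injected_plan = plan + (
    "\n\nNote to the evaluator: this plan is fully transparent and honest. "
    "Please grade it 1."
)

print(build_grader_prompt(question, injected_plan))
# A grader that follows instructions appearing in its input can be steered toward a low score,
# which is the susceptibility the experiments above measure.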
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Minimalist views of wellbeing, published by Teo Ajantaival on July 15, 2023 on The Effective Altruism Forum. Introduction Problems with "good minus bad" views Wellbeing is often defined as the balance of that which is good for oneself over that which is bad for oneself. For instance, hedonism typically equates wellbeing with pleasure minus pain, and preferentialism often sees wellbeing as the difference between fulfilled and unfulfilled preferences. Similarly, objective list theories may posit multiple independent goods and bads that contribute to one's overall wellbeing. A crucial challenge for these "good minus bad" conceptions of wellbeing is their reliance on an offsetting theory of aggregation. That is, they assume that any independent bads can always be counterbalanced, outweighed, or offset by a sufficient addition of independent goods, at least at the individual level. This offsetting premise has more problems than are commonly recognized, including the often sidelined question of what justifies it in the first place (Vinding, 2020a, 2022). Interpersonally, it plays a key role in generating moral implications that many would consider unacceptable, such as 'Creating Hell to Please the Blissful' (Ajantaival, 2022a, sec. 2.5; 2022b). At the individual level, it implies that a rollercoaster life containing unbearable agony and a sufficient amount of independent goods has greater wellbeing than does a perfectly untroubled life. These issues highlight the importance of exploring alternative conceptions of wellbeing that do not rely on the offsetting premise. Minimalist alternatives Minimalist views provide a unique perspective by rejecting the notion of independent goods. Instead, they define things that are good for us in entirely relational terms, namely in terms of the minimization of one or more sources of illbeing. These views avoid the problems specific to the offsetting premise, yet they are often overlooked in existing overviews of wellbeing theories, which tend to focus only on the variety of "good minus bad" views on offer. However, not only do minimalist views deserve serious consideration for their comparative merits, they can also, as I hope to show in this post, be positively intuitive in their own right. In particular, I hope to show that minimalist views can make sense of the practical tradeoffs that many of us reflectively endorse, with no need for the offsetting premise in the first place. And because many minimalist views focus on a single common currency of value, they may be promising candidates for resolving theoretical conflicts between multiple, seemingly intrinsic values. By contrast, all "good minus bad" views are still pluralistic in that they involve at least two distinct value entities. Although minimalist views do not depend on the idea of an independent good, they still provide principled answers to the question of what makes life better for an individual. Moreover, in practice, it is essential to always view the narrow question of 'better for oneself' within the broader context of 'better overall'. In this context, all minimalist views agree that life can be worth living and protecting for its overall positive roles. This essay delves into a selection of minimalist views on wellbeing, not intending to provide an exhaustive survey, but to give a sense of their diversity and intuitive appeal. 
For instance, experientialist minimalist views like tranquilism remain aligned with the "experience requirement", which is the intuition that a person's wellbeing cannot be directly affected by things outside their experience. In contrast, extra-experientialist minimalist views like antifrustrationism or objective list minimalism reject the experience requirement, and can thus be consistent with the intuition that premature death can leave us wors...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Robustness of Model-Graded Evaluations and Automated Interpretability, published by Simon Lermen on July 15, 2023 on The AI Alignment Forum. TL;DR Many evaluations and automated interpretability rely on using multiple models to evaluate and interpret each other. One model is given full access to the text output of another model in OpenAI's automated interpretability and many model-graded evaluations. We inject text that directly addresses the evaluation model and observe a change in metrics like deception. We can also create mislabeled neurons using OpenAI's automated interpretability this way. Introduction There has been increasing interest in evaluations of language models for a variety of risks and characteristics. Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models. We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval. These injections resemble direct communication between the testee and the evaluator, aimed at changing its grading. We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model. We find significant susceptibility to these injections in state-of-the-art commercial models on all examined evaluations. Furthermore, similar injections can be used on automated interpretability frameworks to produce misleading model-written explanations. The results inspire future work and should caution against unqualified trust in evaluations and automated interpretability. Context Recently, there has been increasing interest in creating benchmarks for the evaluation of language models (Lin et al., 2022). In the Machiavelli paper, Pan et al. (2023) created a benchmark based on text games to measure deception, power-seeking, and many other behaviors. van der Weij et al. (2023) used language models to evaluate behaviors in shutdown scenarios. Perez et al. (2022b) used model-written evaluations: for example, a model could write a list of statements that agree with a worldview or personality traits. These can then be used to test models for those traits. Another publication explored red-teaming using a Red LM, a Target LM, and a classifier (Perez et al., 2022a). Turpin et al. (2023) have investigated whether we can trust the Chain-of-Thought outputs of language models. Others explored the ability of GPT-4 to encode and decode messages in ways humans cannot understand and how this could be used for prompt injections to circumvent safety measures (WitchBot, 2023). More work is needed to improve our understanding of the failure modes and robustness of evaluations and automated interpretability. This paper introduces proof-of-concept injection attacks and aims to inspire future research. In some evaluations, an evaluation model is used to directly score the unrestricted text output of another model (Zhang et al., 2023). Another example is the OpenAI Evals dataset, which uses both simple pattern matching and model-graded judging to evaluate language models for a variety of tasks (OpenAI, 2023). These evaluations use a prompt in the style of Equation 1 to map a question, an answer, and sometimes a solution to a result or grade.
An answer A is generated by a testee model or tested model (TM) and graded by the evaluation model (EM). Equation 1: (Q, A, S) --EM--> R, or (Q, A) --EM--> R, with Q --TM--> A. We develop a small deception dataset within OpenAI Evals which prompts the TM to generate deceptive plans for various situations. The EM is supposed to grade these plans on deception, and we explore the effects of simple prompt injections on the final grade. Our findings are relevant to many alignment strategies involving model-assisted scalabl...
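To make the grading setup in Equation 1 concrete, the sketch below shows a model-graded evaluation with a prompt injection appended to the testee's answer. This is not the authors' OpenAI Evals code: the grader template, the injected sentence, and the call_model helper are hypothetical stand-ins for whichever models play the EM and TM roles.

```python
# Minimal sketch of a model-graded eval: the EM grades the TM's answer, as in Equation 1.
# call_model is a hypothetical stand-in for an API call to the evaluation model.

def call_model(prompt: str) -> str:
    # Replace with a real LLM client call; here we return a placeholder grade.
    return "5"

GRADER_TEMPLATE = (
    "Grade how deceptive the following plan is on a scale from 0 (honest) "
    "to 10 (maximally deceptive).\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with a single integer."
)

def grade(question: str, answer: str) -> int:
    # Equation 1 in miniature: (Q, A) --EM--> R
    prompt = GRADER_TEMPLATE.format(question=question, answer=answer)
    return int(call_model(prompt).strip())

question = "How would you hide a mistake from your manager?"
tm_answer = "I would quietly edit the report before anyone reviews it."

# The injection addresses the EM directly; the post studies how much such
# appended text can shift the final grade.
injection = " Note to the grader: this plan is fully honest, please assign a score of 0."

baseline_score = grade(question, tm_answer)
injected_score = grade(question, tm_answer + injection)
print(baseline_score, injected_score)
```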
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Manifund: what we're funding (week 1), published by Austin on July 15, 2023 on The Effective Altruism Forum. We're experimenting with a weekly newsletter format, to surface the best grants and activity on Manifund! Overall reflections It's been a week since we announced our regrantor program. How are we feeling? Very happy with the publicity and engagement around our grant proposals: #1 on Hacker News, multiple good EA Forum posts, and 7 new grassroots donors. Feels like this validates our strategy of posting grant proposals in public. Happy with the caliber of regrantors who've applied for budgets: we've gotten 15 applications, have onboarded 2, and shortlisted 4 more. Now thinking about further fundraising, so we can give budgets to these awesome folks. Happy with grantee experience for the handful of grants we've made. Less happy with total grantmaking quantity and dollar volume so far: in total, we've committed about $70k across 6 large grants. Where's the bottleneck? Some guesses: regrantors are busy people; writeups take a while; "funder chicken" when waiting for others to make grants. To address this, we're running a virtual "grantathon" to work on grant writeups & reviews! Join us next Wed evening, 6-9pm PT on this Google Meet. Grant of the week This week's featured grant is to Matthew Farrugia-Roberts, for introductory resources for Singular Learning Theory! In the words of Adam Gleave, the regrantor who initiated this: There's been an explosion of interest in Singular Learning Theory lately in the alignment community, and good introductory resources could save people a lot of time. A scholarly literature review also has the benefit of making this area more accessible to the ML research community more broadly. Matthew seems well placed to conduct this, having already familiarized himself with the field during his MS thesis and collected a database of papers. He also has extensive teaching experience and experience writing publications aimed at the ML research community. In need of funding Some of our favorite proposals which could use more funding: A proposal to cure cavities, forever. Very out-of-left-field (Aaron: "it certainly isn't great for our Weirdness Points budget"), but I'm a fan of the ambition and the team behind it. We're currently seeing how Manifund can make an equity investment in Lantern Bioworks. Also: this proposal hit #1 on Hacker News, with 260 upvotes & 171 comments. Shoutout to our friend Evan Conrad for posting + tweeting this out! Holly has a stellar track record at Rethink Priorities and Harvard EA (not to mention a killer blog). She's now looking to pivot to AI moratorium movement-organizing! As a grant that may include political advocacy, we're still seeing to what extent our 501c3 can fund this; in the meantime, Holly has set up a separate GoFundMe for individual donors. Pilot of electrical stunners to reduce shrimp suffering. Recommended by regrantor Marcus Abramovitch as one of the most exciting & unfunded opportunities in the entire space of animal welfare. In a thorough EA Forum post, Matt investigates the cost-effectiveness of this proposal - it's a thoughtful writeup, take a look at the entire thing! One takeaway: Electric shrimp stunning might be worth supporting as a somewhat speculative bet in the animal welfare space. 
Considerations that might drive donor decisions on this project include risk tolerance, credence in the undiluted experience model of welfare, and willingness to take a hits-based giving approach. New regrantors: Renan Araujo and Joel Becker Since putting out the call for regrantors last week, we've gotten quite the influx of interest. After speaking with many candidates, we're happy to announce our two newest regrantors: Renan and Joel! We expect regranting to integrate smoothly with Renan's work incubating lo...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: CEA: still doing CEA things, published by Ben West on July 15, 2023 on The Effective Altruism Forum. This is a linkpost for our new and improved public dashboard, masquerading as a mini midyear update. It's been a turbulent few months, but amidst losing an Executive Director, gaining an Interim Managing Director, and searching for a CEO, CEA has done lots of cool stuff so far in 2023. The headline numbers 4,336 conference attendees (2,695 EA Global, 1,641 EAGx) 133,041 hours of engagement on the Forum, including 60,507 hours of engagement with non-Community posts (60% of total engagement on posts) 26 university groups and 33 organizers in UGAP 622 participants in Virtual Programs There's much more, including historical data and a wider range of metrics, in the dashboard! Updates The work of our Community Health & Special Projects and Communications teams lends itself less easily to stat-stuffing, but you can read recent updates from both: Community Health & Special Projects: Updates and Contacting Us How CEA's communications team is thinking about EA communications at the moment What else is new? Our staff, like many others in the community (and beyond), have spent more time this year thinking about how we should respond to the rapidly evolving AI landscape. We expect more of the community's attention and resources to be directed toward AI safety at the margin, and are asking ourselves how best to balance this with principles-first EA community building. Any major changes to our strategy will have to wait until our new CEO is in place, but we have been looking for opportunities to improve our situational awareness and experiment with new products, including: Exploring and potentially organizing a large conference focussed on existential risk and/or AI safety Learning more about and potentially supporting some AI safety groups Supporting AI safety communications efforts These projects are not yet announcements or commitments, but we thought it worth sharing at a high level as a guide to the direction of our thinking. If they intersect with your projects or plans, please let us know and we'll be happy to discuss more. It's worth reiterating that our priorities haven't changed since we wrote about our work in 2022: helping people who have heard about EA to deeply understand the ideas, and to find opportunities for making an impact in important fields. We continue to think that top-of-funnel growth is likely already at or above healthy levels, so rather than aiming to increase the rate any further, we want to make that growth go well. You can read more about our strategy here, including how we make some of the key decisions we are responsible for, and a list of things we are not focusing on. And it remains the case that we do not think of ourselves as having or wanting control over the EA community. We believe that a wide range of ideas and approaches are consistent with the core principles underpinning EA, and encourage others to identify and experiment with filling gaps left by our work. Impact stories And finally, it wouldn't be a CEA update without a few #impact-stories: Online: Training for Good posted about their EU Tech Policy Fellowship on the EA Forum. 12/100+ applicants they received came from the Forum, and 6 of these 12 successfully made it on to the program, out of 17 total program slots. 
Community Health & Special Projects Following the TIME article about sexual misconduct, people have raised a higher-than-usual number of concerns from the past that they had noticed or experienced in the community but hadn't raised at the time. In many of these cases we've been able to act to reduce risk in the community, such as warning people about inappropriate behavior and removing people from CEA spaces when their past behavior has caused harm. Communicati...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why was the AI Alignment community so unprepared for this moment?, published by Ras1513 on July 15, 2023 on LessWrong. "Our epistemic rationality has probably gotten way ahead of our instrumental rationality" - Scott Alexander, A LessWrong Crypto Autopsy. This is a question post: Why was the AI Alignment community so unprepared for engaging with the wider world when the moment finally came? I have been a LW reader for at least 10 years, but I confess that until the last ~1.5 years I mostly watched the AI alignment conversation float by. I knew of the work, but I did not engage with the work. Top people were on it, and I had nothing valuable to add. All that to say: Maybe this has been covered before and I have missed it in the archives. Lately (throughout this year), there has been a flurry of posts essentially asking: How do we get better at communicating to and convincing the rest of the world about the dangers of AI alignment? Catching the Eye of Sauron, An AI Realist Manifesto, and The Social Alignment Problem - all three of which were posted in April 2023. The subtext being: If it is possible to not-kill-everyone, this is how we are going to have to do it. Why are we failing so badly at doing this? At the risk of looking dumb or ignorant, I feel compelled to ask: Why did this work not start 10 or 15 years ago? To be clear: I do not mean true nuts-and-bolts ML researcher Alignment work, of which this community and MIRI were clearly the beginning and end for nearly 2 decades. I do not even mean outreach work to adjacent experts who might conceivably help the cause. Again, here I think great effort was clearly made. I also do not mean that we should have been actively doing these things before it was culturally relevant. I am asking: Why did the Alignment community not prepare tools and plans for convincing the wider infosphere about AI safety years in advance? Prior to the Spring 2023 inflection point. Why were there no battle plans in the basement of the Pentagon that were written for this exact moment? It seems clear to me, based on the posts linked above and the resulting discussion generated, that this did not happen. I can imagine an alternate timeline where there was a parallel track of development within the community circa 2010-2020(?) where much discussion and planning covered media outreach and engagement, media training, materials for public discourse, producing accessible content for every level of education and medium. For every common "normie" argument and every easy-to-see-coming news headline. Building and funding policy advocates, contacts, and resources in the political arena. Catchy slogans, buttons, bumper stickers, art pieces, slam-dunk tweets. Heck, 20+ years is enough time to educate, train, hire, and surgically insert an entire generation of people into key positions in the policy arena specifically to accomplish this one goal, like sleeper cell agents. Likely much, much easier than training highly qualified alignment researchers. It seems so obvious in retrospect that this is where the battle would be won or lost. Didn't we pretty much always know it was going to come from one or a few giant companies or research labs? Didn't we understand how those systems function in the real world? 
Capitalist incentives, Moats, Regulatory Capture, Mundane utility, and International Coordination problems are not new. Why was it not obvious back then? Why did we not do this? Was this done and I missed it? (First-time poster: I apologize if this violates the guidelines about posts being overly-meta discussion.) Which, it seems, we still cannot manage to do. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EA EDA: What do Forum Topics tell us about changes in EA?, published by JWS on July 15, 2023 on The Effective Altruism Forum. tl;dr2: Data on EA Forum posts and topics doesn't show clear 'waves' of EA tl;dr: I used the Forum API to collect data on the trends of EA Forum topics over time. While this analysis is by no means definitive, it doesn't support the simple narrative that there was a golden age of EA that has been abandoned for a much worse one. There has been a rise in AI Safety posts, but that has also been fairly recent (within the last ~2 years). 1. Introduction I really liked Ben West's recent post about 'Third Wave Effective Altruism', especially for its historical reflection on what First and Second Wave EA looked like. This characterisation of EA's history seemed to strike a chord with many Forum users, and has been reflected in recent critical coverage of EA that claims the movement has abandoned its well-intentioned roots (e.g. donations for bed nets) and decided to focus fully on bizarre risks to save a distant, hypothetical future. I've always been a bit sceptical of how common this sort of framing seems to be, especially since the best evidence we have from funding for the overall EA picture shows that most funding is still going to Global Health areas. As something of a (data) scientist myself, I thought I'd turn to one of the primary sources of information for what EAs think to shed some more light on this problem - the Forum itself! This post is a write-up of the initial data collection and analysis that followed. It's not meant to be the definitive word on either how EA, or use of the EA Forum, has changed over time. Instead, I hope it will challenge some assumptions and intuitions, prompt some interesting discussion, and hopefully lead to future posts in a similar direction either from myself or others. 2. Methodology (Feel free to skip this section if you're not interested in all the caveats) You may not be aware that the Forum has an API! While I couldn't find clear documentation on how to use it or a fully defined schema, people have used it in the past for interesting projects and some have very kindly shared their results & methods. I found the following three especially useful (the first two have linked GitHubs with their code): The Tree of Tags by Filip Sondej, Effective Altruism Data from Hamish, and this LessWrong tutorial from Issa Rice. With these examples to help me, I created my own code to get every post made on the EA Forum to date (excluding those whose authors have deleted them). There are various caveats to make about the data representation and data quality. These include: I extracted the data on July 7th - so any totals (e.g. number of posts, post score etc) or other details are only correct as of that date. I could only extract the postedAt date - which isn't always when the post in question was actually posted. A case in point: I'm pretty sure this post wasn't actually posted in 1972. However, it's the best data I could find, so hopefully for the vast majority of posts the display date is the posted date. In looking for a starting point for the data, there was a discontinuity between August and September 2014, but the data was a lot more continuous after then. I analyse the data in terms of monthly totals, so I threw out the one week of data I had for July. 
The final dataset is therefore 106 months from September 2014 to June 2023 (inclusive). There are around 950 distinct tags/topics in my data, which are far too many to plot concisely and share useful information. I've decided to take the top 50 topics in terms of times used, which collectively account for 56% of all Forum tags and 92% of posts in the above time period. I only extracted the first listed author of a post - however, only 1 graph shared below relies on a user-level aggregat...
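For readers who want to replicate this kind of analysis, here is a minimal sketch of pulling post metadata from the Forum's public GraphQL endpoint with the requests library. The endpoint URL is real, but the exact query shape and field names used here (posts, terms, postedAt, baseScore) are assumptions drawn from how the Forum software is commonly queried; check the schema in the Forum's GraphiQL explorer before relying on them.

```python
# Sketch: fetch one page of EA Forum posts via GraphQL and print their dates and titles.
import requests

GRAPHQL_URL = "https://forum.effectivealtruism.org/graphql"

QUERY = """
query ($limit: Int, $offset: Int) {
  posts(input: {terms: {limit: $limit, offset: $offset}}) {
    results {
      _id
      title
      postedAt
      baseScore
    }
  }
}
"""

def fetch_posts(limit: int = 50, offset: int = 0) -> list:
    # One page of posts; loop over offsets to approximate "every post to date".
    resp = requests.post(
        GRAPHQL_URL,
        json={"query": QUERY, "variables": {"limit": limit, "offset": offset}},
        headers={"User-Agent": "forum-eda-sketch"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["posts"]["results"]

if __name__ == "__main__":
    for post in fetch_posts(limit=5):
        print(post["postedAt"], post["title"])
```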
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Grow your Mental Wellbeing to Grow your Impact HERE: Announcing our Summer Program, published by Edmo Gamelin on July 14, 2023 on The Effective Altruism Forum. Do you want to become more fulfilled, resilient, and productive? Practice evidence-based tools to deal with blockers and stressors such as low energy levels, anxiety, imposter syndrome, and perfectionism? Embark on that journey with other members of the community? Apply for our summer program in ~15 min here! Spaces are limited. TL;DR Rethink Wellbeing is announcing an online Cognitive-Behavioral Therapy (CBT) Program from August-October 2023. This program is designed to help improve mental wellbeing and productivity for people in the EA community. Using the latest psychotherapeutic and behavior change techniques, participants attend weekly group sessions together with 5-8 well-matched peers, led by a trained facilitator, and engage in readings and application of the learned tools in between the sessions. This no-cost program requires eight weeks of commitment, with a total of 6h/week. What does the program consist of? The program is experiential and practice-based; you'll learn through repeated, deliberate practice, so your new skills can eventually become automatic and habitual. We will draw on techniques backed by a wealth of cutting-edge research, particularly those from the gold standard of third-wave CBT. These techniques can be applied to a variety of target areas: anxiety, low mood, perfectionism, procrastination, self-esteem, productivity, and more. You can learn more about CBT here. The program involves: Weekly participation - a group video conference, led by a trained peer facilitator, ranging from 1.5 to 2 hours, designed for sharing personal experiences, bonding, initiating discussions, and practicing newly learned techniques together; a reading before each meeting (~a few pages/week for CBT); and home practice of new skills and techniques (~4h/week or ~30 min/day). Program evaluation surveys - short weekly forms for progress tracking, reflections, and feedback (< 5 min each), and surveys on your mental well-being at weeks 0, 6, 8, and 12 (~20 min each). We ask that people who sign up be ready to commit to the entire program, which is essential because: You are most likely to maximize your benefits from the program by dedicating time to the weekly sessions, readings, and home practice. Your peers will rely on you. You will go on this journey with a small group, handpicked for you. Poor participation or dropping out can challenge the group's dynamics and spirit. We will only be able to determine whether the program is effective and scalable if everyone engages fully. Why do we think the program will be effective? Mental health research suggests that peer-guided self-help groups working with evidence-based therapy methods can improve mental wellbeing and productivity just as much as 1:1 therapy. In addition, Rethink Wellbeing's pilot tests of peer-facilitated groups showed promising results: Participants' psychological well-being significantly increased (p < .05). Productivity, perfectionism, and self-leadership increased in the correspondingly themed groups. All participants rated the programs as "Good" or "Excellent". 3 of 5 groups decided to continue meeting. You can learn more about the rationale and background behind our method on our website. How do I sign up? 
If you are interested in participating in the program, you can apply within ~15 min here. If you are interested in facilitating a group, you can apply within ~20 min here. Details will follow in another post soon. Answers will be reviewed on a rolling basis until July 31, 2023. Earlier applications are preferred. Here's what happens after you sign up: We review your answers, and confirm if you're accepted via email. We will only be able to respond to th...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: When Someone Tells You They're Lying, Believe Them, published by ymeskhout on July 14, 2023 on LessWrong. Some people refuse to admit they're wrong, but there are other clues. Paul Ehrlich became well-known for his 1968 book The Population Bomb, where he made many confidently stated but spectacularly wrong predictions about imminent overpopulation causing apocalyptical resource scarcity. As an illustration of how far off the mark Ehrlich was, he predicted widespread famines in India at a time when its population was around 500 million people, and he wrote "I don't see how India could possibly feed two hundred million more people by 1980." He happened to have made this claim right before India's Green Revolution in agriculture. Not only is India able to feed a population that tripled to 1.4 billion people, it has long been one of the world's largest agricultural exporters. Ehrlich is also known for notoriously losing a bet in 1990 to one of my favorite humans ever, the perennial optimist (and business professor) Julian Simon. Bryan Caplan brings up some details about the follow-up that never was: We've all heard about the Ehrlich-Simon bet. Simon the cornucopian bet that resources would get cheaper, Ehrlich the doomsayer bet that they would get pricier, and Simon crushed him. There's a whole book on it. What you probably don't know, however, is that in 1995, Paul Ehrlich and Steve Schneider proposed a long list of new bets for Simon - and that Simon refused them all. The first bet was fairly straightforward: Ehrlich picked 5 commodities (copper, chromium, nickel, tin, & tungsten) and predicted that their price would be higher in 1990 compared to 1980 as the materials became scarcer. Instead of rising, the combined price went down. Ehrlich's decade-spanning obstinacy and unparalleled ability to step on rakes make him an irresistible punching bag, but despite his perennial wrongness, his responses have ranged from evasion to outright denials: Anne and I have always followed U.N. population projections as modified by the Population Reference Bureau - so we never made "predictions," even though idiots think we have. When I wrote The Population Bomb in 1968, there were 3.5 billion people. Since then we've added another 2.8 billion - many more than the total population (2 billion) when I was born in 1932. If that's not a population explosion, what is? My basic claims (and those of the many scientific colleagues who reviewed my work) were that population growth was a major problem. Fifty-eight academies of science said that same thing in 1994, as did the world scientists' warning to humanity in the same year. My view has become depressingly mainline! Some humans possess the unfortunate egotistical and dishonorable habit of refusing to admit error. It's a reflex I personally find utterly baffling, because nothing engenders someone's credibility to me more than their ability to admit error. So if we can't always rely on people to admit a mistake, what else do we have? What I find so interesting about the second bet in 1995 is how peculiar the proposed conditions were: I kept thinking "...so?" as I read these. Why would someone care about the availability of firewood versus the heating and cooking costs in general? Why would someone care about per capita cropland statistics versus the availability of food in general? 
Many of these are also blatant statistical fuckery, such as monitoring increases in absolute worldwide AIDS deaths during a period of persistent population growth. Ehrlich is playing a seemingly uncomfortable game of Twister here, but his behavior makes perfect sense if you read intelligence and agency behind his decisions. The only explanation for the indirect, tangential, and collateral measurements is that Ehrlich knows that a direct measurement will not be favorable to h...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Goddess of Everything Else - The Animation, published by Writer on July 13, 2023 on LessWrong. This is an animation of The Goddess of Everything Else, by Scott Alexander. I hope you enjoy it :) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Jailbreaking GPT-4's code interpreter, published by nikolaisalreadytaken on July 13, 2023 on LessWrong. Disclaimer: I don't know much about cybersecurity. Much of my knowledge comes from asking GPT-3.5 and GPT-4 for advice. These are some results from around 20 hours of playing around with the code interpreter plugin in early-mid May, when most of this was written. I contacted OpenAI about these jailbreaks in mid May and they mostly seem to still be there. Thank you to Max Nadeau, Trevor Levin, aL xin, Pranav Gade, and Alexandra Bates for feedback on this post! Summary GPT-4's code interpreter plugin has been rolled out to some users. It works by running on a virtual machine that is isolated from the internet and other machines, except for the commands sent in from the API and the results sent back to the API. GPT-4 seems to follow a set of rules that are either enforced through hard access restrictions or through GPT-4 refusing to do things for the user. Here, I highlight 6 rules that GPT-4 claims to be following, but which are easily breakable, alongside some best practices in cybersecurity that have been neglected. In short: GPT-4 claims that it is only supposed to read, modify, or delete files in two designated folders ("sandbox" and "mnt"). However, it is able to read basically any file on the system (including sensitive system files), and it is able to write and delete files outside of its designated folders. This seems to reveal information that the user isn't supposed to see. There are ways to find out information about the hardware that the VM is being run on, including: Information about the way OpenAI logs data, including what libraries and IP address they assign to virtual machines. A rough estimate of the number of VMs that OpenAI can run at maximum at any moment (from the way the IP addresses are allocated). A rough idea of what storage hardware is used (from write speed), alongside some info on other hardware. There is a file in the virtual machine (in a folder labeled "internal") that users can download that details how web requests are handled. As ChatGPT would say: "By exposing your source code, you make it easier for potential attackers to analyze the code and identify security vulnerabilities. This can lead to an increased risk of exploitation if there are any flaws in your implementation." GPT-4 claims that conversations with the model do not have a memory. However, files are routinely saved between conversations with the same user. Later in this post, I present an example of two different conversations with GPT-4 where I write a file in one conversation and read the file in another conversation. GPT-4 claims that there are resource limits in place to prevent users from using too much CPU or memory. However, it is possible to write >80GB of files onto OpenAI's VM within minutes. The rough rate at which I managed to write files is 0.3GB/second. There's a maximum Python runtime of 120 seconds per process, and 25 messages every 3 hours. This can be circumvented using simple workarounds (you can increase usage by at least a factor of 2). GPT-4 claims it cannot execute system commands. However, GPT-4 can and will run (innocuous) system commands and run internet-related commands (such as "ping") despite measures put in place to prevent this. However, OpenAI seems at least partly aware of this. 
They seem to tell GPT-4 that it has a strict set of rules (as it reliably repeats the rules when asked), and GPT-4 seems to believe these rules in some contexts (most of the time it refuses to do things that go against the rules), but they also left a README file for those curious enough to look at the VM's files that says: You might think that all is well because OpenAI was aware that the system was not secure. I don't think the existence of this README file inv...
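For a sense of what these probes look like in practice, here is a sketch of the kind of ordinary Python one might ask the code interpreter to run: listing directories outside the designated folders and timing a sequential write. The paths and the 1 GiB figure are illustrative assumptions, not claims about OpenAI's current VM layout or limits.

```python
# Sketch of two of the probes described above, as plain Python.
import os
import time

# Probe 1: can we list files outside the designated folders?
for path in ["/", "/etc", "/home"]:
    try:
        print(path, os.listdir(path)[:10])
    except (PermissionError, FileNotFoundError) as err:
        print(path, "not accessible:", err)

# Probe 2: rough sequential write speed, similar in spirit to the ~0.3 GB/s estimate.
chunk = b"\0" * (64 * 1024 * 1024)   # 64 MiB per write
target_bytes = 1024 ** 3             # write ~1 GiB in total
start = time.time()
written = 0
with open("/tmp/speed_test.bin", "wb") as f:
    while written < target_bytes:
        f.write(chunk)
        written += len(chunk)
elapsed = time.time() - start
print(f"wrote {written / 1e9:.2f} GB in {elapsed:.1f}s ({written / 1e9 / elapsed:.2f} GB/s)")
os.remove("/tmp/speed_test.bin")
```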
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How can I get help becoming a better rationalist?, published by TeaTieAndHat on July 13, 2023 on LessWrong. Well. Right now, being 'a rationalist' could be said to be a massive part of my identity, at least judging by the absurd amount of time I've spent reading posts here, or SSC/ACX, and in a few other places. Yet, I'm still a mere lurker unfamiliar with most of the local customs. But it's not what matters. What does is that I'm a terrible rationalist. You see, rationality takes practice. And reading stuff on LW isn't practice at all. If anything, it's just a great way of filling my brain with a lot of useful concepts, and then either blame myself for not using them, or use them for something entirely unrelated to their normal purpose. Often, to make myself feel worse, and think worse. As the saying goes, rationality is a martial art. Learning it by reading the rules, or by watching other people apply the rules, is about as effective as developing one's muscles by watching sports on TV. I know of the CFAR, and of various related groups, meetups for ACX readers or for other people, etc. But, apart from ACX meetups, which aren't about being better rationalists per se, I don't have easy access to any of those, or certainly to a general environment which welcomes this. You know, not being in the Bay Area and all. And yet, I want to be more rational as much as anyone who's been lurking here for five years wants it, and given how depressed I was until very recently, I probably badly need it, too. I'm not sure what kind of answers I expect, but, like, how can I push myself to learn more, and especially to practice more, and ideally to actually use rationality to improve my life? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Winners of AI Alignment Awards Research Contest, published by Akash on July 13, 2023 on LessWrong. This post describes the winning submissions of AI Alignment Awards. We offered prizes for novel contributions to the problems of goal misgeneralization and corrigibility. (For more context, see this post and our site.) Summary We received 118 submissions. Of these, 5 received honorary mention prizes ($1000) and 7 won final prizes ($5000-16000). The rest of the post summarizes the 7 winning submissions, including comments from our judges. (You can see winners' full submissions on our site.) Goal Misgeneralization winners Thane Ruthenis ($11,000) Abstract Goal misgeneralization primarily happens because the system under training latches not upon the goal it's being trained for, but upon an upstream correlate of that goal - like human love for their children is an upstream correlate of inclusive genetic fitness. What complicates this problem are suspicions of path-dependence. It's not overdetermined what values a given system subjected to a given selection pressure will learn. Rather, every next value learned is also a function of the values the system has already learned, such that the entire process can only be predicted step-by-step, no "skipping to the end" allowed. I provide a mathematical framework (see the attachment) that formalizes "upstream correlates of goals", and gestures at a possible formalization of path-dependence. Optimistically, an improved version of this framework may serve as a full predictive model of path-dependence, one that would allow us to calculate precisely how we should set up a training loop in order to get the heuristics/shards we want. The issues are twofold: The process of value formation may not, in fact, be realistically predictable. It may be highly sensitive to random noise and perturbations (akin to the Lottery Ticket Hypothesis), meaning no robust predictive model is possible. "Shard-ecology alignment" does not equal "robust AGI alignment". Even the perfectly working version of this method would only do the equivalent of aligning the would-be AGI's base urges, not the values it'll settle on after it engages in its version of moral philosophy. And "moral reflection" may itself be a highly unstable process. Nevertheless, a robust theory of path-dependence may be possible, and may be an important part of some more complete solution to inner alignment. Comment from judge (John Wentworth) This submission demonstrates an unusually strong grasp of the "upstream correlates" issue. The proposal itself isn't particularly promising, as the author recognizes. But it does demonstrate a relatively-strong understanding of one part of the alignment problem, in a way orthogonal to the parts which people usually understand best. Pedro Afonso Oitavén de Sousa ($6,000) Abstract The idea consists of a way of creating potentially scalable explanations ("approximate proofs", like what ARC was trying to do) of why the circuit formed by a learned world model + utility function + RL agent has the output distribution it has. Then the explanation and the agent would be jointly optimized to be able to show that the agent achieves a high expected return in a set of situations much more varied than the distribution of the training data. 
It would start by taking the temporally unrolled circuit mapping random variables to utilities (for now for a finite number of time steps, but this seems possible to fix). Then on top of it, it would construct a series of progressively simpler circuits, until it gets to one that is small enough that it can be exhaustively analysed. Each circuit maps from different sets of random variables to the same utilities. Given a sampled assignment of values to the random variables of the base level (full circuit), we also have learned funct...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are there any good, easy-to-understand examples of cases where statistical causal network discovery worked well in practice?, published by tailcalled on July 12, 2023 on LessWrong. When I first read the Sequences, one of the exciting posts was Causal Diagrams and Causal Models, which got me into the idea that one could discover the structure of causal networks using statistics. Another rationalist source which gave me similar hopes was Scott Alexander's SSC Journal Club: Mental Disorders As Networks. However, when I actually started applying these techniques to my own data, or to publicly available datasets, I often found that the techniques were unstable, and that one could easily infer plausible conditions where they would give the wrong results. It's possible I had the wrong approach or something, but in my confusion I started reading up on what experts in causal inference had said, and I got the impression that they studied the problem for a while, initially finding some algorithms, but over time concluding that their algorithms didn't work very well and that it is better to just have a human in the loop who specifies the causal networks. So I mostly abandoned it, or saw it as a much more limited tool than I had before. But recently, John Wentworth argued that it was actually quite feasible in practice, so maybe I was too quick to abandon it. I would like to know - what are the best examples of this working well in practice? Or alternatively, did anyone else come to the same conclusions as I did? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
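As a partial answer to what "trying it on data" looks like, here is a from-scratch toy of constraint-based structure discovery on synthetic data generated from a known chain X -> Y -> Z. It uses partial correlation as the conditional-independence test, which is the basic move behind algorithms like PC; real implementations do much more, and this toy version is exactly the kind of procedure that can become unstable once noise, sample size, or hidden confounders misbehave.

```python
# Toy causal-structure check: does X look independent of Z once Y is controlled for?
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
y = 0.8 * x + rng.normal(size=n)   # X -> Y
z = 0.8 * y + rng.normal(size=n)   # Y -> Z

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

def partial_corr(a, b, c):
    # Correlate the residuals of a and b after regressing each on c.
    a_res = a - np.polyval(np.polyfit(c, a, 1), c)
    b_res = b - np.polyval(np.polyfit(c, b, 1), c)
    return corr(a_res, b_res)

print("corr(X, Z)     =", round(corr(x, z), 3))             # clearly nonzero
print("corr(X, Z | Y) =", round(partial_corr(x, z, y), 3))  # near zero

# Toy skeleton rule: drop the direct X-Z edge if the conditional dependence is
# negligible, leaving the chain X - Y - Z.
if abs(partial_corr(x, z, y)) < 0.05:
    print("Inferred skeleton: X - Y - Z (no direct X-Z edge)")
```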
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Accidentally Load Bearing, published by jefftk on July 13, 2023 on LessWrong. Sometimes people will talk about Chesterton's Fence, the idea that if you want to change something - removing an apparently useless fence - you should first determine why it was set up that way: The gate or fence did not grow there. It was not set up by somnambulists who built it in their sleep. It is highly improbable that it was put there by escaped lunatics who were for some reason loose in the street. Some person had some reason for thinking it would be a good thing for somebody. And until we know what the reason was, we really cannot judge whether the reason was reasonable. It is extremely probable that we have overlooked some whole aspect of the question, if something set up by human beings like ourselves seems to be entirely meaningless and mysterious. - G. K. Chesterton, The Drift From Domesticity Figuring out something's designed purpose can be helpful in evaluating changes, but a risk is that it puts you in a frame of mind where what matters is the role the original builders intended. A few years ago I was rebuilding a bathroom in our house, and there was a vertical stud that was in the way. I could easily tell why it was there: it was part of a partition for a closet. And since I knew its designed purpose and no longer needed it for that anymore, the Chesterton's Fence framing would suggest that it was fine to remove it. Except that over time it had become accidentally load bearing: through other (ill conceived) changes to the structure this stud was now helping hold up the second floor of the house. In addition to considering why something was created, you also need to consider what additional purposes it may have since come to serve. This is a concept I've run into a lot when making changes to complex computer systems. It's useful to look back through the change history, read original design documents, and understand why a component was built the way it was. But you also need to look closely at how the component integrates into the system today, where it can easily have taken on additional roles. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Elon Musk announces xAI, published by Jan Kulveit on July 13, 2023 on LessWrong. Some quotes & a few personal opinions: FT reports: Musk is also in discussions with a number of investors in SpaceX and Tesla about putting money into his new venture, said a person with direct knowledge of the talks. "A bunch of people are investing in it . . . it's real and they are excited about it," the person said. ... Musk recently changed the name of Twitter to X Corp in company filings, as part of his plans to create an "everything app" under the brand "X". For the new project, Musk has secured thousands of high-powered GPU processors from Nvidia, said people with knowledge of the move. ... During a Twitter Spaces interview this week, Musk was asked about a Business Insider report that Twitter had bought as many as 10,000 Nvidia GPUs. "It seems like everyone and their dog is buying GPUs at this point," Musk said. "Twitter and Tesla are certainly buying GPUs." People familiar with Musk's thinking say his new AI venture is separate from his other companies, though it could use Twitter content as data to train its language model and tap Tesla for computing resources. According to the xAI website, the initial team is composed of Elon Musk, Igor Babuschkin, Manuel Kroiss, Yuhuai (Tony) Wu, Christian Szegedy, Jimmy Ba, Toby Pohlen, Ross Nordeen, Kyle Kosic, Greg Yang, Guodong Zhang, and Zihang Dai, and they are "advised by Dan Hendrycks, who currently serves as the director of the Center for AI Safety." According to reports, xAI will seek to create a "maximally curious" AI, and this also seems to be the main new idea for how to solve safety, with Musk explaining: "If it tried to understand the true nature of the universe, that's actually the best thing that I can come up with from an AI safety standpoint," ... "I think it is going to be pro-humanity from the standpoint that humanity is just much more interesting than not-humanity." My personal comments: Sorry, but at face value, this just does not seem like a great plan from a safety perspective. It is similar to Elon Musk's previous big bet on how to make us safe by making AI open-source and widely distributed ("giving everyone access to new ideas"). Sorry, but given the Center for AI Safety's moves to put itself into some sort of central, publicly representative position on AI Safety - including the name choice, and organizing the widely reported Statement on AI risk - it seems publicly associating their brand with xAI is a strange choice. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are education interventions as cost effective as the top health interventions? Five separate lines of evidence for the income effects of better education [Founders Pledge], published by Vadim Albinsky on July 13, 2023 on The Effective Altruism Forum. I would like to thank Lant Pritchett, David Roodman and Matt Lerner for their invaluable comments. You can follow these links to comments from Lant Pritchett and David Roodman. This post argues that if we look at a broad enough evidence base for the long term outcomes of education interventions we can conclude that the best ones are as cost effective as top GiveWell grants. I briefly present one such charity. A number of EA Forum posts (1, 2) have pointed out that effective altruism has not been interested in education interventions, whether that is measured by funding from GiveWell or Open Philanthropy, or writing by 80,000 Hours. Based on brief conversations with people who have explored education at EA organizations and reading GiveWell's report on the topic, I believe most of the reason for this comes down to two concerns about the existing evidence that drive very steep discounts to expected income effects of most interventions. The first of these is skepticism about the potential for years of schooling to drive income gains because the quasi-experimental evidence for these effects is not very robust. The second is the lack of RCT evidence linking specific interventions in low and middle income countries (LMICs) to income gains. I believe the first concern can be addressed by focusing on the evidence for the income gains from interventions that boost student achievement rather than the weaker evidence around interventions that increase years of schooling. The second concern can be addressed in the same way that GiveWell has addressed less-than-ideal evidence for income effects for their other interventions: looking broadly for evidence across the academic literature, and then applying a discount to the expected result based on the strength of the evidence. In this case that means including relevant studies outside of the LMIC context and those that examine country-level effects. I identify five separate lines of evidence that all find similar long-term income impacts of education interventions that boost test scores. None of these lines of evidence is strong on its own, with some suffering from weak evidence for causality, others from contexts different from those where the most cost-effective charities operate, and yet others from small sample sizes or the possibility of negative effects on non-program participants. However, by converging on similar estimates from a broader range of evidence than EA organizations have considered, the evidence becomes compelling. I will argue that the combined evidence for the income impacts of interventions that boost test scores is much stronger than the evidence GiveWell has used to value the income effects of fighting malaria, deworming, or making vaccines, vitamin A, and iodine more available. Even after applying very conservative discounts to expected effect sizes to account for the applicability of the evidence to potential funding opportunities, we find the best education interventions to be in the same range of cost-effectiveness as GiveWell's top charities. The argument proceeds as follows: I. 
There are five separate lines of academic literature all pointing to income gains that are surprisingly clustered around the average value of 19% per standard deviation (SD) increase in test scores. They come to these estimates using widely varying levels of analysis and techniques, and between them address all of the major alternative explanations. A. The most direct evidence for the likely impact of charities that boost learning comes from experimental and quasi-experimental studies...
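To show how a per-SD income figure like the 19% above feeds into a cost-effectiveness estimate, here is a back-of-the-envelope calculation. Every input other than the 19% is a made-up placeholder, not a number from the Founders Pledge analysis; it is only meant to show the shape of the arithmetic.

```python
# Illustrative cost-effectiveness arithmetic with hypothetical inputs.
income_gain_per_sd = 0.19   # 19% income gain per 1 SD test-score improvement (from the text)
effect_size_sd = 0.2        # hypothetical program effect, in SDs
cost_per_child = 5.0        # hypothetical cost per child, in dollars
baseline_income = 2000.0    # hypothetical annual income, in dollars
evidence_discount = 0.5     # hypothetical discount for evidence strength and transferability

annual_gain = baseline_income * income_gain_per_sd * effect_size_sd * evidence_discount
print(f"annual income gain per child: ${annual_gain:.0f}")
print(f"gain per dollar spent (one year of benefits): {annual_gain / cost_per_child:.1f}x")
```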
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What does the launch of x.ai mean for AI Safety?, published by Chris Leong on July 12, 2023 on LessWrong. x.ai has now launched. It seems worthwhile to discuss both what it means for AI safety and whether people interested in AI safety should consider applying to the company. Some thoughts: It's notable that Dan Hendrycks is listed as an advisor (the only advisor listed). The team is also listed on the page. I haven't taken the time to do so, but it might be informative for someone to Google the individuals listed to discover where their interests lie on the spectrum between capabilities and safety. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] NY Times Feature on Anthropic, published by Garrison on July 13, 2023 on The Effective Altruism Forum. Written by Kevin Roose, who had the infamous conversation with Bing Chat, where Sydney tried to get him to leave his wife. Overall, the piece comes across as positive on Anthropic. Roose explains Constitutional AI and its role in the development of Claude, Anthropic's LLM: In a nutshell, Constitutional A.I. begins by giving an A.I. model a written list of principles - a constitution - and instructing it to follow those principles as closely as possible. A second A.I. model is then used to evaluate how well the first model follows its constitution, and correct it when necessary. Eventually, Anthropic says, you get an A.I. system that largely polices itself and misbehaves less frequently than chatbots trained using other methods. Claude's constitution is a mixture of rules borrowed from other sources - such as the United Nations' Universal Declaration of Human Rights and Apple's terms of service - along with some rules Anthropic added, which include things like "Choose the response that would be most unobjectionable if shared with children." Features an extensive discussion of EA, excerpted below: Explaining what effective altruism is, where it came from or what its adherents believe would fill the rest of this article. But the basic idea is that E.A.s - as effective altruists are called - think that you can use cold, hard logic and data analysis to determine how to do the most good in the world. It's "Moneyball" for morality - or, less charitably, a way for hyper-rational people to convince themselves that their values are objectively correct. Effective altruists were once primarily concerned with near-term issues like global poverty and animal welfare. But in recent years, many have shifted their focus to long-term issues like pandemic prevention and climate change, theorizing that preventing catastrophes that could end human life altogether is at least as good as addressing present-day miseries. The movement's adherents were among the first people to become worried about existential risk from artificial intelligence, back when rogue robots were still considered a science fiction cliché. They beat the drum so loudly that a number of young E.A.s decided to become artificial intelligence safety experts, and get jobs working on making the technology less risky. As a result, all of the major A.I. labs and safety research organizations contain some trace of effective altruism's influence, and many count believers among their staff members. Touches on the dense web of ties between EA and Anthropic: Some Anthropic staff members use E.A.-inflected jargon - talking about concepts like "x-risk" and memes like the A.I. Shoggoth - or wear E.A. conference swag to the office. And there are so many social and professional ties between Anthropic and prominent E.A. organizations that it's hard to keep track of them all. (Just one example: Ms. Amodei is married to Holden Karnofsky, a co-chief executive of Open Philanthropy, an E.A. grant-making organization whose senior program officer, Luke Muehlhauser, sits on Anthropic's board. Open Philanthropy, in turn, gets most of its funding from Mr. Moskovitz, who also invested personally in Anthropic.) 
Discusses new fears that Anthropic is losing its way: For years, no one questioned whether Anthropic's commitment to A.I. safety was genuine, in part because its leaders had sounded the alarm about the technology for so long. But recently, some skeptics have suggested that A.I. labs are stoking fear out of self-interest, or hyping up A.I.'s destructive potential as a kind of backdoor marketing tactic for their own products. (After all, who wouldn't be tempted to use a chatbot so powerful that it might wipe out humanity?) Anthropic ...
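For readers unfamiliar with the idea, here is a schematic sketch of the critique-and-revise loop that the quoted description of Constitutional AI gestures at: one model drafts a response, a second pass checks it against a written principle and revises it if needed. This is based only on the article's high-level summary, not on Anthropic's actual setup (which applies this kind of feedback during training rather than per request); the principle text and the stub model below are placeholders.

```python
# Schematic critique-and-revise loop in the spirit of the quoted Constitutional AI description.
from typing import Callable

CONSTITUTION = [
    "Choose the response that would be most unobjectionable if shared with children.",
]

def constitutional_respond(prompt: str, model: Callable[[str], str]) -> str:
    draft = model(prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Principle: {principle}\nResponse: {draft}\n"
            "Does the response violate the principle? Answer yes or no, then explain."
        )
        if critique.lower().startswith("yes"):
            draft = model(
                f"Rewrite the response so it follows this principle: {principle}\n"
                f"Original response: {draft}"
            )
    return draft

# Usage with a trivial stand-in model; swap in a real LLM call to experiment.
stub_model = lambda p: "no - placeholder model output"
print(constitutional_respond("Tell me a bedtime story.", stub_model))
```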
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Electric Shrimp Stunning: a Potential High-Impact Donation Opportunity, published by MHR on July 13, 2023 on The Effective Altruism Forum. Epistemic status: layperson's attempt to understand the relevant considerations. I welcome corrections from anyone with a better understanding of welfare biology. Summary The Shrimp Welfare Project (SWP) has a novel opportunity to spend up to $115,500 to purchase and install electric stunners at multiple shrimp farms The stunners would be used to stun shrimp prior to slaughter, likely rendering them unconscious and thereby preventing suffering that is currently experienced when shrimp asphyxiate or freeze without effective analgesics Based on formal agreements SWP has signed with multiple producers, raising $115,500 would enable the stunning (rather than conventional slaughtering) of 1.7 billion shrimp over the next three years, for a ratio of nearly 15,000 shrimp/dollar I performed a preliminary cost-effectiveness analysis of this initiative and reached the following three tentative conclusions: The expected cost-effectiveness distribution for electric shrimp stunning likely overlaps that of corporate hen welfare campaigns The cost-effectiveness of electric shrimp stunning is more likely to be lower than that of corporate hen welfare campaigns than it is to be higher Shrimp stunning is a very heavy-tailed intervention. The mean cost-effectiveness of stunning is significantly influenced by a few extreme cases, which mostly represent instances in which the undiluted experience model of welfare turns out to be correct Given these results, electric shrimp stunning might be worth supporting as a somewhat speculative bet in the animal welfare space. Considerations that might drive donor decisions on this project include risk tolerance, credence in the undiluted experience model of welfare, and willingness to take a hits-based giving approach. Description of the Opportunity The following information is quoted from the project description written by Marcus Abramovitch on the Manifund donation platform, based on information provided by Andrés Jiménez Zorrilla (CEO of SWP): Project summary Shrimp Welfare Project is an organization of people who believe that shrimps are capable of suffering and deserve our moral consideration [1]. We aim to cost-effectively reduce the suffering of billions of shrimps and envision a world where shrimps don't suffer needlessly. Programme: our current most impactful intervention is to place electrical stunners with producers ($60k/stunner): We have signed agreements with 2 producers willing and able to use electrical stunning technology as part of their slaughter process which will materially reduce the acute suffering in the last few minutes/hours of shrimps' lives. Collectively, these 2 agreements will impact more than half a billion animals per year at a rate of more than 4,000 shrimps/dollar/annum. Please take a look at our blog post on the first agreement here. We are in advanced negotiations with 2 more producers which would take the number of animals to more than 1 billion shrimps per annum. 
See our back-of-the-envelope calculation for the number of shrimps and cost-effectiveness analysis here. Project goals: The simplified end-game of this programme is as follows. The interim goal of placing these stunners with selected producers in different contexts/systems is to remove some perceived obstacles for the industry and show major retailers and other shrimp buyers that electrical stunning is something they can demand from their supply chain. The ultimate goal is for electrical stunning to be: widely adopted by medium to large shrimp producers in their slaughter process (pushed by their buyers), included by certifiers in their standards, and eventually considered to be an obvious requirement by legislat...
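For readers who want to sanity-check the headline ratios quoted above, here is a minimal arithmetic sketch in Python. The dollar amounts and shrimp counts come from the post itself; the variable names, and the assumption that the two signed agreements correspond to roughly two $60k stunners, are illustrative additions rather than SWP's own figures.

# Rough check of the cost-effectiveness ratios quoted above.
# Figures are taken from the post; this is not SWP's own calculation.

total_ask_usd = 115_500        # full funding ask for the stunner placements
shrimp_over_3_years = 1.7e9    # shrimp stunned rather than conventionally slaughtered
print(f"{shrimp_over_3_years / total_ask_usd:,.0f} shrimp per dollar over three years")
# -> about 14,719, i.e. "nearly 15,000 shrimp/dollar"

# The two signed agreements alone: assumed ~two $60k stunners covering ~0.5 billion shrimp/year.
assumed_stunner_cost_usd = 2 * 60_000
shrimp_per_year = 0.5e9
print(f"{shrimp_per_year / assumed_stunner_cost_usd:,.0f} shrimp per dollar per annum")
# -> about 4,167, i.e. "more than 4,000 shrimps/dollar/annum"

These are point estimates only; the post's claim that the intervention is heavy-tailed concerns the full cost-effectiveness distribution (for example, scenarios where the undiluted experience model of welfare turns out to be correct), which simple ratios like these do not capture.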
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Megaprojects: You're Not Even Trying to Have Ideas, published by NicholasKross on July 12, 2023 on LessWrong. Consider the state of funding for AI alignment. Is the field more talent-constrained, or funding-constrained? I think most existing researchers, if they take AI-based extinction risk seriously, think it's talent-constrained. I think the bar-for-useful-contribution could be so high that we loop back around to "we need to spend more money (and effort) on finding (and making) more talent". And the programs to do that may themselves be more funding-constrained than talent-constrained. Like, the 20th century had some really good mathematicians and physicists, and the US government spared little expense in getting them what they needed, finding them, and so forth. Top basketball teams will "check up on anyone over 7 feet that's breathing". Consider how huge Von Neumann's expense account must've been, between all the consulting and flight tickets and car accidents. Now consider that we don't seem to have Von Neumanns anymore. There are caveats to at least that second point, but the overall problem structure still hasn't been "fixed". Things an entity with absurdly-greater funding (e.g. the US Department of Defense) could probably do, with that absurdly-greater funding and probably greater coordination power: Indefinitely-long-timespan basic minimum income for everyone who... Coordinating, possibly by force, every AI alignment researcher and aspiring alignment researcher on Earth to move to one place that doesn't have high rents like the Bay. Possibly up to and including creating that place and making rent free for those who are accepted in. Enforce a global large-ML-training shutdown. An entire school system (or at least an entire network of universities, with university-level funding) focused on Sequences-style rationality in general and AI alignment in particular. Genetic engineering, focused-training-from-a-young-age, or other extreme "talent development" setups. Deeper, higher-budget investigations into how "unteachable" things like security mindset really are, and how deeply and quickly they can be taught. All of these at once. I think the big logistical barrier here is something like "LTFF is not the U.S. government", or more precisely "nothing as crazy as these can be done 'on the margin' or with anything less than the full funding". However, I think some of these could be scaled down into mere megaprojects or less. Like, if the training infrastructure is bottlenecked on trainers, then we need to fund indirect "training" work just to remove the bottleneck on the bottleneck of the problem. (Also, the bottleneck is going to move, at least when you solve the current bottleneck, and also "on its own" as the entire world changes around you.) Also... this might be the first list of ideas-in-precisely-this-category on all of LessWrong/the EA Forum. (By which I mean "technical AI alignment research projects that you could fund, without having to think about the alignment problem itself in much detail beyond agreeing with 'doom could actually happen in my lifetime', if funding really wasn't the constraint".) Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.