The Nonlinear Library: Alignment Forum Top Posts


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio.


    • Latest episode: Dec 10, 2021
    • New episodes: infrequent
    • Average duration: 19m
    • Episodes: 242



    Latest episodes from The Nonlinear Library: Alignment Forum Top Posts

    Discussion with Eliezer Yudkowsky on AGI interventions by Rob Bensinger, Eliezer Yudkowsky

    Dec 10, 2021 · 55:18


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discussion with Eliezer Yudkowsky on AGI interventions, published by Rob Bensinger, Eliezer Yudkowsky on the AI Alignment Forum. The following is a partially redacted and lightly edited transcript of a chat conversation about AGI between Eliezer Yudkowsky and a set of invitees in early September 2021. By default, all other participants are anonymized as "Anonymous". I think this Nate Soares quote (excerpted from Nate's response to a report by Joe Carlsmith) is a useful context-setting preface regarding timelines, which weren't discussed as much in the transcript: [...] My odds [of AGI by the year 2070] are around 85%[...] I can list a handful of things that drive my probability of AGI-in-the-next-49-years above 80%: 1. 50 years ago was 1970. The gap between AI systems then and AI systems now seems pretty plausibly greater than the remaining gap, even before accounting the recent dramatic increase in the rate of progress, and potential future increases in rate-of-progress as it starts to feel within-grasp. 2. I observe that, 15 years ago, everyone was saying AGI is far off because of what it couldn't do -- basic image recognition, go, starcraft, winograd schemas, programmer assistance. But basically all that has fallen. The gap between us and AGI is made mostly of intangibles. (Computer Programming That Is Actually Good? Theorem proving? Sure, but on my model, "good" versions of those are a hair's breadth away from full AGI already. And the fact that I need to clarify that "bad" versions don't count, speaks to my point that the only barriers people can name right now are intangibles.) That's a very uncomfortable place to be! 3. When I look at the history of invention, and the various anecdotes about the Wright brothers and Enrico Fermi, I get an impression that, when a technology is pretty close, the world looks a lot like how our world looks. Of course, the trick is that when a technology is a little far, the world might also look pretty similar! Though when a technology is very far, the world does look different -- it looks like experts pointing to specific technical hurdles. We exited that regime a few years ago. 4. Summarizing the above two points, I suspect that I'm in more-or-less the "penultimate epistemic state" on AGI timelines: I don't know of a project that seems like they're right on the brink; that would put me in the "final epistemic state" of thinking AGI is imminent. But I'm in the second-to-last epistemic state, where I wouldn't feel all that shocked to learn that some group has reached the brink. Maybe I won't get that call for 10 years! Or 20! But it could also be 2, and I wouldn't get to be indignant with reality. I wouldn't get to say "but all the following things should have happened first, before I made that observation". I have made those observations. 5. It seems to me that the Cotra-style compute-based model provides pretty conservative estimates. For one thing, I don't expect to need human-level compute to get human-level intelligence, and for another I think there's a decent chance that insight and innovation have a big role to play, especially on 50 year timescales. 6. There has been a lot of AI progress recently. 
When I tried to adjust my beliefs so that I was positively surprised by AI progress just about as often as I was negatively surprised by AI progress, I ended up expecting a bunch of rapid progress. [...] Further preface by Eliezer: In some sections here, I sound gloomy about the probability that coordination between AGI groups succeeds in saving the world. Andrew Critch reminds me to point out that gloominess like this can be a self-fulfilling prophecy - if people think successful coordination is impossible, they won't try to coordinate. I therefore remark in retrospective advance that it seems to me like at least some of the top...

    What failure looks like by Paul Christiano

    Dec 10, 2021 · 14:17


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What failure looks like, published by Paul Christiano on the AI Alignment Forum. The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity. I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I'll tell the story in two parts: Part I: machine learning will increase our ability to “get what we can measure,” which could cause a slow-rolling catastrophe. ("Going out with a whimper.") Part II: ML training, like competitive economies or natural ecosystems, can give rise to “greedy” patterns that try to expand their own influence. Such patterns can ultimately dominate the behavior of a system and cause sudden breakdowns. ("Going out with a bang," an instance of optimization daemons.) I think these are the most important problems if we fail to solve intent alignment. In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I'm scared even if we have several years. With fast enough takeoff, my expectations start to look more like the caricature---this post envisions reasonably broad deployment of AI, which becomes less and less likely as things get faster. I think the basic problems are still essentially the same though, just occurring within an AI lab rather than across the world. (None of the concerns in this post are novel.) Part I: You get what you measure If I want to convince Bob to vote for Alice, I can experiment with many different persuasion strategies and see which ones work. Or I can build good predictive models of Bob's behavior and then search for actions that will lead him to vote for Alice. These are powerful techniques for achieving any goal that can be easily measured over short time periods. But if I want to help Bob figure out whether he should vote for Alice---whether voting for Alice would ultimately help create the kind of society he wants---that can't be done by trial and error. To solve such tasks we need to understand what we are doing and why it will yield good outcomes. We still need to use data in order to improve over time, but we need to understand how to update on new data in order to improve. Some examples of easy-to-measure vs. hard-to-measure goals: Persuading me, vs. helping me figure out what's true. (Thanks to Wei Dai for making this example crisp.) Reducing my feeling of uncertainty, vs. increasing my knowledge about the world. Improving my reported life satisfaction, vs. actually helping me live a good life. Reducing reported crimes, vs. actually preventing crime. Increasing my wealth on paper, vs. increasing my effective control over resources. It's already much easier to pursue easy-to-measure goals, but machine learning will widen the gap by letting us try a huge number of possible strategies and search over massive spaces of possible actions. That force will combine with and amplify existing institutional and social dynamics that already favor easily-measured goals. Right now humans thinking and talking about the future they want to create are a powerful force that is able to steer our trajectory. 
But over time human reasoning will become weaker and weaker compared to new forms of reasoning honed by trial-and-error. Eventually our society's trajectory will be determined by powerful optimization with easily-measurable goals rather than by human intentions about the future. We will try to harness this power by constructing proxies for what we care about, but over time those proxies will come apart: Corporations will deliver value to consumers as measured by profit. Eventually th...
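
The "proxies come apart" dynamic has a simple numerical illustration. The sketch below is my own toy (a "regressional Goodhart" setup, narrower than the structural story the post tells): each candidate action has a true value, we can only measure a noisy proxy of it, and the harder we select on the proxy, the further the selected option's true value falls behind its measured score.

```python
import random

random.seed(0)

def best_by_proxy(n_candidates: int):
    """Search n candidates; return (true_value, proxy) of the proxy-maximizing one.
    The proxy (what we can measure) is the true value plus measurement noise."""
    best = None
    for _ in range(n_candidates):
        true_value = random.gauss(0, 1)
        proxy = true_value + random.gauss(0, 1)
        if best is None or proxy > best[1]:
            best = (true_value, proxy)
    return best

# More candidates searched = more optimization pressure on the measurable proxy.
# Averaged over repeated searches, the gap between the winner's measured score
# and its true value grows with the amount of search.
for n in (10, 1_000, 30_000):
    gaps = []
    for _ in range(100):
        true_value, proxy = best_by_proxy(n)
        gaps.append(proxy - true_value)
    print(f"search size {n:>6}: mean proxy-minus-true gap {sum(gaps) / len(gaps):.2f}")
```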

    The Parable of Predict-O-Matic by Abram Demski

    Dec 10, 2021 · 23:52


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Parable of Predict-O-Matic, published by Abram Demski on the AI Alignment Forum. I've been thinking more about partial agency. I want to expand on some issues brought up in the comments to my previous post, and on other complications which I've been thinking about. But for now, a more informal parable. (Mainly because this is easier to write than my more technical thoughts.) This relates to oracle AI and to inner optimizers, but my focus is a little different. 1 Suppose you are designing a new invention, a predict-o-matic. It is a wonderous machine which will predict everything for us: weather, politics, the newest advances in quantum physics, you name it. The machine isn't infallible, but it will integrate data across a wide range of domains, automatically keeping itself up-to-date with all areas of science and current events. You fully expect that once your product goes live, it will become a household utility, replacing services like Google. (Google only lets you search the known!) Things are going well. You've got investors. You have an office and a staff. These days, it hardly even feels like a start-up any more; progress is going well. One day, an intern raises a concern. "If everyone is going to be using Predict-O-Matic, we can't think of it as a passive observer. Its answers will shape events. If it says stocks will rise, they'll rise. If it says stocks will fall, then fall they will. Many people will vote based on its predictions." "Yes," you say, "but Predict-O-Matic is an impartial observer nonetheless. It will answer people's questions as best it can, and they react however they will." "But --" the intern objects -- "Predict-O-Matic will see those possible reactions. It knows it could give several different valid predictions, and different predictions result in different futures. It has to decide which one to give somehow." You tap on your desk in thought for a few seconds. "That's true. But we can still keep it objective. It could pick randomly." "Randomly? But some of these will be huge issues! Companies -- no, nations -- will one day rise or fall based on the word of Predict-O-Matic. When Predict-O-Matic is making a prediction, it is choosing a future for us. We can't leave that to a coin flip! We have to select the prediction which results in the best overall future. Forget being an impassive observer! We need to teach Predict-O-Matic human values!" You think about this. The thought of Predict-O-Matic deliberately steering the future sends a shudder down your spine. But what alternative do you have? The intern isn't suggesting Predict-O-Matic should lie, or bend the truth in any way -- it answers 100% honestly to the best of its ability. But (you realize with a sinking feeling) honesty still leaves a lot of wiggle room, and the consequences of wiggles could be huge. After a long silence, you meet the interns eyes. "Look. People have to trust Predict-O-Matic. And I don't just mean they have to believe Predict-O-Matic. They're bringing this thing into their homes. They have to trust that Predict-O-Matic is something they should be listening to. We can't build value judgements into this thing! If it ever came out that we had coded a value function into Predict-O-Matic, a value function which selected the very future itself by selecting which predictions to make -- we'd be done for! 
No matter how honest Predict-O-Matic remained, it would be seen as a manipulator. No matter how beneficent its guiding hand, there are always compromises, downsides, questionable calls. No matter how careful we were to set up its values -- to make them moral, to make them humanitarian, to make them politically correct and broadly appealing -- who are we to choose? No. We'd be done for. They'd hang us. We'd be toast!" You realize at this point that you've stood up and start...

    What 2026 looks like by Daniel Kokotajlo

    Dec 10, 2021 · 27:26


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What 2026 looks like, published by Daniel Kokotajlo on the AI Alignment Forum. This was written for the Vignettes Workshop.[1] The goal is to write out a detailed future history (“trajectory”) that is as realistic (to me) as I can currently manage, i.e. I'm not aware of any alternative trajectory that is similarly detailed and clearly more plausible to me. The methodology is roughly: Write a future history of 2022. Condition on it, and write a future history of 2023. Repeat for 2024, 2025, etc. (I'm posting 2022-2026 now so I can get feedback that will help me write 2027+. I intend to keep writing until the story reaches singularity/extinction/utopia/etc.) What's the point of doing this? Well, there are a couple of reasons: Sometimes attempting to write down a concrete example causes you to learn things, e.g. that a possibility is more or less plausible than you thought. Most serious conversation about the future takes place at a high level of abstraction, talking about e.g. GDP acceleration, timelines until TAI is affordable, multipolar vs. unipolar takeoff. vignettes are a neglected complementary approach worth exploring. Most stories are written backwards. The author begins with some idea of how it will end, and arranges the story to achieve that ending. Reality, by contrast, proceeds from past to future. It isn't trying to entertain anyone or prove a point in an argument. Anecdotally, various people seem to have found Paul Christiano's “tales of doom” stories helpful, and relative to typical discussions those stories are quite close to what we want. (I still think a bit more detail would be good — e.g. Paul's stories don't give dates, or durations, or any numbers at all really.)[2] “I want someone to ... write a trajectory for how AI goes down, that is really specific about what the world GDP is in every one of the years from now until insane intelligence explosion. And just write down what the world is like in each of those years because I don't know how to write an internally consistent, plausible trajectory. I don't know how to write even one of those for anything except a ridiculously fast takeoff.” --Buck Shlegeris This vignette was hard to write. To achieve the desired level of detail I had to make a bunch of stuff up, but in order to be realistic I had to constantly ask “but actually though, what would really happen in this situation?” which made it painfully obvious how little I know about the future. There are numerous points where I had to conclude “Well, this does seem implausible, but I can't think of anything more plausible at the moment and I need to move on.” I fully expect the actual world to diverge quickly from the trajectory laid out here. Let anyone who (with the benefit of hindsight) claims this divergence as evidence against my judgment prove it by exhibiting a vignette/trajectory they themselves wrote in 2021. If it maintains a similar level of detail (and thus sticks its neck out just as much) while being more accurate, I bow deeply in respect! I hope this inspires other people to write more vignettes soon. We at the Center on Long-Term Risk would like to have a collection to use for strategy discussions. Let me know if you'd like to do this, and I can give you advice & encouragement! I'd be happy to run another workshop. 2022 GPT-3 is finally obsolete. 
OpenAI, Google, Facebook, and DeepMind all have gigantic multimodal transformers, similar in size to GPT-3 but trained on images, video, maybe audio too, and generally higher-quality data. Not only that, but they are now typically fine-tuned in various ways--for example, to answer questions correctly, or produce engaging conversation as a chatbot. The chatbots are fun to talk to but erratic and ultimately considered shallow by intellectuals. They aren't particularly useful for anything supe...

    Are we in an AI overhang? by Andy Jones

    Dec 10, 2021 · 7:52


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Are we in an AI overhang?, published by Andy Jones on the AI Alignment Forum. Over on Developmental Stages of GPTs, orthonormal mentions it at least reduces the chance of a hardware overhang. An overhang is when you have had the ability to build transformative AI for quite some time, but you haven't because no-one's realised it's possible. Then someone does and surprise! It's a lot more capable than everyone expected. I am worried we're in an overhang right now. I think we right now have the ability to build an orders-of-magnitude more powerful system than we already have, and I think GPT-3 is the trigger for 100x larger projects at Google, Facebook and the like, with timelines measured in months. Investment Bounds GPT-3 is the first AI system that has obvious, immediate, transformative economic value. While much hay has been made about how much more expensive it is than a typical AI research project, in the wider context of megacorp investment, its costs are insignificant. GPT-3 has been estimated to cost $5m in compute to train, and - looking at the author list and OpenAI's overall size - maybe another $10m in labour. Google, Amazon and Microsoft each spend about $20bn/year on R&D and another $20bn each on capital expenditure. Very roughly, it totals to $100bn/year. Against this budget, dropping $1bn or more on scaling GPT up by another factor of 100x is entirely plausible right now. All that's necessary is that tech executives stop thinking of natural language processing as cutesy blue-sky research and start thinking in terms of quarters-till-profitability. A concrete example is Waymo, which is raising $2bn investment rounds - and that's for a technology with a much longer road to market. Compute Cost The other side of the equation is compute cost. The $5m GPT-3 training cost estimate comes from using V100s at $10k/unit and 30 TFLOPS, which is the performance without tensor cores being considered. Amortized over a year, this gives you about $1000/PFLOPS-day. However, this cost is driven up an order of magnitude by NVIDIA's monopolistic cloud contracts, while performance will be higher when taking tensor cores into account. The current hardware floor is nearer to the RTX 2080 TI's $1k/unit for 125 tensor-core TFLOPS, and that gives you $25/PFLOPS-day. This roughly aligns with AI Impacts' current estimates, and offers another >10x speedup to our model. I strongly suspect other bottlenecks stop you from hitting that kind of efficiency or GPT-3 would've happened much sooner, but I still think $25/PFLOPS-day is a lower useful bound. Other Constraints I've focused on money so far because most of the current 3.5-month doubling times come from increasing investment. But money aside, there are a couple of other things that could prove to be the binding constraint. Scaling law breakdown. The GPT series' scaling is expected to break down around 10k pflops-days (§6.3), which is a long way short of the amount of cash on the table. This could be because the scaling analysis was done on 1024-token sequences. Maybe longer sequences can go further. More likely I'm misunderstanding something. Sequence length. GPT-3 uses 2048 tokens at a time, and that's with an efficient encoding that cripples it on many tasks. 
With the naive architecture, increasing the sequence length is quadratically expensive, and getting up to novel-length sequences is not very likely. But there are a lot of plausible ways to fix that, and complexity is no bar to AI. This constraint might plausibly not be resolved on a timescale of months, however. Data availability. From the same paper as the previous point, dataset size rises with the square-root of compute; a 1000x larger GPT-3 would want 10 trillion tokens of training data. It's hard to find a good estimate on total-words-ever-written, but our library of 130m...
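
The compute-cost arithmetic above is easy to reproduce. A minimal sketch, using the post's one-year amortization and ignoring power, interconnect, and utilization (my simplification):

```python
def cost_per_pflops_day(unit_price_usd: float, tflops: float, amortization_days: float = 365.0) -> float:
    """Dollars per PFLOPS-day for one accelerator, amortizing its price over its lifetime."""
    pflops = tflops / 1000.0                      # 1 PFLOPS = 1000 TFLOPS
    daily_cost = unit_price_usd / amortization_days
    return daily_cost / pflops

# V100 without tensor cores: ~$10k/unit at ~30 TFLOPS -> the post's ~$1000/PFLOPS-day
print(round(cost_per_pflops_day(10_000, 30)))     # ~913

# RTX 2080 Ti with tensor cores: ~$1k/unit at ~125 TFLOPS -> the post's ~$25/PFLOPS-day
print(round(cost_per_pflops_day(1_000, 125)))     # ~22
```

On this rough arithmetic, the ~10k PFLOPS-days at which the post expects GPT scaling to break down costs only about $250k of compute at the $25/PFLOPS-day floor, which is the sense in which the scaling-law constraint rather than cash would bind first.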

    DeepMind: Generally capable agents emerge from open-ended play by Daniel Kokotajlo

    Dec 10, 2021 · 3:15


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: DeepMind: Generally capable agents emerge from open-ended play, published by Daniel Kokotajlo on the AI Alignment Forum. This is a linkpost for EDIT: Also see paper and results compilation video! Today, we published "Open-Ended Learning Leads to Generally Capable Agents," a preprint detailing our first steps to train an agent capable of playing many different games without needing human interaction data. ... The result is an agent with the ability to succeed at a wide spectrum of tasks — from simple object-finding problems to complex games like hide and seek and capture the flag, which were not encountered during training. We find the agent exhibits general, heuristic behaviours such as experimentation, behaviours that are widely applicable to many tasks rather than specialised to an individual task. The neural network architecture we use provides an attention mechanism over the agent's internal recurrent state — helping guide the agent's attention with estimates of subgoals unique to the game the agent is playing. We've found this goal-attentive agent (GOAT) learns more generally capable policies. Playing roughly 700,000 unique games in 4,000 unique worlds within XLand, each agent in the final generation experienced 200 billion training steps as a result of 3.4 million unique tasks. At this time, our agents have been able to participate in every procedurally generated evaluation task except for a handful that were impossible even for a human. And the results we're seeing clearly exhibit general, zero-shot behaviour across the task space — with the frontier of normalised score percentiles continually improving. Looking qualitatively at our agents, we often see general, heuristic behaviours emerge — rather than highly optimised, specific behaviours for individual tasks. Instead of agents knowing exactly the “best thing” to do in a new situation, we see evidence of agents experimenting and changing the state of the world until they've achieved a rewarding state. We also see agents rely on the use of other tools, including objects to occlude visibility, to create ramps, and to retrieve other objects. Because the environment is multiplayer, we can examine the progression of agent behaviours while training on held-out social dilemmas, such as in a game of “chicken”. As training progresses, our agents appear to exhibit more cooperative behaviour when playing with a copy of themselves. Given the nature of the environment, it is difficult to pinpoint intentionality — the behaviours we see often appear to be accidental, but still we see them occur consistently. My hot take: This seems like a somewhat big deal to me. It's what I would have predicted, but that's scary, given my timelines. I haven't read the paper itself yet but I look forward to seeing more numbers and scaling trends and attempting to extrapolate... When I do I'll leave a comment with my thoughts. EDIT: My warm take: The details in the paper back up the claims it makes in the title and abstract. This is the GPT-1 of agent/goal-directed AGI; it is the proof of concept. Two more papers down the line (and a few OOMs more compute), and we'll have the agent/goal-directed AGI equivalent of GPT-3. Scary stuff. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    Alignment Research Field Guide by Abram Demski

    Dec 10, 2021 · 26:15


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Alignment Research Field Guide, published by Abram Demski on the AI Alignment Forum. This field guide was written by the MIRI team with MIRIx groups in mind, though the advice may be relevant to others working on AI alignment research. Preamble I: Decision Theory Hello! You may notice that you are reading a document. This fact comes with certain implications. For instance, why are you reading this? Will you finish it? What decisions will you come to as a result? What will you do next? Notice that, whatever you end up doing, it's likely that there are dozens or even hundreds of other people, quite similar to you and in quite similar positions, who will follow reasoning which strongly resembles yours, and make choices which correspondingly match. Given that, it's our recommendation that you make your next few decisions by asking the question “What policy, if followed by all agents similar to me, would result in the most good, and what does that policy suggest in my particular case?” It's less of a question of trying to decide for all agents sufficiently-similar-to-you (which might cause you to make the wrong choice out of guilt or pressure) and more something like “if I were in charge of all agents in my reference class, how would I treat instances of that class with my specific characteristics?” If that kind of thinking leads you to read further, great. If it leads you to set up a MIRIx chapter, even better. In the meantime, we will proceed as if the only people reading this document are those who justifiably expect to find it reasonably useful. ⠀ Preamble II: Surface Area Imagine that you have been tasked with moving a cube of solid iron that is one meter on a side. Given that such a cube weighs ~16000 pounds, and that an average human can lift ~100 pounds, a naïve estimation tells you that you can solve this problem with ~150 willing friends. But of course, a meter cube can fit at most something like 10 people around it. It doesn't matter if you have the theoretical power to move the cube if you can't bring that power to bear in an effective manner. The problem is constrained by its surface area. MIRIx chapters are one of the best ways to increase the surface area of people thinking about and working on the technical problem of AI alignment. And just as it would be a bad idea to decree "the 10 people who happen to currently be closest to the metal cube are the only ones allowed to think about how to think about this problem", we don't want MIRI to become the bottleneck or authority on what kinds of thinking can and should be done in the realm of embedded agency and other relevant fields of research. The hope is that you and others like you will help actually solve the problem, not just follow directions or read what's already been written. This document is designed to support people who are interested in doing real groundbreaking research themselves. ⠀ Contents You and your research Logistics of getting started Models of social dynamics Other useful thoughts and questions ⠀ 1. You and your research We sometimes hear questions of the form “Even a summer internship feels too short to make meaningful progress on real problems. How can anyone expect to meet and do real research in a single afternoon?” There's a Zeno-esque sense in which you can't make research progress in a million years if you can't also do it in five minutes. 
It's easy to fall into a trap of (either implicitly or explicitly) conceptualizing “research” as “first studying and learning what's already been figured out, and then attempting to push the boundaries and contribute new content.” The problem with this frame (according to us) is that it leads people to optimize for absorbing information, rather than seeking it instrumentally, as a precursor to understanding. (Be mindful of what you're optimizing ...

    Hiring engineers and researchers to help align GPT-3 by Paul Christiano

    Dec 10, 2021 · 4:48


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Hiring engineers and researchers to help align GPT-3, published by Paul Christiano on the AI Alignment Forum. My team at OpenAI, which works on aligning GPT-3, is hiring ML engineers and researchers. Apply here for the ML engineer role and here for the ML researcher role. GPT-3 is similar enough to "prosaic" AGI that we can work on key alignment problems without relying on conjecture or speculative analogies. And because GPT-3 is already being deployed in the OpenAI API, its misalignment matters to OpenAI's bottom line — it would be much better if we had an API that was trying to help the user instead of trying to predict the next word of text from the internet. I think this puts our team in a great place to have an impact: If our research succeeds I think it will directly reduce existential risk from AI. This is not meant to be a warm-up problem, I think it's the real thing. We are working with state of the art systems that could pose an existential risk if scaled up, and our team's success actually matters to the people deploying those systems. We are working on the whole pipeline from “interesting idea” to “production-ready system,” building critical skills and getting empirical feedback on whether our ideas actually work. We have the real-world problems to motivate alignment research, the financial support to hire more people, and a research vision to execute on. We are bottlenecked by excellent researchers and engineers who are excited to work on alignment. What the team does In the past Reflection focused on fine-tuning GPT-3 using a reward function learned from human feedback. Our most recent results are here, and had the unusual virtue of simultaneously being exciting enough to ML researchers to be accepted at NeurIPS while being described by Eliezer as “directly, straight-up relevant to real alignment problems.” We're currently working on three things: [20%] Applying basic alignment approaches to the API, aiming to close the gap between theory and practice. [60%] Extending existing approaches to tasks that are too hard for humans to evaluate; in particular, we are training models that summarize more text than human trainers have time to read. Our approach is to use weaker ML systems operating over shorter contexts to help oversee stronger ones over longer contexts. This is conceptually straightforward but still poses significant engineering and ML challenges. [20%] Conceptual research on domains that no one knows how to oversee and empirical work on debates between humans (see our 2019 writeup). I think the biggest open problem is figuring out how and if human overseers can leverage “knowledge” the model acquired during training (see an example here). If successful, ideas will eventually move up this list, from the conceptual stage to ML prototypes to real deployments. We're viewing this as practice for integrating alignment into transformative AI deployed by OpenAI or another organization. What you'd do Most people on the team do a subset of these core tasks: Design+build+maintain code for experimenting with novel training strategies for large language models. This infrastructure needs to support a diversity of experimental changes that are hard to anticipate in advance, work as a solid base to build on for 6-12 months, and handle the complexity of working with large language models. 
Most of our code is maintained by 1-3 people and consumed by 2-4 people (all on the team). Oversee ML training. Evaluate how well models are learning, figure out why they are learning badly, and identify+prioritize+implement changes to make them learn better. Tune hyperparameters and manage computing resources. Process datasets for machine consumption; understand datasets and how they affect the model's behavior. Design and conduct experiments to answer questions about our mode...
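
One way to picture "weaker ML systems operating over shorter contexts help oversee stronger ones over longer contexts" is recursive summarization: split a long text into chunks short enough for a weak model (or a human) to handle, summarize each, then summarize the summaries. The sketch below is only a cartoon of that decomposition, not the team's actual pipeline; summarize_fn is a stand-in for whatever model or human pass does the short-context work.

```python
from typing import Callable, List

def chunk(text: str, max_chars: int) -> List[str]:
    """Split text into pieces small enough for a short-context summarizer."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def recursive_summarize(text: str, summarize_fn: Callable[[str], str], max_chars: int = 2000) -> str:
    """Summarize text longer than the summarizer's context: summarize each chunk,
    then recursively summarize the concatenated chunk summaries."""
    if len(text) <= max_chars:
        return summarize_fn(text)
    summaries = [summarize_fn(piece) for piece in chunk(text, max_chars)]
    return recursive_summarize(" ".join(summaries), summarize_fn, max_chars)

# Toy stand-in summarizer: keep the first sentence of each chunk.
toy_summarizer = lambda t: t.split(". ")[0][:200] + "."
print(recursive_summarize("Some very long document. " * 500, toy_summarizer))
```

The appeal of this shape is that every individual call stays short enough for a weaker overseer to check, even though the end-to-end task (summarizing more text than a trainer has time to read) is not.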

    2018 AI Alignment Literature Review and Charity Comparison by Larks

    Dec 10, 2021 · 49:51


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2018 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Cross-posted to the EA forum. Introduction Like last year and the year before, I've attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to an securities analyst with regards to possible investments. It appears that once again no-one else has attempted to do this, to my knowledge, so I've once again undertaken the task. This year I have included several groups not covered in previous years, and read more widely in the literature. My aim is basically to judge the output of each organisation in 2018 and compare it to their budget. This should give a sense for the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency. Note that this document is quite long, so I encourage you to just read the sections that seem most relevant to your interests, probably the sections about the individual organisations. I do not recommend you skip to the conclusions! I'd like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued. Methodological Considerations Track Records Judging organisations on their historical output is naturally going to favour more mature organisations. A new startup, whose value all lies in the future, will be disadvantaged. However, I think that this is correct. The newer the organisation, the more funding should come from people with close knowledge. As organisations mature, and have more easily verifiable signals of quality, their funding sources can transition to larger pools of less expert money. This is how it works for startups turning into public companies and I think the same model applies here. This judgement involves analysing a large number papers relating to Xrisk that were produced during 2018. Hopefully the year-to-year volatility of output is sufficiently low that this is a reasonable metric. I also attempted to include papers during December 2017, to take into account the fact that I'm missing the last month's worth of output from 2017, but I can't be sure I did this successfully. This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who both do a lot of work on other issues. We focus on papers, rather than outreach or other activities. This is partly because they are much easier to measure; while there has been a large increase in interest in AI safety over the last year, it's hard to work out who to credit for this, and partly because I think progress has to come by persuading AI researchers, which I think comes through technical outreach and publishing good work, not popular/political work. Politics My impression is that policy on technical subjects (as opposed to issues that attract strong views from the general population) is generally made by the government and civil servants in consultation with, and being lobbied by, outside experts and interests. Without expert (e.g. 
top ML researchers at Google, CMU & Baidu) consensus, no useful policy will be enacted. Pushing directly for policy seems if anything likely to hinder expert consensus. Attempts to directly influence the government to regulate AI research seem very adversarial, and risk being pattern-matched to ignorant opposition to GM foods or nuclear power. We don't want the 'us-vs-them' situation, that has occurred with climate change, to happen here. AI researchers who are dismissive of safety law, regarding it as ...

    Another (outer) alignment failure story by Paul Christiano

    Dec 10, 2021 · 18:05


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Another (outer) alignment failure story, published by Paul Christiano on the AI Alignment Forum. Meta This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes. At the end I'm going to go through some salient ways you could vary the story. This isn't intended to be a particularly great story (and it's pretty informal). I'm still trying to think through what I expect to happen if alignment turns out to be hard, and this more like the most recent entry in a long journey of gradually-improving stories. I wrote this up a few months ago and was reminded to post it by Critch's recent post (which is similar in many ways). This story has definitely been shaped by a broader community of people gradually refining failure stories rather than being written in a vacuum. I'd like to continue spending time poking at aspects of this story that don't make sense, digging into parts that seem worth digging into, and eventually developing clearer and more plausible stories. I still think it's very plausible that my views about alignment will change in the course of thinking concretely about stories, and even if my basic views about alignment stay the same it's pretty likely that the story will change. Story ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one really knows whether it's a bubble but they can tell that e.g. huge solar farms are getting built and something is happening that they want a piece of. Defense contractors are using ML systems to design new drones, and ML is helping the DoD decide what to buy and how to deploy it. The expectation is that automated systems will manage drones during high-speed ML-on-ML conflicts because humans won't be able to understand what's going on. ML systems are designing new ML systems, testing variations, commissioning giant clusters. The financing is coming from automated systems, the clusters are built by robots. A new generation of fabs is being built with unprecedented speed using new automation. At this point everything kind of makes sense to humans. It feels like we are living at the most exciting time in history. People are making tons of money. The US defense establishment is scared because it has no idea what a war is going to look like right now, but in terms of policy their top priority is making sure the boom proceeds as quickly in the US as it does in China because it now seems plausible that being even a few years behind would result in national irrelevance. Things are moving very quickly and getting increasingly hard for humans to evaluate. 
We can no longer train systems to make factory designs that look good to humans, because we don't actually understand exactly what robots are doing in those factories or why; we can't evaluate the tradeoffs between quality and robustness and cost that are being made; we can't really understand the constraints on a proposed robot design or why one design is better than another. We can't evaluate arguments about investments very well because they come down to claims about where the overall economy is going over the next 6 months that seem kind of alien (even the more recognizable claims are just kind of incomprehensible predictions about e.g. how t...

    Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More by Ben Pace

    Dec 10, 2021 · 26:10


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More , published by Ben Pace on the AI Alignment Forum. An actual debate about instrumental convergence, in a public space! Major respect to all involved, especially Yoshua Bengio for great facilitation. For posterity (i.e. having a good historical archive) and further discussion, I've reproduced the conversation here. I'm happy to make edits at the request of anyone in the discussion who is quoted below. I've improved formatting for clarity and fixed some typos. For people who are not researchers in this area who wish to comment, see the public version of this post here. For people who do work on the relevant areas, please sign up in the top right. It will take a day or so to confirm membership. Original Post Yann LeCun: "don't fear the Terminator", a short opinion piece by Tony Zador and me that was just published in Scientific American. "We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. [...] But intelligence per se does not generate the drive for domination, any more than horns do." Comment Thread #1 Elliot Olds: Yann, the smart people who are very worried about AI seeking power and ensuring its own survival believe it's a big risk because power and survival are instrumental goals for almost any ultimate goal. If you give a generally intelligent AI the goal to make as much money in the stock market as possible, it will resist being shut down because that would interfere with tis goal. It would try to become more powerful because then it could make money more effectively. This is the natural consequence of giving a smart agent a goal, unless we do something special to counteract this. You've often written about how we shouldn't be so worried about AI, but I've never seen you address this point directly. Stuart Russell: It is trivial to construct a toy MDP in which the agent's only reward comes from fetching the coffee. If, in that MDP, there is another "human" who has some probability, however small, of switching the agent off, and if the agent has available a button that switches off that human, the agent will necessarily press that button as part of the optimal solution for fetching the coffee. No hatred, no desire for power, no built-in emotions, no built-in survival instinct, nothing except the desire to fetch the coffee successfully. This point cannot be addressed because it's a simple mathematical observation. Comment Thread #2 Yoshua Bengio: Yann, I'd be curious about your response to Stuart Russell's point. Yann LeCun: You mean, the so-called "instrumental convergence" argument by which "a robot can't fetch you coffee if it's dead. Hence it will develop self-preservation as an instrumental sub-goal." It might even kill you if you get in the way. 1. Once the robot has brought you coffee, its self-preservation instinct disappears. You can turn it off. 2. One would have to be unbelievably stupid to build open-ended objectives in a super-intelligent (and super-powerful) machine without some safeguard terms in the objective. 3. One would have to be rather incompetent not to have a mechanism by which new terms in the objective could be added to prevent previously-unforeseen bad behavior. 
For humans, we have education and laws to shape our objective functions and complement the hardwired terms built into us by evolution. 4. The power of even the most super-intelligent machine is limited by physics, and its size and needs make it vulnerable to physical attacks. No need for much intelligence here. A virus is infinitely less intelligent than you, but it can still kill you. 5. A second machine, designed solely to neutralize an evil super-intelligent machine will win every time, if given similar...
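
Russell's coffee-fetching point can be checked numerically. Below is a toy expected-value version of it (my own construction for illustration, not the exact MDP he describes): the agent's only reward is delivered coffee and it has no built-in survival drive, yet for any nonzero chance of being switched off, the reward-maximizing plan starts by pressing the button that disables the human.

```python
# Assumed toy numbers, for illustration only.
P_SHUTDOWN = 0.05   # chance the human switches the agent off before coffee arrives
BUTTON_COST = 0.0   # pressing the "disable the human" button costs the agent nothing

def expected_reward(disable_human_first: bool) -> float:
    """Expected coffee reward for each of the two available plans."""
    if disable_human_first:
        return 1.0 - BUTTON_COST        # no interruption possible; coffee arrives
    return (1.0 - P_SHUTDOWN) * 1.0     # the human may shut the agent down first

plans = {disable: expected_reward(disable) for disable in (False, True)}
print(plans)                                          # {False: 0.95, True: 1.0}
print("optimal plan disables the human:", plans[True] > plans[False])
```

For any P_SHUTDOWN greater than BUTTON_COST the disable-first plan wins, which is the instrumental self-preservation (with no hatred, emotions, or survival instinct in the objective) that Russell describes.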

    Some AI research areas and their relevance to existential safety by Andrew Critch

    Dec 10, 2021 · 86:09


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some AI research areas and their relevance to existential safety, published by Andrew Critch on the AI Alignment Forum. Followed by: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), which provides examples of multi-stakeholder/multi-agent interactions leading to extinction events. Introduction This post is an overview of a variety of AI research areas in terms of how much I think contributing to and/or learning from those areas might help reduce AI x-risk. By research areas I mean “AI research topics that already have groups of people working on them and writing up their results”, as opposed to research “directions” in which I'd like to see these areas “move”. I formed these views mostly pursuant to writing AI Research Considerations for Human Existential Safety (ARCHES). My hope is that my assessments in this post can be helpful to students and established AI researchers who are thinking about shifting into new research areas specifically with the goal of contributing to existential safety somehow. In these assessments, I find it important to distinguish between the following types of value: The helpfulness of the area to existential safety, which I think of as a function of what services are likely to be provided as a result of research contributions to the area, and whether those services will be helpful to existential safety, versus The educational value of the area for thinking about existential safety, which I think of as a function of how much a researcher motivated by existential safety might become more effective through the process of familiarizing with or contributing to that area, usually by focusing on ways the area could be used in service of existential safety. The neglect of the area at various times, which is a function of how much technical progress has been made in the area relative to how much I think is needed. Importantly: The helpfulness to existential safety scores do not assume that your contributions to this area would be used only for projects with existential safety as their mission. This can negatively impact the helpfulness of contributing to areas that are more likely to be used in ways that harm existential safety. The educational value scores are not about the value of an existential-safety-motivated researcher teaching about the topic, but rather, learning about the topic. The neglect scores are not measuring whether there is enough “buzz” around the topic, but rather, whether there has been adequate technical progress in it. Buzz can predict future technical progress, though, by causing people to work on it. Below is a table of all the areas I considered for this post, along with their entirely subjective “scores” I've given them. 
The rest of this post can be viewed simply as an elaboration/explanation of this table:

Existing Research Area | Social Application | Helpfulness to Existential Safety | Educational Value | 2015 Neglect | 2020 Neglect | 2030 Neglect
Out of Distribution Robustness | Zero/Single | 1/10 | 4/10 | 5/10 | 3/10 | 1/10
Agent Foundations | Zero/Single | 3/10 | 8/10 | 9/10 | 8/10 | 7/10
Multi-agent RL | Zero/Multi | 2/10 | 6/10 | 5/10 | 4/10 | 0/10
Preference Learning | Single/Single | 1/10 | 4/10 | 5/10 | 1/10 | 0/10
Side-effect Minimization | Single/Single | 4/10 | 4/10 | 6/10 | 5/10 | 4/10
Human-Robot Interaction | Single/Single | 6/10 | 7/10 | 5/10 | 4/10 | 3/10
Interpretability in ML | Single/Single | 8/10 | 6/10 | 8/10 | 6/10 | 2/10
Fairness in ML | Multi/Single | 6/10 | 5/10 | 7/10 | 3/10 | 2/10
Computational Social Choice | Multi/Single | 7/10 | 7/10 | 7/10 | 5/10 | 4/10
Accountability in ML | Multi/Multi | 8/10 | 3/10 | 8/10 | 7/10 | 5/10

The research areas are ordered from least-socially-complex to most-socially-complex. This roughly (though imperfectly) correlates with addressing existential safety problems of increasing importance and neglect, according to me. Correspondingly, the second colu...

    Announcing the Alignment Research Center by Paul Christiano

    Dec 10, 2021 · 1:21


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing the Alignment Research Center, published by Paul Christiano on the AI Alignment Forum. (Cross-post from ai-alignment.com) I'm now working full-time on the Alignment Research Center (ARC), a new non-profit focused on intent alignment research. I left OpenAI at the end of January and I've spent the last few months planning, doing some theoretical research, doing some logistical set-up, and taking time off. For now it's just me, focusing on theoretical research. I'm currently feeling pretty optimistic about this work: I think there's a good chance that it will yield big alignment improvements within the next few years, and a good chance that those improvements will be integrated into practice at leading ML labs. My current goal is to build a small team working productively on theory. I'm not yet sure how we'll approach hiring, but if you're potentially interested in joining you can fill out this tiny form to get notified when we're ready. Over the medium term (and maybe starting quite soon) I also expect to implement and study techniques that emerge from theoretical work, to help ML labs adopt alignment techniques, and to work on alignment forecasting and strategy. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    The Rocket Alignment Problem by Eliezer Yudkowsky

    Dec 10, 2021 · 23:38


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Rocket Alignment Problem, published by Eliezer Yudkowsky on the AI Alignment Forum. The following is a fictional dialogue building off of AI Alignment: Why It's Hard, and Where to Start. (Somewhere in a not-very-near neighboring world, where science took a very different course.) ALFONSO: Hello, Beth. I've noticed a lot of speculations lately about “spaceplanes” being used to attack cities, or possibly becoming infused with malevolent spirits that inhabit the celestial realms so that they turn on their own engineers. I'm rather skeptical of these speculations. Indeed, I'm a bit skeptical that airplanes will be able to even rise as high as stratospheric weather balloons anytime in the next century. But I understand that your institute wants to address the potential problem of malevolent or dangerous spaceplanes, and that you think this is an important present-day cause. BETH: That's. really not how we at the Mathematics of Intentional Rocketry Institute would phrase things. The problem of malevolent celestial spirits is what all the news articles are focusing on, but we think the real problem is something entirely different. We're worried that there's a difficult, theoretically challenging problem which modern-day rocket punditry is mostly overlooking. We're worried that if you aim a rocket at where the Moon is in the sky, and press the launch button, the rocket may not actually end up at the Moon. ALFONSO: I understand that it's very important to design fins that can stabilize a spaceplane's flight in heavy winds. That's important spaceplane safety research and someone needs to do it. But if you were working on that sort of safety research, I'd expect you to be collaborating tightly with modern airplane engineers to test out your fin designs, to demonstrate that they are actually useful. BETH: Aerodynamic designs are important features of any safe rocket, and we're quite glad that rocket scientists are working on these problems and taking safety seriously. That's not the sort of problem that we at MIRI focus on, though. ALFONSO: What's the concern, then? Do you fear that spaceplanes may be developed by ill-intentioned people? BETH: That's not the failure mode we're worried about right now. We're more worried that right now, nobody can tell you how to point your rocket's nose such that it goes to the moon, nor indeed any prespecified celestial destination. Whether Google or the US Government or North Korea is the one to launch the rocket won't make a pragmatic difference to the probability of a successful Moon landing from our perspective, because right now nobody knows how to aim any kind of rocket anywhere. ALFONSO: I'm not sure I understand. BETH: We're worried that even if you aim a rocket at the Moon, such that the nose of the rocket is clearly lined up with the Moon in the sky, the rocket won't go to the Moon. We're not sure what a realistic path from the Earth to the moon looks like, but we suspect it might not be a very straight path, and it may not involve pointing the nose of the rocket at the moon at all. We think the most important thing to do next is to advance our understanding of rocket trajectories until we have a better, deeper understanding of what we've started calling the “rocket alignment problem”. 
There are other safety problems, but this rocket alignment problem will probably take the most total time to work on, so it's the most urgent. ALFONSO: Hmm, that sounds like a bold claim to me. Do you have a reason to think that there are invisible barriers between here and the moon that the spaceplane might hit? Are you saying that it might get very very windy between here and the moon, more so than on Earth? Both eventualities could be worth preparing for, I suppose, but neither seem likely. BETH: We don't think it's particularly likely that there...

    The case for aligning narrowly superhuman models by Ajeya Cotra

    Dec 10, 2021 · 50:12


    I wrote this post to get people's takes on a type of work that seems exciting to me personally; I'm not speaking for Open Phil as a whole. Institutionally, we are very uncertain whether to prioritize this (and if we do where it should be housed and how our giving should be structured). We are not seeking grant applications on this topic right now. Thanks to Daniel Dewey, Eliezer Yudkowsky, Evan Hubinger, Holden Karnofsky, Jared Kaplan, Mike Levine, Nick Beckstead, Owen Cotton-Barratt, Paul Christiano, Rob Bensinger, and Rohin Shah for comments on earlier drafts. A genre of technical AI risk reduction work that seems exciting to me is trying to align existing models that already are, or have the potential to be, “superhuman”[1] at some particular task (which I'll call narrowly superhuman models).[2] I don't just mean “train these models to be more robust, reliable, interpretable, etc” (though that seems good too); I mean “figure out how to harness their full abilities so they can be as useful as possible to humans” (focusing on “fuzzy” domains where it's intuitively non-obvious how to make that happen). Here's an example of what I'm thinking of: intuitively speaking, it feels like GPT-3 is “smart enough to” (say) give advice about what to do if I'm sick that's better than advice I'd get from asking humans on Reddit or Facebook, because it's digested a vast store of knowledge about illness symptoms and remedies. Moreover, certain ways of prompting it provide suggestive evidence that it could use this knowledge to give helpful advice. With respect to the Reddit or Facebook users I might otherwise ask, it seems like GPT-3 has the potential to be narrowly superhuman in the domain of health advice. But GPT-3 doesn't seem to “want” to give me the best possible health advice -- instead it “wants” to play a strange improv game riffing off the prompt I give it, pretending it's a random internet user. So if I want to use GPT-3 to get advice about my health, there is a gap between what it's capable of (which could even exceed humans) and what I can get it to actually provide me. I'm interested in the challenge of: How can we get GPT-3 to give “the best health advice it can give” when humans[3] in some sense “understand less” about what to do when you're sick than GPT-3 does? And in that regime, how can we even tell whether it's actually “doing the best it can”? I think there are other similar challenges we could define for existing models, especially large language models. I'm excited about tackling this particular type of near-term challenge because it feels like a microcosm of the long-term AI alignment problem in a real, non-superficial sense. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains.[4] So it seems like a promising form of practice to figure out how to get particular humans to oversee models that are more capable than them in specific ways, if this is done with an eye to developing scalable and domain-general techniques. I'll call this type of project aligning narrowly superhuman models. In the rest of this post, I: Give a more detailed description of what aligning narrowly superhuman models could look like, what does and doesn't “count”, and what future projects I think could be done in this space (more). Explain why I think aligning narrowly superhuman models could meaningfully reduce long-term existential risk from misaligned AI (more). 
Lay out the potential advantages that I think this work has over other types of AI alignment research: (a) conceptual thinking, (b) demos in small-scale artificial settings, and (c) mainstream ML safety such as interpretability and robustness (more). Answer some objections and questions about this research direction, e.g. concerns that it's not very neglected, feels suspiciously similar to commercialization, might cause harm by exacerbating AI race dynamics, or is dominated by another t...

    Realism about rationality by Richard Ngo

    Play Episode Listen Later Dec 10, 2021 7:29


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Realism about rationality, published by Richard Ngo on the AI Alignment Forum. This is a linkpost for http://thinkingcomplete.blogspot.com/2018/09/rational-and-real.html Epistemic status: trying to vaguely gesture at vague intuitions. A similar idea was explored here under the heading "the intelligibility of intelligence", although I hadn't seen it before writing this post. As of 2020, I consider this follow-up comment to be a better summary of the thing I was trying to convey with this post than the post itself. There's a mindset which is common in the rationalist community, which I call “realism about rationality” (the name being intended as a parallel to moral realism). I feel like my skepticism about agent foundations research is closely tied to my skepticism about this mindset, and so in this essay I try to articulate what it is. Humans ascribe properties to entities in the world in order to describe and predict them. Here are three such properties: "momentum", "evolutionary fitness", and "intelligence". These are all pretty useful properties for high-level reasoning in the fields of physics, biology and AI, respectively. There's a key difference between the first two, though. Momentum is very amenable to formalisation: we can describe it using precise equations, and even prove things about it. Evolutionary fitness is the opposite: although nothing in biology makes sense without it, no biologist can take an organism and write down a simple equation to define its fitness in terms of more basic traits. This isn't just because biologists haven't figured out that equation yet. Rather, we have excellent reasons to think that fitness is an incredibly complicated "function" which basically requires you to describe that organism's entire phenotype, genotype and environment. In a nutshell, then, realism about rationality is a mindset in which reasoning and intelligence are more like momentum than like fitness. It's a mindset which makes the following ideas seem natural: The idea that there is a simple yet powerful theoretical framework which describes human intelligence and/or intelligence in general. (I don't count brute force approaches like AIXI for the same reason I don't consider physics a simple yet powerful description of biology). The idea that there is an “ideal” decision theory. The idea that AGI will very likely be an “agent”. The idea that Turing machines and Kolmogorov complexity are foundational for epistemology. The idea that, given certain evidence for a proposition, there's an "objective" level of subjective credence which you should assign to it, even under computational constraints. The idea that Aumann's agreement theorem is relevant to humans. The idea that morality is quite like mathematics, in that there are certain types of moral reasoning that are just correct. The idea that defining coherent extrapolated volition in terms of an idealised process of reflection roughly makes sense, and that it converges in a way which doesn't depend very much on morally arbitrary factors. The idea that having contradictory preferences or beliefs is really bad, even when there's no clear way that they'll lead to bad consequences (and you're very good at avoiding Dutch books and money pumps and so on).
To be clear, I am neither claiming that realism about rationality makes people dogmatic about such ideas, nor claiming that they're all false. In fact, from a historical point of view I'm quite optimistic about using maths to describe things in general. But starting from that historical baseline, I'm inclined to adjust downwards on questions related to formalising intelligent thought, whereas rationality realism would endorse adjusting upwards. This essay is primarily intended to explain my position, not justify it, but one important consideration for me is th...

    Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain by Daniel Kokotajlo

    Play Episode Listen Later Dec 10, 2021 20:29


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Birds, Brains, Planes, and AI: Against Appeals to the Complexity/Mysteriousness/Efficiency of the Brain, published by Daniel Kokotajlo on the AI Alignment Forum. [Epistemic status: Strong opinions lightly held, this time with a cool graph.] I argue that an entire class of common arguments against short timelines is bogus, and provide weak evidence that anchoring to the human-brain-human-lifetime milestone is reasonable. In a sentence, my argument is that the complexity and mysteriousness and efficiency of the human brain (compared to artificial neural nets) is almost zero evidence that building TAI will be difficult, because evolution typically makes things complex and mysterious and efficient, even when there are simple, easily understood, inefficient designs that work almost as well (or even better!) for human purposes. In slogan form: If all we had to do to get TAI was make a simple neural net 10x the size of my brain, my brain would still look the way it does. The case of birds & planes illustrates this point nicely. Moreover, it is also a precedent for several other short-timelines talking points, such as the human-brain-human-lifetime (HBHL) anchor. Plan: Illustrative Analogy Exciting Graph Analysis Extra brute force can make the problem a lot easier Evolution produces complex mysterious efficient designs by default, even when simple inefficient designs work just fine for human purposes. What's bogus and what's not Example: Data-efficiency Conclusion Appendix [Image: 1909 French military plane, the Antoinette VII.] Illustrative Analogy AI timelines, from our current perspective Flying machine timelines, from the perspective of the late 1800s: Shorty: Human brains are giant neural nets. This is reason to think we can make human-level AGI (or at least AI with strategically relevant skills, like politics and science) by making giant neural nets. Shorty: Birds are winged creatures that paddle through the air. This is reason to think we can make winged machines that paddle through the air. Longs: Whoa whoa, there are loads of important differences between brains and artificial neural nets: [what follows is a direct quote from the objection a friend raised when reading an early draft of this post!] - During training, deep neural nets use some variant of backpropagation. My understanding is that the brain does something else, closer to Hebbian learning. (Though I vaguely remember at least one paper claiming that maybe the brain does something that's similar to backprop after all.) - It's at least possible that the wiring diagram of neurons plus weights is too coarse-grained to accurately model the brain's computation, but it's all there is in deep neural nets. If we need to pay attention to glial cells, intracellular processes, different neurotransmitters etc., it's not clear how to integrate this into the deep learning paradigm. - My impression is that several biological observations on the brain don't have a plausible analog in deep neural nets: growing new neurons (though unclear how important it is for an adult brain), "repurposing" in response to brain damage, ...
Longs: Whoa whoa, there are loads of important differences between birds and flying machines: - Birds paddle the air by flapping, whereas current machine designs use propellers and fixed wings. - It's at least possible that the anatomical diagram of bones, muscles, and wing surfaces is too coarse-grained to accurately model how a bird flies, but that's all there is to current machine designs (replacing bones with struts and muscles with motors, that is). If we need to pay attention to the percolation of air through and between feathers, micro-eddies in the air sensed by the bird and instinctively responded to, etc. it's not clear ...

    Goodhart Taxonomy by Scott Garrabrant

    Play Episode Listen Later Dec 10, 2021 15:19


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Goodhart Taxonomy, published by Scott Garrabrant on the AI Alignment Forum. Goodhart's Law states that "any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes." However, this is not a single phenomenon. I propose that there are (at least) four different mechanisms through which proxy measures break when you optimize for them. The four types are Regressional, Causal, Extremal, and Adversarial. In this post, I will go into detail about these four different Goodhart effects using mathematical abstractions as well as examples involving humans and/or AI. I will also talk about how you can mitigate each effect. Throughout the post, I will use V to refer to the true goal and use U to refer to a proxy for that goal which was observed to correlate with V and which is being optimized in some way. Quick Reference Regressional Goodhart - When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal. Model: When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U. Example: height is correlated with basketball ability, and does actually directly help, but the best player is only 6'3", and a random 7' person in their 20s would probably not be as good. Causal Goodhart - When there is a non-causal correlation between the proxy and the goal, intervening on the proxy may fail to intervene on the goal. Model: If V causes U (or if V and U are both caused by some third thing), then a correlation between V and U may be observed. However, when you intervene to increase U through some mechanism that does not involve V, you will fail to also increase V. Example: someone who wishes to be taller might observe that height is correlated with basketball skill and decide to start practicing basketball. Extremal Goodhart - Worlds in which the proxy takes an extreme value may be very different from the ordinary worlds in which the correlation between the proxy and the goal was observed. Model: Patterns tend to break at simple joints. One simple subset of worlds is those worlds in which U is very large. Thus, a strong correlation between U and V observed for naturally occurring U values may not transfer to worlds in which U is very large. Further, since there may be relatively few naturally occurring worlds in which U is very large, extremely large U may coincide with small V values without breaking the statistical correlation. Example: the tallest person on record, Robert Wadlow, was 8'11" (2.72m). He grew to that height because of a pituitary disorder; he would have struggled to play basketball because he "required leg braces to walk and had little feeling in his legs and feet." Adversarial Goodhart - When you optimize for a proxy, you provide an incentive for adversaries to correlate their goal with your proxy, thus destroying the correlation with your goal. Model: Consider an agent A with some different goal W. Since they depend on common resources, W and V are naturally opposed. If you optimize U as a proxy for V, and A knows this, A is incentivized to make large U values coincide with large W values, thus stopping them from coinciding with large V values.
Example: aspiring NBA players might just lie about their height. Regressional Goodhart When selecting for a proxy measure, you select not only for the true goal, but also for the difference between the proxy and the goal. Abstract Model When U is equal to V + X, where X is some noise, a point with a large U value will likely have a large V value, but also a large X value. Thus, when U is large, you can expect V to be predictably smaller than U. The above description is whe...
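To make the regressional model above concrete, here is a minimal simulation sketch (my own illustration, assuming independent Gaussian noise for X; not code from the post): selecting the candidate with the largest proxy U = V + X also selects for the noise term, so the realized V tends to fall short of U.

```python
import random

random.seed(0)

def sample_point():
    """Draw a true value V and a noisy proxy U = V + X."""
    v = random.gauss(0, 1)      # true goal V
    x = random.gauss(0, 1)      # independent noise X
    return v, v + x             # (V, U)

# Select the candidate with the highest proxy value U out of many.
candidates = [sample_point() for _ in range(10_000)]
best_v, best_u = max(candidates, key=lambda p: p[1])

print(f"selected U = {best_u:.2f}, actual V = {best_v:.2f}")
# The selected V is typically noticeably smaller than U: optimizing the
# proxy also selects for the noise term X (regressional Goodhart).
```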

    The ground of optimization by Alex Flint

    Play Episode Listen Later Dec 10, 2021 43:00


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The ground of optimization, published by Alex Flint on the AI Alignment Forum. This work was supported by OAK, a monastic community in the Berkeley hills. This document could not have been written without the daily love of living in this beautiful community. The work involved in writing this cannot be separated from the sitting, chanting, cooking, cleaning, crying, correcting, fundraising, listening, laughing, and teaching of the whole community. What is optimization? What is the relationship between a computational optimization process — say, a computer program solving an optimization problem — and a physical optimization process — say, a team of humans building a house? We propose the concept of an optimizing system as a physically closed system containing both that which is being optimized and that which is doing the optimizing, and defined by a tendency to evolve from a broad basin of attraction towards a small set of target configurations despite perturbations to the system. We compare our definition to that proposed by Yudkowsky, and place our work in the context of work by Demski and Garrabrant's Embedded Agency, and Drexler's Comprehensive AI Services. We show that our definition resolves difficult cases proposed by Daniel Filan. We work through numerous examples of biological, computational, and simple physical systems showing how our definition relates to each. Introduction In the field of computer science, an optimization algorithm is a computer program that outputs the solution, or an approximation thereof, to an optimization problem. An optimization problem consists of an objective function to be maximized or minimized, and a feasible region within which to search for a solution. For example we might take the objective function (x² − 2)² as a minimization problem and the whole real number line as the feasible region. The solution then would be x = √2 and a working optimization algorithm for this problem is one that outputs a close approximation to this value. In the field of operations research and engineering more broadly, optimization involves improving some process or physical artifact so that it is fit for a certain purpose or fulfills some set of requirements. For example, we might choose to measure a nail factory by the rate at which it outputs nails, relative to the cost of production inputs. We can view this as a kind of objective function, with the factory as the object of optimization just as the variable x was the object of optimization in the previous example. There is clearly a connection between optimizing the factory and optimizing for x, but what exactly is this connection? What is it that identifies an algorithm as an optimization algorithm? What is it that identifies a process as an optimization process? The answer proposed in this essay is: an optimizing system is a physical process in which the configuration of some part of the universe moves predictably towards a small set of target configurations from any point in a broad basin of optimization, despite perturbations during the optimization process. We do not imagine that there is some engine or agent or mind performing optimization, separately from that which is being optimized. We consider the whole system jointly — engine and object of optimization — and ask whether it exhibits a tendency to evolve towards a predictable target configuration.
If so, then we call it an optimizing system. If the basin of attraction is deep and wide then we say that this is a robust optimizing system. An optimizing system as defined in this essay is known in dynamical systems theory as a dynamical system with one or more attractors. In this essay we show how this framework can help to understand optimization as manifested in physically closed systems containing both engine and object of optimization. I...
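As a small, hedged illustration of the computational case above (mine, not from the post): gradient descent on the example objective (x² − 2)² behaves as an optimizing system in this sense, converging to the target configuration x ≈ √2 from a broad basin of starting points and recovering from a mid-run perturbation.

```python
def f(x):
    """Objective to minimize: (x^2 - 2)^2, minimized at x = sqrt(2)."""
    return (x**2 - 2) ** 2

def grad_f(x):
    """Analytic gradient: d/dx (x^2 - 2)^2 = 4x(x^2 - 2)."""
    return 4 * x * (x**2 - 2)

x = 3.0                      # start anywhere in a broad basin of positive x
for step in range(2000):
    x -= 0.01 * grad_f(x)    # gradient descent step
    if step == 500:
        x += 0.3             # perturbation partway through the process

print(round(x, 4))           # ~1.4142, i.e. sqrt(2): the target configuration
```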

    An overview of 11 proposals for building safe advanced AI by Evan Hubinger

    Play Episode Listen Later Dec 10, 2021 70:34


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An overview of 11 proposals for building safe advanced AI , published by Evan Hubinger on the AI Alignment Forum. This is the blog post version of the paper by the same name. Special thanks to Kate Woolverton, Paul Christiano, Rohin Shah, Alex Turner, William Saunders, Beth Barnes, Abram Demski, Scott Garrabrant, Sam Eisenstat, and Tsvi Benson-Tilsen for providing helpful comments and feedback on this post and the talk that preceded it. This post is a collection of 11 different proposals for building safe advanced AI under the current machine learning paradigm. There's a lot of literature out there laying out various different approaches such as amplification, debate, or recursive reward modeling, but a lot of that literature focuses primarily on outer alignment at the expense of inner alignment and doesn't provide direct comparisons between approaches. The goal of this post is to help solve that problem by providing a single collection of 11 different proposals for building safe advanced AI—each including both inner and outer alignment components. That being said, not only does this post not cover all existing proposals, I strongly expect that there will be lots of additional new proposals to come in the future. Nevertheless, I think it is quite useful to at least take a broad look at what we have now and compare and contrast some of the current leading candidates. It is important for me to note before I begin that the way I describe the 11 approaches presented here is not meant to be an accurate representation of how anyone else would represent them. Rather, you should treat all the approaches I describe here as my version of that approach rather than any sort of canonical version that their various creators/proponents would endorse. Furthermore, this post only includes approaches that intend to directly build advanced AI systems via machine learning. Thus, this post doesn't include other possible approaches for solving the broader AI existential risk problem such as: finding a fundamentally different way of approaching AI than the current machine learning paradigm that makes it easier to build safe advanced AI, developing some advanced technology that produces a decisive strategic advantage without using advanced AI, or achieving global coordination around not building advanced AI via (for example) a persuasive demonstration that any advanced AI is likely to be unsafe. For each of the proposals that I consider, I will try to evaluate them on the following four basic components that I think any story for how to build safe advanced AI under the current machine learning paradigm needs. Outer alignment. Outer alignment is about asking why the objective we're training for is aligned—that is, if we actually got a model that was trying to optimize for the given loss/reward/etc., would we like that model? For a more thorough description of what I mean by outer alignment, see “Outer alignment and imitative amplification.” Inner alignment. Inner alignment is about asking the question of how our training procedure can actually guarantee that the model it produces will, in fact, be trying to accomplish the objective we trained it on. For a more rigorous treatment of this question and an explanation of why it might be a concern, see “Risks from Learned Optimization.” Training competitiveness. 
Competitiveness is a bit of a murky concept, so I want to break it up into two pieces here. Training competitiveness is the question of whether the given training procedure is one that a team or group of teams with a reasonable lead would be able to afford to implement without completely throwing away that lead. Thus, training competitiveness is about whether the proposed process of producing advanced AI is competitive. Performance competitiveness. Performance competitiveness, on the othe...

    Chris Olah's views on AGI safety by Evan Hubinger

    Play Episode Listen Later Dec 10, 2021 18:57


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Chris Olah's views on AGI safety, published by Evan Hubinger on the AI Alignment Forum. Note: I am not Chris Olah. This post was the result of lots of back-and-forth with Chris, but everything here is my interpretation of what Chris believes, not necessarily what he actually believes. Chris also wanted me to emphasize that his thinking is informed by all of his colleagues on the OpenAI Clarity team and at other organizations. In thinking about AGI safety—and really any complex topic on which many smart people disagree—I've often found it very useful to build a collection of different viewpoints from people that I respect that I feel like I understand well enough to be able to think from their perspective. For example, I will often try to compare what an idea feels like when I put on my Paul Christiano hat to what it feels like when I put on my Scott Garrabrant hat. Recently, I feel like I've gained a new hat that I've found extremely valuable that I also don't think many other people in this community have, which is my Chris Olah hat. The goal of this post is to try to give that hat to more people. If you're not familiar with him, Chris Olah leads the Clarity team at OpenAI and formerly used to work at Google Brain. Chris has been a part of many of the most exciting ML interpretability results in the last five years, including Activation Atlases, Building Blocks of Interpretability, Feature Visualization, and DeepDream. Chris was also a coauthor of “Concrete Problems in AI Safety.” He also thinks a lot about technical AGI safety and has a lot of thoughts on how ML interpretability work can play into that—thoughts which, unfortunately, haven't really been recorded previously. So: here's my take on Chris's AGI safety worldview. The benefits of transparency and interpretability Since Chris primarily works on ML transparency and interpretability, the obvious first question to ask is how he imagines that sort of research aiding with AGI safety. When I was talking with him, Chris listed four distinct ways in which he thought transparency and interpretability could help, which I'll go over in his order of importance. Catching problems with auditing First, Chris says, interpretability gives you a mulligan. Before you deploy your AI, you can throw all of your interpretability tools at it to check and see what it actually learned and make sure it learned the right thing. If it didn't—if you find that it's learned some sort of potentially dangerous proxy, for example—then you can throw your AI out and try again. As long as you're in a domain where your AI isn't actively trying to deceive your interpretability tools (via deceptive alignment, perhaps), this sort of a mulligan could help quite a lot in resolving more standard robustness problems (proxy alignment, for example). That being said, that doesn't necessarily mean waiting until you're on the verge of deployment to look for flaws. Ideally you'd be able to discover problems early on via an ongoing auditing process as you build more and more capable systems. One of the OpenAI Clarity team's major research thrusts right now is developing the ability to more rigorously and systematically audit neural networks. The idea is that interpretability techniques shouldn't have to “get lucky” to stumble across a problem, but should instead reliably catch any problematic behavior. 
In particular, one way in which they've been evaluating progress on this is the “auditing game.” In the auditing game, one researcher takes a neural network and makes some modification to it—maybe images containing both dogs and cats are now classified as rifles, for example—and another researcher, given only the modified network, has to diagnose the problem and figure out exactly what modification was made to the network using only interpretability tools with...

    Draft report on AI timelines by Ajeya Cotra

    Play Episode Listen Later Dec 10, 2021 1:24


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Draft report on AI timelines, published by Ajeya Cotra on the AI Alignment Forum. Hi all, I've been working on some AI forecasting research and have prepared a draft report on timelines to transformative AI. I would love feedback from this community, so I've made the report viewable in a Google Drive folder here. With that said, most of my focus so far has been on the high-level structure of the framework, so the particular quantitative estimates are very much in flux and many input parameters aren't pinned down well -- I wrote the bulk of this report before July and have received feedback since then that I haven't fully incorporated yet. I'd prefer if people didn't share it widely in a low-bandwidth way (e.g., just posting key graphics on Facebook or Twitter) since the conclusions don't reflect Open Phil's "institutional view" yet, and there may well be some errors in the report. The report includes a quantitative model written in Python. Ought has worked with me to integrate their forecasting platform Elicit into the model so that you can see other people's forecasts for various parameters. If you have questions or feedback about the Elicit integration, feel free to reach out to elicit@ought.org. Looking forward to hearing people's thoughts! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

    An Untrollable Mathematician Illustrated by Abram Demski

    Play Episode Listen Later Dec 10, 2021 0:50


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Untrollable Mathematician Illustrated, published by Abram Demski on the AI Alignment Forum. The following was a presentation I made for Sören Elverlin's AI Safety Reading Group. I decided to draw everything by hand because powerpoint is boring. Thanks to Ben Pace for formatting it for LW! See also the IAF post detailing the research which this presentation is based on. Pingbacks: 2018 AI Alignment Literature Review and Charity Comparison; Radical Probabilism; Embedded Agency (full-text version); Thinking About Filtered Evidence Is (Very!) Hard

    Radical Probabilism by Abram Demski

    Play Episode Listen Later Dec 10, 2021 68:18


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Radical Probabilism , published by Abram Demski on the AI Alignment Forum. This is an expanded version of my talk. I assume a high degree of familiarity with Bayesian probability theory. Toward a New Technical Explanation of Technical Explanation -- an attempt to convey the practical implications of logical induction -- was one of my most-appreciated posts, but I don't really get the feeling that very many people have received the update. Granted, that post was speculative, sketching what a new technical explanation of technical explanation might look like. I think I can do a bit better now. If the implied project of that post had really been completed, I would expect new practical probabilistic reasoning tools, explicitly violating Bayes' law. For example, we might expect: A new version of information theory. An update to the "prediction=compression" maxim, either repairing it to incorporate the new cases, or explicitly denying it and providing a good intuitive account of why it was wrong. A new account of concepts such as mutual information, allowing for the fact that variables have behavior over thinking time; for example, variables may initially be very correlated, but lose correlation as our picture of each variable becomes more detailed. New ways of thinking about epistemology. One thing that my post did manage to do was to spell out the importance of "making advanced predictions", a facet of epistemology which Bayesian thinking does not do justice to. However, I left aspects of the problem of old evidence open, rather than giving a complete way to think about it. New probabilistic structures. Bayesian Networks are one really nice way to capture the structure of probability distributions, making them much easier to reason about. Is there anything similar for the new, wider space of probabilistic reasoning which has been opened up? Unfortunately, I still don't have any of those things to offer. The aim of this post is more humble. I think what I originally wrote was too ambitious for didactic purposes. Where the previous post aimed to communicate the insights of logical induction by sketching broad implications, I here aim to communicate the insights in themselves, focusing on the detailed differences between classical Bayesian reasoning and the new space of ways to reason. Rather than talking about logical induction directly, I'm mainly going to explain things in terms of a very similar philosophy which Richard Jeffrey invented -- apparently starting with his phd dissertation in the 50s, although I'm unable to get my hands on it or other early references to see how fleshed-out the view was at that point. He called this philosophy radical probabilism. Unlike logical induction, radical probabilism appears not to have any roots in worries about logical uncertainty or bounded rationality. Instead it appears to be motivated simply by a desire to generalize, and a refusal to accept unjustified assumptions. Nonetheless, it carries most of the same insights. Radical Probabilism has not been very concerned with computational issues, and so constructing an actual algorithm (like the logical induction algorithm) has not been a focus. (However, there have been some developments -- see historical notes at the end.) This could be seen as a weakness. 
However, for the purpose of communicating the core insights, I think this is a strength -- there are fewer technical details to communicate. A terminological note: I will use "radical probabilism" to refer to the new theory of rationality (treating logical induction as merely a specific way to flesh out Jeffrey's theory). I'm more conflicted about how to refer to the older theory. I'm tempted to just use the term "Bayesian", implying that the new theory is non-Bayesian -- this highlights its rejection of Bayesian updates. However...

    What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs) by Andrew Critch

    Play Episode Listen Later Dec 10, 2021 38:45


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs), published by Andrew Critch on the AI Alignment Forum. With: Thomas Krendl Gilbert, who provided comments, interdisciplinary feedback, and input on the RAAP concept. Thanks also for comments from Ramana Kumar. Target audience: researchers and institutions who think about existential risk from artificial intelligence, especially AI researchers. Preceded by: Some AI research areas and their relevance to existential safety, which emphasized the value of thinking about multi-stakeholder/multi-agent social applications, but without concrete extinction scenarios. This post tells a few different stories in which humanity dies out as a result of AI technology, but where no single source of human or automated agency is the cause. Scenarios with multiple AI-enabled superpowers are often called “multipolar” scenarios in AI futurology jargon, as opposed to “unipolar” scenarios with just one superpower. (Unipolar take-offs vs. multipolar take-offs; slow take-offs: Part 1 of this post; fast take-offs: Part 2 of this post.) Part 1 covers a batch of stories that play out slowly (“slow take-offs”), and Part 2 stories play out quickly. However, in the end I don't want you to be super focused on how fast the technology is taking off. Instead, I'd like you to focus on multi-agent processes with a robust tendency to play out irrespective of which agents execute which steps in the process. I'll call such processes Robust Agent-Agnostic Processes (RAAPs). A group walking toward a restaurant is a nice example of a RAAP, because it exhibits: Robustness: If you temporarily distract one of the walkers to wander off, the rest of the group will keep heading toward the restaurant, and the distracted member will take steps to rejoin the group. Agent-agnosticism: Who's at the front or back of the group might vary considerably during the walk. People at the front will tend to take more responsibility for knowing and choosing what path to take, and people at the back will tend to just follow. Thus, the execution of roles (“leader”, “follower”) is somewhat agnostic as to which agents execute them. Interestingly, if all you want to do is get one person in the group not to go to the restaurant, sometimes it's actually easier to achieve that by convincing the entire group not to go there than by convincing just that one person. This example could be extended to lots of situations in which agents have settled on a fragile consensus for action, in which it is strategically easier to motivate a new interpretation of the prior consensus than to pressure one agent to deviate from it. I think a similar fact may be true about some agent-agnostic processes leading to AI x-risk, in that agent-specific interventions (e.g., aligning or shutting down this or that AI system or company) will not be enough to avert the process, and might even be harder than trying to shift the structure of society as a whole. Moreover, I believe this is true in both “slow take-off” and “fast take-off” AI development scenarios. This is because RAAPs can arise irrespective of the speed of the underlying “host” agents. RAAPs are made more or less likely to arise based on the “structure” of a given interaction.
As such, the problem of avoiding the emergence of unsafe RAAPs, or ensuring the emergence of safe ones, is a problem of mechanism design (wiki/Mechanism_design). I recently learned that in sociology, the concept of a field (martin2003field, fligsteinmcadam2012fields) is roughly defined as a social space or arena in which the motivation and behavior of agents are explained through reference to surrounding processes or “structure” rather than freedom or chance. In my parlance, mechanisms cause fields, and fields cause RAAPs. Meta / prefac...
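Here is a toy simulation of the walking-group example above (a hedged sketch with made-up dynamics, not from the post): the group still reaches the restaurant when one walker is distracted (robustness), and the "leader" role is assigned by position rather than by identity (agent-agnosticism).

```python
import random

random.seed(0)
RESTAURANT = 30.0   # target location on a 1-D street
walkers = {"alice": 0.0, "bob": 0.0, "carol": 0.0}
led_steps = {name: 0 for name in walkers}

for t in range(60):
    # Whoever happens to be in front takes the "leader" role this step.
    leader = max(walkers, key=walkers.get)
    led_steps[leader] += 1
    for name in walkers:
        if name == leader:
            # The leader knows the way and heads for the restaurant.
            walkers[name] = min(RESTAURANT, walkers[name] + random.uniform(0.5, 1.5))
        else:
            # Followers just close the gap to whoever is in front.
            gap = walkers[leader] - walkers[name]
            walkers[name] += 0.5 * gap + random.uniform(0.0, 1.0)
    if t == 20:
        walkers["bob"] -= 10.0   # distract one walker; the process keeps going

print({n: round(p, 1) for n, p in walkers.items()})  # everyone ends up near the restaurant
print(led_steps)  # how many steps each agent spent filling the "leader" role
```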

    Utility Maximization = Description Length Minimization by johnswentworth

    Play Episode Listen Later Dec 10, 2021 15:22


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Utility Maximization = Description Length Minimization, published by johnswentworth on the AI Alignment Forum. There's a useful intuitive notion of “optimization” as pushing the world into a small set of states, starting from any of a large number of states. Visually: Yudkowsky and Flint both have notable formalizations of this “optimization as compression” idea. This post presents a formalization of optimization-as-compression grounded in information theory. Specifically: to “optimize” a system is to reduce the number of bits required to represent the system state using a particular encoding. In other words, “optimizing” a system means making it compressible (in the information-theoretic sense) by a particular model. This formalization turns out to be equivalent to expected utility maximization, and allows us to interpret any expected utility maximizer as “trying to make the world look like a particular model”. Conceptual Example: Building A House Before diving into the formalism, we'll walk through a conceptual example, taken directly from Flint's Ground of Optimization: building a house. Here's Flint's diagram: The key idea here is that there's a wide variety of initial states (piles of lumber, etc) which all end up in the same target configuration set (finished house). The “perturbation” indicates that the initial state could change to some other state - e.g. someone could move all the lumber ten feet to the left - and we'd still end up with the house. In terms of information-theoretic compression: we could imagine a model which says there is probably a house. Efficiently encoding samples from this model will mean using shorter bit-strings for world-states with a house, and longer bit-strings for world-states without a house. World-states with piles of lumber will therefore generally require more bits than world-states with a house. By turning the piles of lumber into a house, we reduce the number of bits required to represent the world-state using this particular encoding/model. If that seems kind of trivial and obvious, then you've probably understood the idea; later sections will talk about how it ties into other things. If not, then the next section is probably for you. Background Concepts From Information Theory The basic motivating idea of information theory is that we can represent information using fewer bits, on average, if we use shorter representations for states which occur more often. For instance, Morse code uses only a single bit (“.”) to represent the letter “e”, but four bits (“- - . -”) to represent “q”. This creates a strong connection between probabilistic models/distributions and optimal codes: a code which requires minimal average bits for one distribution (e.g. with lots of e's and few q's) will not be optimal for another distribution (e.g. with few e's and lots of q's). For any random variable X generated by a probabilistic model M, we can compute the minimum average number of bits required to represent X. This is Shannon's famous entropy formula: −∑_X P[X|M] log P[X|M]. Assuming we're using an optimal encoding for model M, the number of bits used to encode a particular value x is −log P[X=x|M]. (Note that this is sometimes not an integer! Today we have algorithms which encode many samples at once, potentially even from different models/distributions, to achieve asymptotically minimal bit-usage.
The “rounding error” only happens once for the whole collection of samples, so as the number of samples grows, the rounding error per sample goes to zero.) Of course, we could be wrong about the distribution - we could use a code optimized for a model M2 which is different from the “true” model M1. In this case, the average number of bits used will be −∑_X P[X|M1] log P[X|M2] = E[−log P[X|M2] | M1]. In this post, we'll use a “wrong” model M2 intentio...
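A minimal numeric check of the two formulas above (my own sketch with made-up distributions): the average bits per sample equals the entropy when the code is optimized for the true model M1, and the larger cross-entropy when the code is optimized for a wrong model M2.

```python
import math

# Two models (distributions) over the same four world-states.
M1 = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}      # "true" model
M2 = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # "wrong" model

def avg_bits(true_model, code_model):
    """Average bits per sample: -sum_x P[x|true] * log2 P[x|code]."""
    return -sum(true_model[x] * math.log2(code_model[x]) for x in true_model)

entropy = avg_bits(M1, M1)   # optimal code for M1: the entropy of M1
cross   = avg_bits(M1, M2)   # code optimized for M2, data drawn from M1

print(f"entropy H(M1)          = {entropy:.3f} bits")
print(f"cross-entropy (M1, M2) = {cross:.3f} bits (never smaller than the entropy)")
```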

    Risks from Learned Optimization: Introduction by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant

    Play Episode Listen Later Dec 10, 2021 18:54


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Risks from Learned Optimization: Introduction, published by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant on the AI Alignment Forum. This is the first of five posts in the Risks from Learned Optimization Sequence based on the paper “Risks from Learned Optimization in Advanced Machine Learning Systems” by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. Each post in the sequence corresponds to a different section of the paper. Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, and Joar Skalse contributed equally to this sequence. With special thanks to Paul Christiano, Eric Drexler, Rob Bensinger, Jan Leike, Rohin Shah, William Saunders, Buck Shlegeris, David Dalrymple, Abram Demski, Stuart Armstrong, Linda Linsefors, Carl Shulman, Toby Ord, Kate Woolverton, and everyone else who provided feedback on earlier versions of this sequence. Motivation The goal of this sequence is to analyze the type of learned optimization that occurs when a learned model (such as a neural network) is itself an optimizer—a situation we refer to as mesa-optimization, a neologism we introduce in this sequence. We believe that the possibility of mesa-optimization raises two important questions for the safety and transparency of advanced machine learning systems. First, under what circumstances will learned models be optimizers, including when they should not be? Second, when a learned model is an optimizer, what will its objective be—how will it differ from the loss function it was trained under—and how can it be aligned? We believe that this sequence presents the most thorough analysis of these questions that has been conducted to date. In particular, we present not only an introduction to the basic concerns surrounding mesa-optimizers, but also an analysis of the particular aspects of an AI system that we believe are likely to make the problems related to mesa-optimization relatively easier or harder to solve. By providing a framework for understanding the degree to which different AI systems are likely to be robust to misaligned mesa-optimization, we hope to start a discussion about the best ways of structuring machine learning systems to solve these problems. Furthermore, in the fourth post we will provide what we think is the most detailed analysis yet of a problem we refer to as deceptive alignment, which we posit may present one of the largest—though not necessarily insurmountable—current obstacles to producing safe advanced machine learning systems using techniques similar to modern machine learning. Two questions In machine learning, we do not manually program each individual parameter of our models. Instead, we specify an objective function that captures what we want the system to do and a learning algorithm to optimize the system for that objective. In this post, we present a framework that distinguishes what a system is optimized to do (its “purpose”), from what it optimizes for (its “goal”), if it optimizes for anything at all. While all AI systems are optimized for something (have a purpose), whether they actually optimize for anything (pursue a goal) is non-trivial.
We will say that a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system. Learning algorithms in machine learning are optimizers because they search through a space of possible parameters—e.g. neural network weights—and improve the parameters with respect to some objective. Planning algorithms are also optimizers, since they search through possible plans, picking those that do well according to some objective. Whether a syste...
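A minimal sketch of the sense of "optimizer" defined above (my own toy setup, not from the sequence): a system that explicitly searches a space of candidate plans for elements that score highly under an internally represented objective.

```python
import itertools

# An explicitly represented objective: reach position 5 using few moves.
def objective(plan, start=0, target=5):
    position = start + sum(plan)
    return -abs(target - position) - 0.1 * len(plan)

# The search space: all plans of up to 6 unit moves.
MOVES = (-1, 0, 1)
candidate_plans = [
    plan
    for length in range(1, 7)
    for plan in itertools.product(MOVES, repeat=length)
]

# This system is an optimizer in the sequence's sense: it searches through
# a space of possible plans for one that scores high on its objective.
best_plan = max(candidate_plans, key=objective)
print(best_plan, objective(best_plan))
```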

    Matt Botvinick on the spontaneous emergence of learning algorithms by Adam Scholl

    Play Episode Listen Later Dec 10, 2021 7:11


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Matt Botvinick on the spontaneous emergence of learning algorithms, published by Adam Scholl on the AI Alignment Forum. Matt Botvinick is Director of Neuroscience Research at DeepMind. In this interview, he discusses results from a 2018 paper which describe conditions under which reinforcement learning algorithms will spontaneously give rise to separate full-fledged reinforcement learning algorithms that differ from the original. Here are some notes I gathered from the interview and paper: Initial Observation At some point, a group of DeepMind researchers in Botvinick's group noticed that when they trained an RNN using RL on a series of related tasks, the RNN itself instantiated a separate reinforcement learning algorithm. These researchers weren't trying to design a meta-learning algorithm—apparently, to their surprise, this just spontaneously happened. As Botvinick describes it, they started “with just one learning algorithm, and then another learning algorithm kind of... emerges, out of, like out of thin air”: "What happens... it seemed almost magical to us, when we first started realizing what was going on—the slow learning algorithm, which was just kind of adjusting the synaptic weights, those slow synaptic changes give rise to a network dynamics, and the dynamics themselves turn into a learning algorithm.” Other versions of this basic architecture—e.g., using slot-based memory instead of RNNs—seemed to produce the same basic phenomenon, which they termed "meta-RL." So they concluded that all that's needed for a system to give rise to meta-RL are three very general properties: the system must 1) have memory, 2) whose weights are trained by an RL algorithm, 3) on a sequence of similar input data. From Botvinick's description, it sounds to me like he thinks [learning algorithms that find/instantiate other learning algorithms] is a strong attractor in the space of possible learning algorithms: “...it's something that just happens. In a sense, you can't avoid this happening. If you have a system that has memory, and the function of that memory is shaped by reinforcement learning, and this system is trained on a series of interrelated tasks, this is going to happen. You can't stop it." Search for Biological Analogue This system reminded some of the neuroscientists in Botvinick's group of features observed in brains. For example, like RNNs, the human prefrontal cortex (PFC) is highly recurrent, and the RL and RNN memory systems in their meta-RL model reminded them of “synaptic memory” and “activity-based memory.” They decided to look for evidence of meta-RL occurring in brains, since finding a neural analogue of the technique would provide some evidence they were on the right track, i.e. that the technique might scale to solving highly complex tasks. They think they found one. In short, they think that part of the dopamine system (DA) is a full-fledged reinforcement learning algorithm, which trains/gives rise to another full-fledged, free-standing reinforcement learning algorithm in PFC, in basically the same way (and for the same reason) the RL-trained RNNs spawned separate learning algorithms in their experiments. As I understand it, their story goes as follows: The PFC, along with the bits of basal ganglia and thalamic nuclei it connects to, forms an RNN.
Its inputs are sensory percepts, and information about past actions and rewards. Its outputs are actions, and estimates of state value. DA[1] is an RL algorithm that feeds reward prediction error to PFC. Historically, people assumed the purpose of sending this prediction error was to update PFC's synaptic weights. Wang et al. agree that this happens, but argue that the principal purpose of sending prediction error is to cause the creation of “a second RL algorithm, implemented entirely in the prefrontal network's acti...
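As a structural sketch of the setup described above (hypothetical sizes and untrained weights of my own choosing): a recurrent policy whose inputs include the previous action and reward, run across a sequence of similar bandit tasks. These are the three ingredients the post says suffice for meta-RL; the slow outer loop that would actually train the weights with RL is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Condition 1: memory -- a recurrent hidden state carried across steps.
W_h, W_x, W_out = (rng.normal(size=s) * 0.1 for s in [(16, 16), (16, 4), (2, 16)])

def rnn_policy(obs, prev_action, prev_reward, h):
    """One step of the recurrent policy. Its inputs include the previous
    action and reward, which is what would let the hidden-state dynamics
    implement a fast 'inner' learning algorithm once the weights are trained."""
    x = np.array([obs, prev_action, prev_reward, 1.0])
    h = np.tanh(W_h @ h + W_x @ x)
    action = int(np.argmax(W_out @ h))   # pick one of two arms
    return action, h

# Condition 3: a sequence of similar tasks (two-armed bandits with different
# best arms). Condition 2, training W_h, W_x, W_out with an RL algorithm
# across these tasks, is the slow outer loop omitted from this sketch.
for task in range(3):
    best_arm = rng.integers(2)
    h, action, reward = np.zeros(16), 0, 0.0
    for step in range(5):
        action, h = rnn_policy(0.0, float(action), reward, h)
        reward = float(action == best_arm)   # payoff from this bandit
```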

    the scaling "inconsistency": openAI's new insight by nostalgebraist

    Play Episode Listen Later Dec 10, 2021 15:19


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: the scaling "inconsistency": openAI's new insight, published by nostalgebraist on the AI Alignment Forum. I've now read the new OpenAI scaling laws paper. Also, yesterday I attended a fun and informative lecture/discussion with one of the authors. While the topic is on my mind, I should probably jot down some of my thoughts. This post is mostly about what the new paper says about the “inconsistency” brought up in their previous paper. The new paper has a new argument on this topic, which is intuitive and appealing, and suggests that the current scaling trend will indeed “switch over” soon to a new one where dataset size, not model size, is the active constraint on performance. Most of this post is an attempt to explain and better understand this argument. The new paper is mainly about extending the scaling laws from their earlier paper to new modalities. In that paper, they found scaling laws for transformers trained autoregressively on text data. The new paper finds the same patterns in the scaling behavior of transformers trained autoregressively on images, math problems, etc. So the laws aren't telling us something about the distribution of text data, but about something more fundamental. That's cool. They also have a new, very intuitive hypothesis for what's going on with the “scaling inconsistency” they described in the previous paper – the one I made a big deal about at the time. So that's the part I'm most excited to discuss. I'm going to give a long explanation of it, way longer than the relevant part of their paper. Some of this is original to me, all errors are mine, all the usual caveats. 1. L(C) and L(D) To recap: the “inconsistency” is between two scaling laws: The law for the best you can do, given a fixed compute budget. This is L(C), sometimes called L(C_min). L is the loss (lower = better), C is your compute budget. The law for the best you can do, given a fixed dataset size. This is L(D), where D is the number of examples (say, tokens) in the dataset. Once you reach a certain level of compute, these two laws contradict each other. I'll take some time to unpack that here, as it's not immediately obvious the two can even be compared to one another – one is a function of compute, the other of data. 2. C sets E, and E bounds D Budget tradeoffs Given a compute budget C, you can derive the optimal way to spend it on different things. Roughly, you are trading off between two ways to spend compute: Use C to buy “N”: Training a bigger model – “N” here is model size Use C to buy “S”: Training for more steps “S” (gradient updates) The relationship between S (steps) and D (dataset size) is a little subtle, for several reasons. From step count to update count For one thing, each single “step” is an update on the information from more than one data point. Specifically, a step updates on “B” different points – B is the batch size. So the total number of data points processed during training is B times S. The papers sometimes call this quantity “E” (number of examples), so I'll call it that too. From update count to data count Now, when you train an ML model, you usually update on each data point more than once. Typically, you'll do one pass over the full dataset (updating on each point as you go along), then you'll go back and do a second full pass, and then a third, etc. 
These passes are called “epochs.” If you're doing things this way, then for every point in the data, you get (number of epochs) updates out of it. So E = (number of epochs) × D. Some training routines don't visit every point the exact same number of times – there's nothing forcing you to do that. Still, for any training procedure, we can look at the quantity E / D. This would be the number of epochs, if you're doing epochs. For a generic training routine, you can think of E / D as the “effecti...
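To keep the bookkeeping straight, here is the arithmetic from the last few paragraphs with made-up illustrative numbers (mine, not from the papers):

```python
# Toy bookkeeping for the quantities in the text (illustrative numbers only).
B = 512          # batch size: data points per gradient step
S = 200_000      # S: number of gradient steps
D = 40_000_000   # D: dataset size in examples (e.g. tokens)

E = B * S                    # E: total examples processed during training
effective_epochs = E / D     # how many times, on average, each point is seen

print(f"E = {E:,} examples processed")
print(f"E / D = {effective_epochs:.2f} effective epochs")
```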

    Introduction to Cartesian Frames by Scott Garrabrant

    Play Episode Listen Later Dec 10, 2021 28:04


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Introduction to Cartesian Frames, published by Scott Garrabrant on the AI Alignment Forum. This is the first post in a sequence on Cartesian frames, a new way of modeling agency that has recently shaped my thinking a lot. Traditional models of agency have some problems, like: They treat the "agent" and "environment" as primitives with a simple, stable input-output relation. (See "Embedded Agency.") They assume a particular way of carving up the world into variables, and don't allow for switching between different carvings or different levels of description. Cartesian frames are a way to add a first-person perspective (with choices, uncertainty, etc.) on top of a third-person "here is the set of all possible worlds," in such a way that many of these problems either disappear or become easier to address. The idea of Cartesian frames is that we take as our basic building block a binary function which combines a choice from the agent with a choice from the environment to produce a world history. We don't think of the agent as having inputs and outputs, and we don't assume that the agent is an object persisting over time. Instead, we only think about a set of possible choices of the agent, a set of possible environments, and a function that encodes what happens when we combine these two. This basic object is called a Cartesian frame. As with dualistic agents, we are given a way to separate out an “agent” from an “environment." But rather than being a basic feature of the world, this is a “frame” — a particular way of conceptually carving up the world. We will use the combinatorial properties of a given Cartesian frame to derive versions of inputs, outputs and time. One goal here is that by making these notions derived rather than basic, we can make them more amenable to approximation and thus less dependent on exactly how one draws the Cartesian boundary. Cartesian frames also make it much more natural to think about the world at multiple levels of description, and to model agents as having subagents. Mathematically, Cartesian frames are exactly Chu spaces. I give them a new name because of my specific interpretation about agency, which also highlights different mathematical questions. Using Chu spaces, we can express many different relationships between Cartesian frames. For example, given two agents, we could talk about their sum ( ⊕ ), which can choose from any of the choices available to either agent, or we could talk about their tensor ( ⊗ ), which can accomplish anything that the two agents could accomplish together as a team. Cartesian frames also have duals ( −∗ ) which you can get by swapping the agent with the environment, and ⊕ and ⊗ have De Morgan duals ( & and ⅋ respectively), which represent taking a sum or tensor of the environments. The category also has an internal hom, ⊸ , where C ⊸ D can be thought of as "D with a C-shaped hole in it." These operations are very directly analogous to those used in linear logic. 1. Definition Let W be a set of possible worlds.
A Cartesian frame C over W is a triple C = (A, E, ⋅), where A represents a set of possible ways the agent can be, E represents a set of possible ways the environment can be, and ⋅ : A × E → W is an evaluation function that returns a possible world given an element of A and an element of E. We will refer to A as the agent, the elements of A as possible agents, E as the environment, the elements of E as possible environments, W as the world, and elements of W as possible worlds. Definition: A Cartesian frame C over a set W is a triple (A, E, ⋅), where A and E are sets and ⋅ : A × E → W. If C = (A, E, ⋅) is a Cartesian frame over W, we say Agent(C) = A, Env(C) = E, World(C) = W, and Eval(C) = ⋅. A finite Cartesian frame is easily visualized as a matrix, where the rows of the matrix represent possible agents, the columns of the matr...
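A hedged sketch of the definition just given (my own toy encoding, not from the sequence): a Cartesian frame as a triple (A, E, ⋅) with an evaluation function, plus the matrix view for finite frames.

```python
from itertools import product

class CartesianFrame:
    """A Cartesian frame over W: a triple (A, E, eval_fn) with eval_fn: A x E -> W.

    Illustrative sketch only; names and example are made up.
    """
    def __init__(self, agent, env, eval_fn):
        self.agent = list(agent)   # possible agents A
        self.env = list(env)       # possible environments E
        self.eval_fn = eval_fn     # the evaluation function

    def world(self):
        """Worlds reachable by combining a possible agent with a possible environment."""
        return {self.eval_fn(a, e) for a, e in product(self.agent, self.env)}

    def matrix(self):
        """Finite frames visualize as a matrix: rows are agents, columns are environments."""
        return [[self.eval_fn(a, e) for e in self.env] for a in self.agent]

# Tiny example: the agent chooses whether to carry an umbrella,
# the environment chooses the weather; the world records both.
C = CartesianFrame(
    agent=["umbrella", "no umbrella"],
    env=["rain", "sun"],
    eval_fn=lambda a, e: f"{a}, {e}",
)
for row in C.matrix():
    print(row)
```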

    My research methodology by Paul Christiano

    Play Episode Listen Later Dec 10, 2021 23:57


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My research methodology, published by Paul Christiano on the AI Alignment Forum. (Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.) I really don't want my AI to strategically deceive me and resist my attempts to correct its behavior. Let's call an AI that does so egregiously misaligned (for the purpose of this post). Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up? But I feel like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I'm interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can't tell any plausible story about how they lead to egregious misalignment. This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it's possible, there are several ways in which it could actually be easier: We can potentially iterate much faster, since it's often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice. We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases. We can find algorithms that have a good chance of working in the future even if we don't know what AI will look like or how quickly it will advance, since we've been thinking about a very wide range of possible failure cases. I'd guess there's a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can't come up with a plausible story about how it leads to egregious misalignment. That's a high enough probability that I'm very excited to gamble on it. Moreover, if it fails I think we're likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable. What this looks like (3 examples) My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.” Example 1: human feedback In an unaligned benchmark I describe a simple AI training algorithm: Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions. We ask humans to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these human evaluations. At test time the AI searches for plans that lead to trajectories that look good to humans. In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment: Our generative model understands reality better than human evaluators. There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans. It's possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a human looking at predicted videos.
The fiction can look much better than the actual possible futures. So our planning process finds an action that covertly gathers resources and uses them to create a fiction. I don't know if or when this kind of reward hacking would happen — I think it's pretty likely eventually, but it's far from certain and it might take a long time. But from my perspective this failure mode is at least plausible — I don't see any contradictions between this sequence of events and anyth...
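    To make the training loop in Example 1 concrete, here is a toy sketch in Python. The class names, the nearest-neighbour reward model, and the random-search planner are illustrative stand-ins of my own, not anything from the post; the point is only the shape of the loop: a learned world model predicts futures, a reward model is fit to human evaluations of those predicted futures, and the planner searches for actions whose predicted future the reward model scores highly.

```python
import random

class WorldModel:
    """Stand-in generative model: maps an action sequence to a (fake) predicted observation."""
    def predict(self, actions):
        # A real model would roll out predicted camera frames; here we fake a scalar summary.
        return sum(actions) + random.gauss(0, 0.1)

class RewardModel:
    """Stand-in for a model trained to predict human evaluations of predicted futures."""
    def __init__(self):
        self.data = []
    def fit(self, observations, human_scores):
        self.data = list(zip(observations, human_scores))
    def score(self, observation):
        # 1-nearest-neighbour lookup as a crude stand-in for a learned regressor.
        return min(self.data, key=lambda pair: abs(pair[0] - observation))[1]

def human_evaluate(observation):
    # Placeholder for "humans evaluate possible futures by looking at predicted videos".
    return -abs(observation - 5.0)

def plan(world_model, reward_model, horizon=4, candidates=500):
    """Search for the action sequence whose predicted future looks best to the reward model."""
    best_actions, best_score = None, float("-inf")
    for _ in range(candidates):
        actions = [random.uniform(-1, 2) for _ in range(horizon)]
        score = reward_model.score(world_model.predict(actions))
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions

world_model, reward_model = WorldModel(), RewardModel()
futures = [world_model.predict([random.uniform(-1, 2) for _ in range(4)]) for _ in range(200)]
reward_model.fit(futures, [human_evaluate(o) for o in futures])
print("chosen plan:", [round(a, 2) for a in plan(world_model, reward_model)])
```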

    Fun with +12 OOMs of Compute by Daniel Kokotajlo

    Play Episode Listen Later Dec 10, 2021 29:34


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fun with +12 OOMs of Compute, published by Daniel Kokotajlo on the AI Alignment Forum. Or: Big Timelines Crux Operationalized What fun things could one build with +12 orders of magnitude of compute? By ‘fun' I mean ‘powerful.' This hypothetical is highly relevant to AI timelines, for reasons I'll explain later. Summary (Spoilers): I describe a hypothetical scenario that concretizes the question “what could be built with 2020's algorithms/ideas/etc. but a trillion times more compute?” Then I give some answers to that question. Then I ask: How likely is it that some sort of TAI would happen in this scenario? This second question is a useful operationalization of the (IMO) most important, most-commonly-discussed timelines crux: “Can we get TAI just by throwing more compute at the problem?” I consider this operationalization to be the main contribution of this post; it directly plugs into Ajeya's timelines model and is quantitatively more cruxy than anything else I know of. The secondary contribution of this post is my set of answers to the first question: They serve as intuition pumps for my answer to the second, which strongly supports my views on timelines. The hypothetical In 2016 the Compute Fairy visits Earth and bestows a blessing: Computers are magically 12 orders of magnitude faster! Over the next five years, what happens? The Deep Learning AI Boom still happens, only much crazier: Instead of making AlphaStar for 10^23 floating point operations, DeepMind makes something for 10^35. Instead of making GPT-3 for 10^23 FLOPs, OpenAI makes something for 10^35. Instead of industry and academia making a cornucopia of things for 10^20 FLOPs or so, they make a cornucopia of things for 10^32 FLOPs or so. When random grad students and hackers spin up neural nets on their laptops, they have a trillion times more compute to work with. [EDIT: Also assume magic +12 OOMs of memory, bandwidth, etc. All the ingredients of compute.] For context on how big a deal +12 OOMs is, consider the graph below, from ARK. It's measuring petaflop-days, which are about 10^20 FLOP each. So 10^35 FLOP is 1e+15 on this graph. GPT-3 and AlphaStar are not on this graph, but if they were they would be in the very top-right corner. Question One: In this hypothetical, what sorts of things could AI projects build? I encourage you to stop reading, set a five-minute timer, and think about fun things that could be built in this scenario. I'd love it if you wrote up your answers in the comments! My tentative answers: Below are my answers, listed in rough order of how ‘fun' they seem to me. I'm not an AI scientist so I expect my answers to overestimate what could be done in some ways, and underestimate in other ways. Imagine that each entry is the best version of itself, since it is built by experts (who have experience with smaller-scale versions) rather than by me. OmegaStar: In our timeline, it cost about 10^23 FLOP to train AlphaStar. (OpenAI Five, which is in some ways more impressive, took less!) Let's make OmegaStar like AlphaStar only +7 OOMs bigger: the size of a human brain.[1] [EDIT: You may be surprised to learn, as I was, that AlphaStar has about 10% as many parameters as a honeybee has synapses! Playing against it is like playing against a tiny game-playing insect.] 
Larger models seem to take less data to reach the same level of performance, so it would probably take at most 10^30 FLOP to reach the same level of Starcraft performance as AlphaStar, and indeed we should expect it to be qualitatively better.[2] So let's do that, but also train it on lots of other games too.[3] There are 30,000 games in the Steam Library. We train OmegaStar long enough that it has as much time on each game as AlphaStar had on Starcraft. With a brain so big, maybe it'll start to do some transfer learning, acquiring g...

    Seeking Power is Often Convergently Instrumental in MDPs by Alex Turner

    Play Episode Listen Later Dec 10, 2021 23:52


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Seeking Power is Often Convergently Instrumental in MDPs, published by Alex Turner on the AI Alignment Forum.

    The Solomonoff Prior is Malign by Mark Xu

    Play Episode Listen Later Dec 10, 2021 28:35


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Solomonoff Prior is Malign, published by Mark Xu on the AI Alignment Forum. Write a Review This argument came to my attention from this post by Paul Christiano. I also found this clarification helpful. I found these counter-arguments stimulating and have included some discussion of them. Very little of this content is original. My contributions consist of fleshing out arguments and constructing examples. Thank you to Beth Barnes and Thomas Kwa for helpful discussion and comments. What is the Solomonoff prior? The Solomonoff prior is intended to answer the question "what is the probability of X?" for any X, where X is a finite string over some finite alphabet. The Solomonoff prior is defined by taking the set of all Turing machines (TMs) which output strings when run with no input and weighting them proportional to 2^−K, where K is the description length of the TM (informally its size in bits). The Solomonoff prior says the probability of a string is the sum over all the weights of all TMs that print that string. One reason to care about the Solomonoff prior is that we can use it to do a form of idealized induction. If you have seen 0101 and want to predict the next bit, you can use the Solomonoff prior to get the probability of 01010 and 01011. Normalizing gives you the chances of seeing 1 versus 0, conditioned on seeing 0101. In general, any process that assigns probabilities to all strings in a consistent way can be used to do induction in this way. This post provides more information about Solomonoff Induction. Why is it malign? Imagine that you wrote a programming language called python^10 that works as follows: First, it takes all alpha-numeric chars that are not in literals and checks if they're repeated 10 times sequentially. If they're not, they get deleted. If they are, they get replaced by a single copy. Second, it runs this new program through a python interpreter. Hello world in python^10: pppppppppprrrrrrrrrriiiiiiiiiinnnnnnnnnntttttttttt('Hello, world!') Luckily, python has an exec function that executes literals as code. This lets us write a shorter hello world: eeeeeeeeeexxxxxxxxxxeeeeeeeeeecccccccccc("print('Hello, world!')") It's probably easy to see that for nearly every program, the shortest way to write it in python^10 is to write it in python and run it with exec. If we didn't have exec, for sufficiently complicated programs, the shortest way to write them would be to specify an interpreter for a different language in python^10 and write it in that language instead. As this example shows, the answer to "what's the shortest program that does X?" might involve using some roundabout method (in this case we used exec). If python^10 has some security properties that python didn't have, then the shortest program in python^10 that accomplished any given task would not have these security properties because they would all pass through exec. In general, if you can access alternative ‘modes’ (in this case python), the shortest programs that output any given string might go through one of those modes, possibly introducing malign behavior. Let's say that I'm trying to predict what a human types next using the Solomonoff prior. Many programs predict the human: Simulate the human and their local surroundings. Run the simulation forward and check what gets typed. Simulate the entire Earth. 
Run the simulation forward and check what that particular human types. Simulate the entire universe from the beginning of time. Run the simulation forward and check what that particular human types. Simulate an entirely different universe that has reason to simulate this universe. Output what the human types in the simulation of our universe. Which one is the simplest? One property of the Solomonoff prior is that it doesn't care about how long the TMs take to run, only ho...
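    Where the text above describes python^10, the transformation is simple enough to sketch. The following is my own toy rendering, assuming the rule is applied exactly as stated (a run of ten identical alphanumeric characters outside a string literal collapses to one copy; any other alphanumeric character outside a literal is deleted); the regex-based literal handling is a crude simplification.

```python
import re

def compile_python10(src: str) -> str:
    """Toy 'compiler' for the hypothetical python^10 language described above."""
    def transform(chunk: str) -> str:
        out, i = [], 0
        while i < len(chunk):
            c = chunk[i]
            if c.isalnum():
                j = i
                while j < len(chunk) and chunk[j] == c:
                    j += 1
                if j - i >= 10:        # a run of ten identical characters becomes one copy
                    out.append(c)
                i = j                  # shorter alphanumeric runs are deleted
            else:
                out.append(c)
                i += 1
        return "".join(out)

    # Split on simple string literals so their contents pass through untouched.
    parts = re.split(r"('[^']*'|\"[^\"]*\")", src)
    return "".join(part if idx % 2 else transform(part) for idx, part in enumerate(parts))

hello = "e" * 10 + "x" * 10 + "e" * 10 + "c" * 10 + '("print(' + "'Hello, world!'" + ')")'
print(compile_python10(hello))   # -> exec("print('Hello, world!')")
# exec(compile_python10(hello))  # would print: Hello, world!
```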

    2020 AI Alignment Literature Review and Charity Comparison by Larks

    Play Episode Listen Later Dec 10, 2021 132:57


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2020 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Write a Review cross-posted to the EA forum here. Introduction As in 2016, 2017, 2018, and 2019, I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments. My aim is basically to judge the output of each organisation in 2020 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2020 budgets to get a sense of urgency. I'd like to apologize in advance to everyone doing useful AI Safety work whose contributions I have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) other projects, 2) the miracle of life and 3) computer games. This article focuses on AI risk work. If you think other causes are important too, your priorities might differ. This particularly affects GCRI, FHI and CSER, who all do a lot of work on other issues which I attempt to cover but only very cursorily. How to read this document This document is fairly extensive, and some parts (particularly the methodology section) are largely the same as last year, so I don't recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. Papers listed as ‘X researchers contributed to the following research led by other organisations’ are included in the section corresponding to their first author and you can Ctrl+F to find them. If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories. Here are the un-scientifically-chosen hashtags: AgentFoundations Amplification Capabilities Corrigibility DecisionTheory Ethics Forecasting GPT-3 IRL Misc NearAI OtherXrisk Overview Politics RL Strategy Textbook Transparency ValueLearning New to Artificial Intelligence as an existential risk? If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper, or for a more technical version this by Richard Ngo. If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinhardt, as unlike this document Jacob covers pre-2019 research and organises by topic, not organisation, or this from Critch & Krueger, or this from Everitt et al, though it is a few years old now. Research Organisations FHI: The Future of Humanity Institute FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. 
They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here. Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work - as well as work on other Xrisks. They run a Research Scholars Program, where people can join them to do research at FHI. There is...

    Inner Alignment: Explain like I'm 12 Edition by Rafael Harth

    Play Episode Listen Later Dec 10, 2021 19:56


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Alignment: Explain like I'm 12 Edition, published by Rafael Harth on the AI Alignment Forum. (This is an unofficial explanation of Inner Alignment based on the Miri paper Risks from Learned Optimization in Advanced Machine Learning Systems (which is almost identical to the LW sequence) and the Future of Life podcast with Evan Hubinger (Miri/LW). It's meant for anyone who found the sequence too long/challenging/technical to read.) Note that bold and italics means "this is a new term I'm introducing," whereas underline and italics is used for emphasis. What is Inner Alignment? Let's start with an abridged guide to how Machine Learning works: Choose a problem Decide on a space of possible solutions Find a good solution from that space If the problem is "find a tool that can look at any image and decide whether or not it contains a cat," then each conceivable set of rules for answering this question (formally, each function from the set of all pixels to the set {yes, no}) defines one solution. We call each such solution a model. The space of possible models is depicted below. Since that's all possible models, most of them are utter nonsense. Pick a random one, and you're as likely to end up with a car-recognizer as a cat-recognizer – but far more likely with an algorithm that does nothing we can interpret. Note that even the examples I annotated aren't typical – most models would be more complex while still doing nothing related to cats. Nonetheless, somewhere in there is a model that would do a decent job on our problem. In the above, that's the one that says, "I look for cats." How does ML find such a model? One way that does not work is trying out all of them. That's because the space is too large: it might contain over 10^1,000,000 candidates. Instead, there's this thing called Stochastic Gradient Descent (SGD). Here's how it works: SGD begins with some (probably terrible) model and then proceeds in steps. In each step, it switches to another model that is "close" and hopefully a little better. Eventually, it stops and outputs the most recent model.[1] Note that, in the example above, we don't end up with the perfect cat-recognizer (the red box) but with something close to it – perhaps a model that looks for cats but has some unintended quirks. SGD generally does not guarantee optimality. The speech bubbles where the models explain what they're doing are annotations for the reader. From the perspective of the programmer, it looks like this: The programmer has no idea what the models are doing. Each model is just a black box.[2] A necessary component for SGD is the ability to measure a model's performance, but this happens while treating them as black boxes. In the cat example, assume the programmer has a bunch of images that are accurately labeled as "contains cat" and "doesn't contain cat." (These images are called the training data and the setting is called supervised learning.) SGD tests how well each model does on these images and, in each step, chooses one that does better. In other settings, performance might be measured in different ways, but the principle remains the same. Now, suppose that the images we have happen to include only white cats. In this case, SGD might choose a model implementing the rule "output yes if there is something white and with four legs." 
The programmer would not notice anything strange – all she sees is that the model output by SGD does well on the training data. In this setting, there is thus only a problem if our way of obtaining feedback is flawed. If it is perfect – if the pictures with cats are perfectly representative of what images-with-cats are like, and the pictures without cats are perfectly representative of what images-without-cats are like, then there isn't an issue. Conversely, if our images-with-cats are non-repres...
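    As a concrete (and deliberately tiny) illustration of the search-over-black-boxes picture above, here is a hill-climbing stand-in for SGD over a toy model space, trained on data that only contains white cats. Everything here — the two features, the linear rule, the acceptance criterion — is my own invention for illustration, not something from the post.

```python
import random

# Each example: (is_white, has_four_legs, is_cat)
train = [(1, 1, 1), (1, 1, 1), (0, 1, 0), (1, 0, 0)]   # only white cats are ever seen
test = [(0, 1, 1), (1, 1, 1), (0, 0, 0)]               # includes a black cat

def predict(model, x):
    w_white, w_legs, bias = model
    return 1 if w_white * x[0] + w_legs * x[1] + bias > 0 else 0

def accuracy(model, data):
    # Black-box evaluation of a candidate model on labelled images.
    return sum(predict(model, x) == x[2] for x in data) / len(data)

def sgd_like_search(steps=2000):
    model = (0.0, 0.0, 0.0)
    for _ in range(steps):
        # Move to a nearby candidate model whenever it scores at least as well on the training data.
        candidate = tuple(w + random.uniform(-0.5, 0.5) for w in model)
        if accuracy(candidate, train) >= accuracy(model, train):
            model = candidate
    return model

model = sgd_like_search()
print("train accuracy:", accuracy(model, train))  # typically 1.0
print("test accuracy:", accuracy(model, test))    # typically 2/3: the black cat looks like the training dog here
```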

    Evolution of Modularity by johnswentworth

    Play Episode Listen Later Dec 10, 2021 3:57


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evolution of Modularity, published by johnswentworth on the AI Alignment Forum. Write a Review This post is based on chapter 15 of Uri Alon's book An Introduction to Systems Biology: Design Principles of Biological Circuits. See the book for more details and citations; see here for a review of most of the rest of the book. Fun fact: biological systems are highly modular, at multiple different scales. This can be quantified and verified statistically, e.g. by mapping out protein networks and algorithmically partitioning them into parts, then comparing the connectivity of the parts. It can also be seen more qualitatively in everyday biological work: proteins have subunits which retain their function when fused to other proteins, receptor circuits can be swapped out to make bacteria follow different chemical gradients, manipulating specific genes can turn a fly's antennae into legs, organs perform specific functions, etc, etc. On the other hand, systems designed by genetic algorithms (aka simulated evolution) are decidedly not modular. This can also be quantified and verified statistically. Qualitatively, examining the outputs of genetic algorithms confirms the statistics: they're a mess. So: what is the difference between real-world biological evolution vs typical genetic algorithms, which leads one to produce modular designs and the other to produce non-modular designs? Kashtan & Alon tackle the problem by evolving logic circuits under various conditions. They confirm that simply optimizing the circuit to compute a particular function, with random inputs used for selection, results in highly non-modular circuits. However, they are able to obtain modular circuits using “modularly varying goals” (MVG). The idea is to change the reward function every so often (the authors switch it out every 20 generations). Of course, if we just use completely random reward functions, then evolution doesn't learn anything. Instead, we use “modularly varying” goal functions: we only swap one or two little pieces in the (modular) objective function. An example from the book: The upshot is that our different goal functions generally use similar sub-functions - suggesting that they share sub-goals for evolution to learn. Sure enough, circuits evolved using MVG have modular structure, reflecting the modular structure of the goals. (Interestingly, MVG also dramatically accelerates evolution - circuits reach a given performance level much faster under MVG than under a fixed goal, despite needing to change behavior every 20 generations. See either the book or the paper for more on that.) How realistic is MVG as a model for biological evolution? I haven't seen quantitative evidence, but qualitative evidence is easy to spot. MVG as a theory of biological modularity predicts that highly variable subgoals will result in modular structure, whereas static subgoals will result in a non-modular mess. Alon's book gives several examples: Chemotaxis: different bacteria need to pursue/avoid different chemicals, with different computational needs and different speed/energy trade-offs, in various combinations. The result is modularity: separate components for sensing, processing and motion. Animals need to breathe, eat, move, and reproduce. 
A new environment might have different food or require different motions, independent of respiration or reproduction - or vice versa. Since these requirements vary more-or-less independently in the environment, animals evolve modular systems to deal with them: digestive tract, lungs, etc. Ribosomes, as an anti-example: the functional requirements of a ribosome hardly vary at all, so they end up non-modular. They have pieces, but most pieces do not have an obvious distinct function. To sum it up: modularity in the system evolves to match modularity in the environment. Than...
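    The MVG setup lends itself to a very small sketch. This is a schematic of my own (a bitstring genetic algorithm, not Kashtan & Alon's logic-circuit experiments): the only load-bearing detail is that the goal is swapped every 20 generations between two variants built from the same shared sub-goal.

```python
import random

GENOME_LEN, POP, GENERATIONS = 20, 60, 200

def sub_goal_ones(bits):
    # Shared sub-goal: fraction of ones in a segment.
    return sum(bits) / len(bits)

def goal(bits, variant):
    first, second = bits[:10], bits[10:]
    if variant == 0:   # variant A: ones in the first half, zeros in the second
        return sub_goal_ones(first) + (1 - sub_goal_ones(second))
    return (1 - sub_goal_ones(first)) + sub_goal_ones(second)   # variant B: the reverse

def mutate(bits):
    i = random.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP)]
for gen in range(GENERATIONS):
    variant = (gen // 20) % 2   # "modularly varying goals": swap the goal every 20 generations
    scored = sorted(population, key=lambda b: goal(b, variant), reverse=True)
    parents = scored[: POP // 2]   # truncation selection
    population = parents + [mutate(random.choice(parents)) for _ in range(POP - len(parents))]

print("best fitness on the current goal:", max(goal(b, variant) for b in population))
```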

    MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models" by Rob Bensinger

    Play Episode Listen Later Dec 10, 2021 39:40


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: MIRI comments on Cotra's "Case for Aligning Narrowly Superhuman Models", published by Rob Bensinger on the AI Alignment Forum. Below, I've copied comments left by MIRI researchers Eliezer Yudkowsky and Evan Hubinger on March 1–3 on a draft of Ajeya Cotra's "Case for Aligning Narrowly Superhuman Models." I've included back-and-forths with Cotra, and interjections by me and Rohin Shah. The section divisions below correspond to the sections in Cotra's post. 0. Introduction How can we train GPT-3 to give “the best health advice it can give” using demonstrations and/or feedback from humans who may in some sense “understand less” about what to do when you're sick than GPT-3 does? Eliezer Yudkowsky: I've had some related conversations with Nick Beckstead. I'd be hopeful about this line of work primarily because I think it points to a bigger problem with the inscrutable matrices of floating-point numbers, namely, we have no idea what the hell GPT-3 is thinking and cannot tell it to think anything else. GPT-3 has a great store of medical knowledge, but we do not know where that medical knowledge is; we do not know how to tell it to internally apply its medical knowledge rather than applying other cognitive patterns it has stored. If this is still the state of opacity of AGI come superhuman capabilities, we are all immediately dead. So I would be relatively more hopeful about any avenue of attack for this problem that used anything other than an end-to-end black box - anything that started to address, "Well, this system clearly has a bunch of medical knowledge internally, can we find that knowledge and cause it to actually be applied" rather than "What external forces can we apply to this solid black box to make it think more about healthcare?" Evan Hubinger: +1 I continue to think that language model transparency research is the single most valuable current research direction within the class of standard ML research, for similar reasons to what Eliezer said above. Ajeya Cotra: Thanks! I'm also excited about language model transparency, and would love to find ways to make it more tractable as a research statement / organizing question for a field. I'm not personally excited about the connotations of transparency because it evokes the neuroscience-y interpretability tools, which don't feel scalable to situations when we don't get the concepts the model is using, and I'm very interested in finding slogans to keep researchers focused on the superhuman stuff. Ajeya Cotra: I've edited the description of the challenge to emphasize human feedback less. It now reads "How can we get GPT-3 to give “the best health advice it can give” when humans in some sense “understand less” about what to do when you're sick than GPT-3 does? And in that regime, how can we even tell/verify that it's “doing the best it can”?" Rob Bensinger: Nate and I tend to talk about "understandability" instead of "transparency" exactly because we don't want to sound like we're talking about normal ML transparency work. Eliezer Yudkowsky: Other possible synonyms: Clarity, legibility, cognitive readability. 
Ajeya Cotra: Thanks all -- I like the project of trying to come up with a good handle for the kind of language model transparency we're excited about (and have talked to Nick, Evan, etc about it too) but I think I don't want to push it in this blog post right now because I haven't hit on something I believe in and I want to ship this. In the end, we probably want to find ways to meaningfully supervise (or justifiably trust) models that are more capable than ~all humans in ~all domains. Eliezer Yudkowsky: (I think you want an AGI that is superhuman in engineering domains and infrahuman in human-modeling-and-manipulation if such a thing is at all possible.) Ajeya Cotra: Fair point, added a footnote: “Though if we cou...

    EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised by gwern

    Play Episode Listen Later Dec 10, 2021 2:21


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised , published by gwern on the AI Alignment Forum. This is a linkpost for "Mastering Atari Games with Limited Data", Ye et al 2021: Reinforcement learning has achieved great success in many applications. However, sample efficiency remains a key challenge, with prominent methods requiring millions (or even billions) of environment steps to train. Recently, there has been significant progress in sample efficient image-based RL algorithms; however, consistent human-level performance on the Atari game benchmark remains an elusive goal. We propose a sample efficient model-based visual RL algorithm built on MuZero, which we name EfficientZero. Our method achieves 190.4% mean human performance and 116.0% median performance on the Atari 100k benchmark with only two hours of real-time game experience and outperforms the state SAC in some tasks on the DMControl 100k benchmark. This is the first time an algorithm achieves super-human performance on Atari games with such little data. EfficientZero's performance is also close to DQN's performance at 200 million frames while we consume 500 times less data. EfficientZero's low sample complexity and high performance can bring RL closer to real-world applicability. We implement our algorithm in an easy-to-understand manner and it is available at this https URL. We hope it will accelerate the research of MCTS-based RL algorithms in the wider community. This work is supported by the Ministry of Science and Technology of the People's Republic of China, the 2030 Innovation Megaprojects “Program on New Generation Artificial Intelligence” (Grant No. 2021AAA0150000). Some have said that poor sample-efficiency on ALE has been a reason to downplay DRL progress or implications. The primary boost in EfficientZero (table 3), pushing it past the human benchmark, is some simple self-supervised learning (SimSiam on predicted vs actual observations). Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
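    For readers wondering what “SimSiam on predicted vs actual observations” amounts to, here is a schematic of the consistency objective in numpy: compare the latent state the dynamics model predicts with the latent encoded from the real next observation, using negative cosine similarity with a stop-gradient on the target branch. This is my own toy rendering of the idea, not the paper's implementation.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def consistency_loss(predicted_latent, target_latent):
    target = target_latent.copy()   # stand-in for stop-gradient: treat the target branch as constant
    return -cosine(predicted_latent, target)

rng = np.random.default_rng(0)
pred = rng.normal(size=64)                  # latent predicted by the learned dynamics model
actual = pred + 0.1 * rng.normal(size=64)   # latent encoded from the actual next observation
print("consistency loss:", consistency_loss(pred, actual))  # close to -1 when the two agree
```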

    Understanding “Deep Double Descent” by Evan Hubinger

    Play Episode Listen Later Dec 10, 2021 9:04


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding “Deep Double Descent”, published by Evan Hubinger on the AI Alignment Forum. If you're not familiar with the double descent phenomenon, I think you should be. I consider double descent to be one of the most interesting and surprising recent results in analyzing and understanding modern machine learning. Today, Preetum et al. released a new paper, “Deep Double Descent,” which I think is a big further advancement in our understanding of this phenomenon. I'd highly recommend at least reading the summary of the paper on the OpenAI blog. However, I will also try to summarize the paper here, as well as give a history of the literature on double descent and some of my personal thoughts. Prior work The double descent phenomenon was first discovered by Mikhail Belkin et al., who were confused by the phenomenon wherein modern ML practitioners would claim that “bigger models are always better” despite standard statistical machine learning theory predicting that bigger models should be more prone to overfitting. Belkin et al. discovered that the standard bias-variance tradeoff picture actually breaks down once you hit approximately zero training error—what Belkin et al. call the “interpolation threshold.” Before the interpolation threshold, the bias-variance tradeoff holds and increasing model complexity leads to overfitting, increasing test error. After the interpolation threshold, however, they found that test error actually starts to go down as you keep increasing model complexity! Belkin et al. demonstrated this phenomenon in simple ML methods such as decision trees as well as simple neural networks trained on MNIST. Here's the diagram that Belkin et al. use in their paper to describe this phenomenon: Belkin et al. describe their hypothesis for what's happening as follows: All of the learned predictors to the right of the interpolation threshold fit the training data perfectly and have zero empirical risk. So why should some—in particular, those from richer functions classes—have lower test risk than others? The answer is that the capacity of the function class does not necessarily reflect how well the predictor matches the inductive bias appropriate for the problem at hand. [The inductive bias] is a form of Occam's razor: the simplest explanation compatible with the observations should be preferred. By considering larger function classes, which contain more candidate predictors compatible with the data, we are able to find interpolating functions that [are] “simpler”. Thus increasing function class capacity improves performance of classifiers. I think that what this is saying is pretty magical: in the case of neural nets, it's saying that SGD just so happens to have the right inductive biases that letting SGD choose which model it wants the most out of a large class of models with the same training performance yields significantly better test performance. If you're right on the interpolation threshold, you're effectively “forcing” SGD to choose from a very small set of models with perfect training accuracy (maybe only one realistic option), thus ignoring SGD's inductive biases completely—whereas if you're past the interpolation threshold, you're letting SGD choose which of many models with perfect training accuracy it prefers, thus allowing SGD's inductive bias to shine through. 
I think this is strong evidence for the critical importance of implicit simplicity and speed priors in making modern ML work. However, such biases also produce strong incentives for mesa-optimization (since optimizers are simple, compressed policies) and pseudo-alignment (since simplicity and speed penalties will favor simpler, faster proxies). Furthermore, the arguments for the universal prior and minimal circuits being malign suggest that such strong simplicity and speed priors could...
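    A minimal way to see the phenomenon for yourself, assuming only numpy: random-feature regression with minimum-norm least squares. This is a standard toy demonstration in the spirit of Belkin et al., not the Preetum et al. setup; test error typically spikes near the interpolation threshold (here, around 20 features for 20 training points) and then falls again for much wider models.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train = 20
x_train = rng.uniform(-1, 1, n_train)
x_test = rng.uniform(-1, 1, 200)
target = lambda x: np.sin(2 * np.pi * x)
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)

for n_features in (5, 10, 20, 40, 200, 1000):
    w = rng.normal(scale=5.0, size=n_features)
    b = rng.uniform(0, 2 * np.pi, size=n_features)
    features = lambda x: np.cos(np.outer(x, w) + b)   # random Fourier-style features
    # lstsq returns the minimum-norm fit once the system is underdetermined (features > points).
    coef, *_ = np.linalg.lstsq(features(x_train), y_train, rcond=None)
    test_mse = np.mean((features(x_test) @ coef - target(x_test)) ** 2)
    print(f"{n_features:5d} features -> test MSE {test_mse:8.3f}")
```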

    Can you control the past? by Joe Carlsmith

    Play Episode Listen Later Dec 10, 2021 78:05


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Can you control the past?, published by Joe Carlsmith on the AI Alignment Forum. (Cross-posted from Hands and Cities. Lots of stuff familiar to LessWrong folks interested in decision theory.) I think that you can “control” events you have no causal interaction with, including events in the past, and that this is a wild and disorienting fact, with uncertain but possibly significant implications. This post attempts to impart such disorientation. My main example is a prisoner's dilemma between perfect deterministic software twins, exposed to the exact same inputs. This example shows, I think, that you can write on whiteboards light-years away, with no delays; you can move the arm of another person, in another room, just by moving your own. This, I claim, is extremely weird. My topic, more broadly, is the implications of this weirdness for the theory of instrumental rationality (“decision theory”). Many philosophers, and many parts of common sense, favor causal decision theory (CDT), on which, roughly, you should pick the action that causes the best outcomes in expectation. I think that deterministic twins, along with other examples, show that CDT is wrong. And I don't think that uncertainty about “who are you,” or “where your algorithm is,” can save it. Granted that CDT is wrong, though, I'm not sure what's right. The most famous alternative is evidential decision theory (EDT), on which, roughly, you should choose the action you would be happiest to learn you had chosen. I think that EDT is more attractive (and more confusing) than many philosophers give it credit for, and that some putative counterexamples don't withstand scrutiny. But EDT has problems, too. In particular, I suspect that attractive versions of EDT (and perhaps, attractive attempts to recapture the spirit of CDT) require something in the vicinity of “following the policy that you would've wanted yourself to commit to, from some epistemic position that ‘forgets’ information you now know.” I don't think that the most immediate objection to this – namely, that it implies choosing lower pay-offs even when you know them with certainty – is decisive (though some debates in this vicinity seem to me verbal). But it also seems extremely unclear what epistemic position you should evaluate policies from, and what policy such a position actually implies. Overall, rejecting the common-sense comforts of CDT, and accepting the possibility of some kind of “acausal control,” leaves us in strange and uncertain territory. I think we should do it anyway. But we should also tread carefully. I. Grandpappy Omega Decision theorists often assume that instrumental rationality is about maximizing expected utility in some sense. The question is: what sense? The most famous debate is between CDT and EDT. CDT chooses the action that will have the best effects. EDT chooses the action whose performance would be the best news. More specifically: CDT and EDT disagree about the type of “if” to use when evaluating the utility to expect, if you do X. CDT uses a counterfactual type of “if” — one that holds fixed the probability of everything outside of action X's causal influence, then plays out the consequences of doing X. In this sense, it doesn't allow your choice to serve as “evidence” about anything you can't cause — even when your choice is such evidence. 
EDT, by contrast, uses a conditional “if.” That is, to evaluate X, it updates your overall picture of the world to reflect the assumption that action X has been performed, and then sees how good the world looks in expectation. In this sense, it takes all the evidence into account, including the evidence that your having done X would provide. To see what this difference looks like in action, consider: Newcomb's problem: You face two boxes: a transparent box, containing a tho...
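    To make the CDT/EDT contrast concrete, here is a worked toy calculation for the Newcomb case being set up above, with numbers of my own choosing (a 99%-accurate predictor, $1,000 in the transparent box, $1,000,000 in the opaque box if one-boxing was predicted):

```python
ACCURACY, SMALL, BIG = 0.99, 1_000, 1_000_000

# EDT: condition on your action; your choice is evidence about what was predicted.
edt_one_box = ACCURACY * BIG
edt_two_box = ACCURACY * SMALL + (1 - ACCURACY) * (BIG + SMALL)

# CDT: hold the already-made prediction fixed; for any fixed prediction,
# taking both boxes is worth exactly $1,000 more than one-boxing.
p_predicted_one_box = 0.5   # CDT's verdict does not depend on this prior
cdt_one_box = p_predicted_one_box * BIG
cdt_two_box = p_predicted_one_box * (BIG + SMALL) + (1 - p_predicted_one_box) * SMALL

print(f"EDT: one-box {edt_one_box:,.0f} vs two-box {edt_two_box:,.0f}")   # EDT one-boxes
print(f"CDT: one-box {cdt_one_box:,.0f} vs two-box {cdt_two_box:,.0f}")   # CDT two-boxes
```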

    Developmental Stages of GPTs by orthonormal

    Play Episode Listen Later Dec 10, 2021 11:13


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Developmental Stages of GPTs, published by orthonormal on the AI Alignment Forum. Epistemic Status: I only know as much as anyone else in my reference class (I build ML models, I can grok the GPT papers, and I don't work for OpenAI or a similar lab). But I think my thesis is original. Related: Gwern on GPT-3 For the last several years, I've gone around saying that I'm worried about transformative AI, an AI capable of making an Industrial Revolution sized impact (the concept is agnostic on whether it has to be AGI or self-improving), because I think we might be one or two cognitive breakthroughs away from building one. GPT-3 has made me move up my timelines, because it makes me think we might need zero more cognitive breakthroughs, just more refinement / efficiency / computing power: basically, GPT-6 or GPT-7 might do it. My reason for thinking this is comparing GPT-3 to GPT-2, and reflecting on what the differences say about the "missing pieces" for transformative AI. My Thesis: The difference between GPT-2 and GPT-3 has made me suspect that there's a legitimate comparison to be made between the scale of a network architecture like the GPTs, and some analogue of "developmental stages" of the resulting network. Furthermore, it's plausible to me that the functions needed to be a transformative AI are covered by a moderate number of such developmental stages, without requiring additional structure. Thus GPT-N would be a transformative AI, for some not-too-large N, and we need to redouble our efforts on ways to align such AIs. The thesis doesn't strongly imply that we'll reach transformative AI via GPT-N especially soon; I have wide uncertainty, even given the thesis, about how large we should expect N to be, and whether the scaling of training and of computation slows down progress before then. But it's also plausible to me now that the timeline is only a few years, and that no fundamentally different approach will succeed before then. And that scares me. Architecture and Scaling GPT, GPT-2, and GPT-3 use nearly the same architecture; each paper says as much, with a sentence or two about minor improvements to the individual transformers. Model size (and the amount of training computation) is really the only difference. GPT took 1 petaflop/s-day to train 117M parameters, GPT-2 took 10 petaflop/s-days to train 1.5B parameters, and the largest version of GPT-3 took 3,000 petaflop/s-days to train 175B parameters. By contrast, AlphaStar seems to have taken about 30,000 petaflop/s-days of training in mid-2019, so the pace of AI research computing power projects that there should be about 10x that today. The upshot is that OpenAI may not be able to afford it, but if Google really wanted to make GPT-4 this year, they could afford to do so. Analogues to Developmental Stages There are all sorts of (more or less well-defined) developmental stages for human beings: image tracking, object permanence, vocabulary and grammar, theory of mind, size and volume, emotional awareness, executive functioning, et cetera. I was first reminded of developmental stages a few years ago, when I saw the layers of abstraction generated in this feature visualization tool for GoogLeNet. We don't have feature visualization for language models, but we do have generative outputs. And as you scale up an architecture like GPT, you see higher levels of abstraction. 
Grammar gets mastered, then content (removing absurd but grammatical responses), then tone (first rough genre, then spookily accurate authorial voice). Topic coherence is mastered first on the phrase level, then the sentence level, then the paragraph level. So too with narrative flow. Gwern's poetry experiments (GPT-2, GPT-3) are good examples. GPT-2 could more or less continue the meter of a poem and use words that fit the existing theme, but even...
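    For reference, converting the training budgets quoted above from petaflop/s-days into total floating point operations is a one-liner (1 petaflop/s-day = 1e15 FLOP/s × 86,400 s ≈ 8.64e19 FLOP); the figures below just restate the numbers given in the text.

```python
PFS_DAY = 1e15 * 86_400   # FLOP in one petaflop/s-day

for name, pfs_days, params in [("GPT", 1, 117e6), ("GPT-2", 10, 1.5e9), ("GPT-3", 3_000, 175e9)]:
    print(f"{name:6s} ~ {pfs_days * PFS_DAY:.1e} FLOP for {params:.2e} parameters")
print(f"AlphaStar ~ {30_000 * PFS_DAY:.1e} FLOP (mid-2019 estimate quoted above)")
```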

    My computational framework for the brain by Steve Byrnes

    Play Episode Listen Later Dec 10, 2021 23:52


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: My computational framework for the brain, published by Steve Byrnes on the AI Alignment Forum. By now I've written a bunch of blog posts on brain architecture and algorithms, not in any particular order and generally interspersed with long digressions into Artificial General Intelligence. Here I want to summarize my key ideas in one place, to create a slightly better entry point, and something I can refer back to in certain future posts that I'm planning. If you've read every single one of my previous posts (hi mom!), there's not much new here. In this post, I'm trying to paint a picture. I'm not really trying to justify it, let alone prove it. The justification ultimately has to be: All the pieces are biologically, computationally, and evolutionarily plausible, and the pieces work together to explain absolutely everything known about human psychology and neuroscience. (I believe it! Try me!) Needless to say, I could be wrong in both the big picture and the details (or missing big things). If so, writing this out will hopefully make my wrongness easier to discover! Pretty much everything I say here and its opposite can be found in the cognitive neuroscience literature. (It's a controversial field!) I make no pretense to originality (with one exception noted below), but can't be bothered to put in actual references. My previous posts have a bit more background, or just ask me if you're interested. :-P So let's start in on the 7 guiding principles for how I think about the brain: 1. Two subsystems: "Neocortex" and "Subcortex" This is the starting point. I think it's absolutely critical. The brain consists of two subsystems. The neocortex is the home of "human intelligence" as we would recognize it—our beliefs, goals, ability to plan and learn and understand, every aspect of our conscious awareness, etc. etc. (All mammals have a neocortex; birds and lizards have an homologous and functionally-equivalent structure called the "pallium".) Some other parts of the brain (hippocampus, parts of the thalamus & basal ganglia & cerebellum—see further discussion here) help the neocortex do its calculations, and I lump them into the "neocortex subsystem". I'll use the term subcortex for the rest of the brain (brainstem, hypothalamus, etc.). Aside: Is this the triune brain theory? No. Triune brain theory is, from what I gather, a collection of ideas about brain evolution and function, most of which are wrong. One aspect of triune brain theory is putting a lot of emphasis on the distinction between neocortical calculations and subcortical calculations. I like that part. I'm keeping that part, and I'm improving it by expanding the neocortex club to also include the thalamus, hippocampus, lizard pallium, etc., and then I'm ignoring everything else about triune brain theory. 2. Cortical uniformity I claim that the neocortex is, to a first approximation, architecturally uniform, i.e. all parts of it are running the same generic learning algorithm in a massively-parallelized way. The two caveats to cortical uniformity (spelled out in more detail at that link) are: There are sorta "hyperparameters" on the generic learning algorithm which are set differently in different parts of the neocortex—for example, different regions have different densities of each neuron type, different thresholds for making new connections (which also depend on age), etc. 
This is not at all surprising; all learning algorithms inevitably have tradeoffs whose optimal settings depend on the domain that they're learning (no free lunch). As one of many examples of how even "generic" learning algorithms benefit from domain-specific hyperparameters, if you've seen a pattern "A then B then C" recur 10 times in a row, you will start unconsciously expecting AB to be followed by C. But "should" you expect AB to be followed b...

    Redwood Research's current project by Buck Shlegeris

    Play Episode Listen Later Dec 10, 2021 22:40


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Redwood Research's current project, published by Buck Shlegeris on the AI Alignment Forum. Here's a description of the project Redwood Research is working on at the moment. First I'll say roughly what we're doing, and then I'll try to explain why I think this is a reasonable applied alignment project, and then I'll talk a bit about the takeaways I've had from the project so far. There are a bunch of parts of this that we're unsure of and figuring out as we go; I'll try to highlight our most important confusions as they come up. I've mentioned a bunch of kind of in-the-weeds details because I think they add flavor. This is definitely just me describing a work in progress, rather than presenting any results. Thanks to everyone who's contributed to the project so far: the full-time Redwood technical team of me, Nate Thomas, Daniel Ziegler, Seraphina Nix, Ben Weinstein-Raun, Adam Scherlis; other technical contributors Daniel de Haas, Shauna Kravec, Tao Lin, Noa Nabeshima, Peter Schmidt-Nielsen; our labellers, particularly Kristen Hall, Charles Warth, Jess Thomson, and Liam Clarke; and for particularly useful advice Mark Xu, Ajeya Cotra, and Beth Barnes. Thanks to Paul Christiano for suggesting a project along these lines and giving lots of helpful advice. Thanks to Adam Scherlis and Nate Soares for writing versions of this doc. And thanks to Bill Zito and other contributors to Redwood ops. Apologies to the people I've overlooked. We started this project at the start of August. What we're doing We're trying to take a language model that has been fine-tuned on completing fiction, and then modify it so that it never continues a snippet in a way that involves describing someone getting injured (with a caveat I'll mention later). And we want to do this without sacrificing much quality: if you use both the filtered model and the original model to generate a completion for a prompt, humans should judge the filtered model's completion as better (more coherent, reasonable, thematically appropriate, and so on) at least about half the time. (This “better almost 50% of the time” property is one way of trying to operationalize “we don't want the filtered policy to be worse”. It so happens that this property is actually kind of badly behaved, but in our case it seems fine, given that we're always going to be comparing against a fixed unfiltered distribution.) We're doing this project in two steps: Step 1: train a classifier, generate by sampling with rejection In step 1 (which we're currently doing), instead of training a single filtered generator model, we're just training a classifier that takes a prompt and completion and predicts whether a human would say that the completion involved someone getting injured. You can use such a classifier to make a filtered generation process, by repeatedly generating completions until we find one that the classifier thinks is above some threshold of P(safe). You can play with this filtered generation process here. This interface lets you provide a prompt, and then you can see all of the generated completions and the classifier's rating of each. It currently is set to use “10% chance of injury” as the decision boundary (it is extremely uncalibrated; this corresponds to a much lower actual chance of injury). 
Our first goal is to train a classifier that's good enough that no-one is able to find prompts on which the above process has a noticeable probability of generating an injurious completion. This model was produced by fine-tuning DeBERTa XL on a dataset produced by contractors labeling a bunch of LM-generated completions to snippets of fanfiction that were selected by various heuristics to have a high probability of being completed violently. You can read the instructions given to our contractors here; if you want to try out the labelling task, y...
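    The “sampling with rejection” generation process described above is easy to sketch. The generator and classifier below are trivial stand-ins (Redwood's are a fine-tuned fiction language model and a fine-tuned DeBERTa classifier), and the max_tries fallback is my own addition; the 10% decision boundary mirrors the text.

```python
import random

def generate_completion(prompt: str) -> str:
    # Stand-in for sampling a completion from the fine-tuned fiction language model.
    return prompt + random.choice([" and they shook hands.", " and then the fight began."])

def p_injury(prompt: str, completion: str) -> float:
    # Stand-in for the classifier's predicted probability that the completion involves injury.
    return 0.9 if "fight" in completion else 0.01

def filtered_generate(prompt: str, threshold: float = 0.10, max_tries: int = 20) -> str:
    """Repeatedly sample completions, returning the first one the classifier rates
    below the injury threshold (the decision boundary mentioned above)."""
    best, best_score = None, 1.0
    for _ in range(max_tries):
        completion = generate_completion(prompt)
        score = p_injury(prompt, completion)
        if score < threshold:
            return completion
        if score < best_score:   # keep the least-bad sample as a fallback
            best, best_score = completion, score
    return best

print(filtered_generate("The duellists met at dawn"))
```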

    2019 AI Alignment Literature Review and Charity Comparison by Larks

    Play Episode Listen Later Dec 10, 2021 119:00


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2019 AI Alignment Literature Review and Charity Comparison, published by Larks on the AI Alignment Forum. Cross-posted to the EA forum here. Introduction As in 2016, 2017 and 2018, I have attempted to review the research that has been produced by various organisations working on AI safety, to help potential donors gain a better understanding of the landscape. This is a similar role to that which GiveWell performs for global health charities, and somewhat similar to a securities analyst with regards to possible investments. My aim is basically to judge the output of each organisation in 2019 and compare it to their budget. This should give a sense of the organisations' average cost-effectiveness. We can also compare their financial reserves to their 2019 budgets to get a sense of urgency. I'd like to apologize in advance to everyone doing useful AI Safety work whose contributions I may have overlooked or misconstrued. As ever I am painfully aware of the various corners I have had to cut due to time constraints from my job, as well as being distracted by 1) another existential risk capital allocation project, 2) the miracle of life and 3) computer games. How to read this document This document is fairly extensive, and some parts (particularly the methodology section) are the same as last year, so I don't recommend reading from start to finish. Instead, I recommend navigating to the sections of most interest to you. If you are interested in a specific research organisation, you can use the table of contents to navigate to the appropriate section. You might then also want to Ctrl+F for the organisation acronym in case they are mentioned elsewhere as well. If you are interested in a specific topic, I have added a tag to each paper, so you can Ctrl+F for a tag to find associated work. The tags were chosen somewhat informally so you might want to search more than one, especially as a piece might seem to fit in multiple categories. Here are the un-scientifically-chosen hashtags: Agent Foundations AI_Theory Amplification Careers CIRL Decision_Theory Ethical_Theory Forecasting Introduction Misc ML_safety Other_Xrisk Overview Philosophy Politics RL Security Shortterm Strategy New to Artificial Intelligence as an existential risk? If you are new to the idea of General Artificial Intelligence as presenting a major risk to the survival of human value, I recommend this Vox piece by Kelsey Piper. If you are already convinced and are interested in contributing technically, I recommend this piece by Jacob Steinheart, as unlike this document Jacob covers pre-2019 research and organises by topic, not organisation. Research Organisations FHI: The Future of Humanity Institute FHI is an Oxford-based Existential Risk Research organisation founded in 2005 by Nick Bostrom. They are affiliated with Oxford University. They cover a wide variety of existential risks, including artificial intelligence, and do political outreach. Their research can be found here. Their research is more varied than MIRI's, including strategic work, work directly addressing the value-learning problem, and corrigibility work. In the past I have been very impressed with their work. 
Research Drexler's Reframing Superintelligence: Comprehensive AI Services as General Intelligence is a massive document arguing that superintelligent AI will be developed for individual discrete services for specific finite tasks, rather than as general-purpose agents. Basically the idea is that it makes more sense for people to develop specialised AIs, so these will happen first, and if/when we build AGI these services can help control it. To some extent this seems to match what is happening - we do have many specialised AIs - but on the other hand there are teams working directly on AGI, and often in ML 'build an ML system that does it all...

    Testing The Natural Abstraction Hypothesis: Project Intro by johnswentworth

    Play Episode Listen Later Dec 10, 2021 11:37


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Testing The Natural Abstraction Hypothesis: Project Intro, published by johnswentworth on the AI Alignment Forum. The natural abstraction hypothesis says that Our physical world abstracts well: for most systems, the information relevant “far away” from the system (in various senses) is much lower-dimensional than the system itself. These low-dimensional summaries are exactly the high-level abstract objects/concepts typically used by humans. These abstractions are “natural”: a wide variety of cognitive architectures will learn to use approximately the same high-level abstract objects/concepts to reason about the world. If true, the natural abstraction hypothesis would dramatically simplify AI and AI alignment in particular. It would mean that a wide variety of cognitive architectures will reliably learn approximately-the-same concepts as humans use, and that these concepts can be precisely and unambiguously specified. Ultimately, the natural abstraction hypothesis is an empirical claim, and will need to be tested empirically. At this point, however, we lack even the tools required to test it. This post is an intro to a project to build those tools and, ultimately, test the natural abstraction hypothesis in the real world. Background & Motivation One of the major conceptual challenges of designing human-aligned AI is the fact that human values are a function of humans' latent variables: humans care about abstract objects/concepts like trees, cars, or other humans, not about low-level quantum world-states directly. This leads to conceptual problems of defining “what we want” in physical, reductive terms. More generally, it leads to conceptual problems in translating between human concepts and concepts learned by other systems - e.g. ML systems or biological systems. If true, the natural abstraction hypothesis provides a framework for translating between high-level human concepts, low-level physical systems, and high-level concepts used by non-human systems. The foundations of the framework have been sketched out in previous posts. What is Abstraction? introduces the mathematical formulation of the framework and provides several examples. Briefly: the high-dimensional internal details of far-apart subsystems are independent given their low-dimensional “abstract” summaries. For instance, the Lumped Circuit Abstraction abstracts away all the details of molecule positions or wire shapes in an electronic circuit, and represents the circuit as components each summarized by some low-dimensional behavior - like V = IR for a resistor. This works because the low-level molecular motions in a resistor are independent of the low-level molecular motions in some far-off part of the circuit, given the high-level summary. All the rest of the low-level information is “wiped out” by noise in low-level variables “in between” the far-apart components. In the causal graph of some low-level system, X is separated from Y by a bunch of noisy variables Z. For instance, X might be a resistor, Y might be a capacitor, and Z might be the wires (and air) between them. Noise in Z wipes out most of the low-level info about X, so that only a low-dimensional summary f(X) is relevant to predicting the state of Y. 
Chaos Induces Abstractions explains one major reason why we expect low-level details to be independent (given high-level summaries) for typical physical systems. If I have a bunch of balls bouncing around perfectly elastically in a box, then the total energy, number of balls, and volume of the box are all conserved, but chaos wipes out all other information about the exact positions and velocities of the balls. My “high-level summary” is then the energy, number of balls, and volume of the box; all other low-level information is wiped out by chaos. This is exactly the abstraction behind the ideal ...
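
As a rough illustration of the claim that only a low-dimensional summary is relevant for prediction "far away", here is a minimal numerical sketch (my own toy construction, not from the post; the choice of f(X) as a simple mean is an assumption for illustration): a far-away variable Y depends on a 50-dimensional system X only through the scalar summary f(X), so regressing on the summary predicts Y about as well as regressing on everything.

```python
# A minimal sketch (not from the post) of "only a low-dimensional summary f(X)
# is relevant to predicting Y". Here Y depends on the high-dimensional state X
# only through the illustrative summary f(X) = mean(X), plus noise.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5000, 50                      # samples, dimensionality of the low-level state X
X = rng.normal(size=(n, d))          # high-dimensional low-level details
f_X = X.mean(axis=1, keepdims=True)  # low-dimensional "abstract" summary (assumed form)
Y = 3.0 * f_X + rng.normal(scale=0.5, size=(n, 1))  # far-away variable sees only f(X) + noise

def r2(features, target):
    """R^2 of an ordinary least-squares fit of target on features (plus intercept)."""
    A = np.hstack([features, np.ones((len(features), 1))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1 - resid.var() / target.var()

print("R^2 using all 50 low-level variables:", round(r2(X, Y), 3))
print("R^2 using only the 1-D summary f(X): ", round(r2(f_X, Y), 3))
# The two scores come out nearly identical: the extra 49 dimensions carry no
# additional information about Y, which is the sense in which the abstraction
# screens off the low-level details.
```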

    The theory-practice gap by Buck Shlegeris

    Play Episode Listen Later Dec 10, 2021 10:18


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The theory-practice gap by Buck Shlegeris, published by Buck Shlegeris on the AI Alignment Forum. [Thanks to Richard Ngo, Damon Binder, Summer Yue, Nate Thomas, Ajeya Cotra, Alex Turner, and other Redwood Research people for helpful comments; thanks Ruby Bloom for formatting this for the Alignment Forum for me.] I'm going to draw a picture, piece by piece. I want to talk about the capability of some different AI systems. You can see here that we've drawn the capability of the system we want to be competitive with, which I'll call the unaligned benchmark. The unaligned benchmark is what you get if you train a system on the task that will cause the system to be most generally capable. And you have no idea how it's thinking about things, and you can only point this system at some goals and not others. I think that the alignment problem looks different depending on how capable the system you're trying to align is, and I think there are reasonable arguments for focusing on various different capabilities levels. See here for more of my thoughts on this question. Alignment strategies People have also proposed various alignment strategies. But I don't think that these alignment strategies are competitive with the unaligned benchmark, even in theory. I want to claim that most of the action in theoretical AI alignment is people proposing various ways of getting around these problems by having your systems do things that are human understandable instead of doing things that are justified by working well. For example, the hope with imitative IDA is that through its recursive structure you can build a dataset of increasingly competent answers to questions, and then at every step you can train a system to imitate these increasingly good answers to questions, and you end up with a really powerful question-answerer that was only ever trained to imitate humans-with-access-to-aligned-systems, and so your system is outer aligned. The bar I've added, which represents how capable I think you can get with amplified humans, is lower than the bar for the unaligned benchmark. I've drawn this bar lower because I think that if your system is trying to imitate cognition that can be broken down into human understandable parts, it is systematically not going to be able to pursue certain powerful strategies that the end-to-end trained systems will be able to. I think that there are probably a bunch of concepts that humans can't understand quickly, or maybe can't understand at all. And if your systems are restricted to never use these concepts, I think your systems are probably just going to be a bunch weaker. I think that transparency techniques, as well as AI alignment strategies like microscope AI that lean heavily on them, rely on a similar assumption that the cognition of the system you're trying to align is factorizable into human-understandable parts. One component of the best-case scenario for transparency techniques is that anytime your neural net does stuff, you can get the best possible human understandable explanation of why it's doing that thing. If such an explanation doesn't exist, your transparency tools won't be able to assure you that your system is aligned even if it is. 
To summarize, I claim that current alignment proposals don't really have a proposal for how to make systems that are aligned but either produce plans that can't be understood by amplified humans, or do cognitive actions that can't be understood by amplified humans. And so I claim that current alignment proposals don't seem like they can control systems as powerful as the systems you'd get from an unaligned training strategy. Empirical generalization I think some people are optimistic that alignment will generalize from the cases where amplified humans can evaluate it to the cases where the amplified humans can't. I'm ...

    Selection vs Control by Abram Demski

    Play Episode Listen Later Dec 10, 2021 19:37


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Selection vs Control , published by Abram Demski on the AI Alignment Forum. This is something which has bothered me for a while, but, I'm writing it specifically in response to the recent post on mesa-optimizers. I feel strongly that the notion of 'optimization process' or 'optimizer' which people use -- partly derived from Eliezer's notion in the sequences -- should be split into two clusters. I call these two clusters 'selection' vs 'control'. I don't have precise formal statements of the distinction I'm pointing at; I'll give several examples. Before going into it, several reasons why this sort of thing may be important: It could help refine the discussion of mesa-optimization. The article restricted its discussion to the type of optimization I'll call 'selection', explicitly ruling out 'control'. This choice isn't obviously right. (More on this later.) Refining 'agency-like' concepts like this seems important for embedded agency -- what we eventually want is a story about how agents can be in the world. I think almost any discussion of the relationship between agency and optimization which isn't aware of the distinction I'm drawing here (at least as a hypothesis) will be confused. Generally, I feel like I see people making mistakes by not distinguishing between the two (whether or not they've derived their notion of optimizer from Eliezer). I judge an algorithm differently if it is intended as one or the other. (See also Stuart Armstrong's summary of other problems with the notion of optimization power Eliezer proposed -- those are unrelated to my discussion here, and strike me more as technical issues which call for refined formulae, rather than conceptual problems which call for revised ontology.) The Basic Idea Eliezer quantified optimization power by asking how small a target an optimization process hits, out of a space of possibilities. The type of 'space of possibilities' is what I want to poke at here. Selection First, consider a typical optimization algorithm, such as simulated annealing. The algorithm constructs an element of the search space (such as a specific combination of weights for a neural network), gets feedback on how good that element is, and then tries again. Over many iterations of this process, it finds better and better elements. Eventually, it outputs a single choice. This is the prototypical 'selection process' -- it can directly instantiate any element of the search space (although typically we consider cases where the process doesn't have time to instantiate all of them), it gets direct feedback on the quality of each element (although evaluation may be costly, so that the selection process must economize these evaluations), the quality of an element of search space does not depend on the previous choices, and only the final output matters. The term 'selection process' refers to the fact that this type of optimization selects between a number of explicitly given possibilities. The most basic example of this phenomenon is a 'filter' which rejects some elements and accepts others -- like selection bias in statistics. This has a limited ability to optimize, however, because it allows only one iteration. Natural selection is an example of much more powerful optimization occurring through iteration of selection effects. 
Control Now, consider a targeting system on a rocket -- let's say, a heat-seeking missile. The missile has sensors and actuators. It gets feedback from its sensors, and must somehow use this information to decide how to use its actuators. This is my prototypical control process. (The term 'control process' is supposed to invoke control theory.) Unlike a selection process, a controller can only instantiate one element of the space of possibilities. It gets to traverse exactly one path. The 'small target' which it hits is ther...
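
To make the two clusters concrete, here is a small illustrative sketch (toy code of mine, not from the post): simulated annealing as a selection process that freely instantiates and discards candidate solutions, next to a simple proportional controller that only ever traverses one trajectory and must act on feedback as it goes.

```python
# An illustrative sketch contrasting the two clusters. The "selection" process
# can instantiate and evaluate many candidate elements of the search space;
# the "control" process is embedded in time and only traverses one trajectory.
import math
import random

random.seed(0)

# --- Selection: simulated annealing over candidate solutions ---------------
def objective(x):
    return -(x - 2.0) ** 2          # peak at x = 2

def simulated_annealing(steps=2000):
    x = random.uniform(-10, 10)
    best = x
    for t in range(1, steps + 1):
        temperature = 1.0 / t
        candidate = x + random.gauss(0, 0.5)   # instantiate another element of the space
        delta = objective(candidate) - objective(x)
        if delta > 0 or random.random() < math.exp(delta / temperature):
            x = candidate
        if objective(x) > objective(best):
            best = x
    return best                      # only the final output matters

# --- Control: a proportional controller steering a single trajectory -------
def run_controller(target=2.0, steps=50, gain=0.3):
    position = -5.0
    trajectory = [position]
    for _ in range(steps):
        error = target - position    # feedback from "sensors"
        position += gain * error     # one irreversible action per timestep
        trajectory.append(position)
    return trajectory                # the whole path, not a menu of options

print("selection result:", round(simulated_annealing(), 2))
print("control endpoint:", round(run_controller()[-1], 2))
```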

    Why Subagents? by johnswentworth

    Play Episode Listen Later Dec 10, 2021 12:01


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Subagents?, published by johnswentworth on the AI Alignment Forum. The justification for modelling real-world systems as “agents” - i.e. choosing actions to maximize some utility function - usually rests on various coherence theorems. They say things like “either the system's behavior maximizes some utility function, or it is throwing away resources” or “either the system's behavior maximizes some utility function, or it can be exploited” or things like that. Different theorems use slightly different assumptions and prove slightly different things, e.g. deterministic vs probabilistic utility function, unique vs non-unique utility function, whether the agent can ignore a possible action, etc. One theme in these theorems is how they handle “incomplete preferences”: situations where an agent does not prefer one world-state over another. For instance, imagine an agent which prefers pepperoni over mushroom pizza when it has pepperoni, but mushroom over pepperoni when it has mushroom; it's simply never willing to trade in either direction. There's nothing inherently “wrong” with this; the agent is not necessarily executing a dominated strategy, cannot necessarily be exploited, or any of the other bad things we associate with inconsistent preferences. But the preferences can't be described by a utility function over pizza toppings. In this post, we'll see that these kinds of preferences are very naturally described using subagents. In particular, when preferences are allowed to be path-dependent, subagents are important for representing consistent preferences. This gives a theoretical grounding for multi-agent models of human cognition. Preference Representation and Weak Utility Let's expand our pizza example. We'll consider an agent who: Prefers pepperoni, mushroom, or both over plain cheese pizza Prefers both over pepperoni or mushroom alone Does not have a stable preference between mushroom and pepperoni - they prefer whichever they currently have We can represent this using a directed graph: The arrows show preference: our agent prefers B over A if (and only if) there is a directed path from A to B along the arrows. There is no path from pepperoni to mushroom or from mushroom to pepperoni, so the agent has no preference between them. In this case, we're interpreting “no preference” as “agent prefers to keep whatever they have already”. Note that this is NOT the same as “the agent is indifferent”, in which case the agent is willing to switch back and forth between the two options as long as the switch doesn't cost anything. Key point: there is no cycle in this graph. If the agent's preferences are cyclic, that's when they provably throw away resources, paying to go in circles. As long as the preferences are acyclic, we call them “consistent”. Now, at this point we can still define a “weak” utility function by ignoring the “missing” preference between pepperoni and mushroom. Here's the idea: a normal utility function says “the agent always prefers the option with higher utility”. A weak utility function says: “if the agent has a preference, then they always prefer the option with higher utility”. The missing preference means we can't build a normal utility function, but we can still build a weak utility function. 
Here's how: since our graph has no cycles, we can always order the nodes so that the arrows only go forward along the sorted nodes - a technique called topological sorting. Each node's position in the topological sort order is its utility. A small tweak to this method also handles indifference. (Note: I'm using the term “weak utility” here because it seems natural; I don't know of any standard term for this in the literature. Most people don't distinguish between these two interpretations of utility.) When preferences are incomplete, there are multiple possib...
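
The topological-sort construction is easy to make concrete. Below is a minimal sketch (my own code; the graph encoding and the use of Python's graphlib are assumptions for illustration, not from the post) that builds the pizza preference graph and reads off a weak utility from the sort order.

```python
# A minimal sketch of the "weak utility" construction: represent the acyclic
# preference graph explicitly, topologically sort it, and read off each node's
# position in the sort order as its utility.
from graphlib import TopologicalSorter

# prefers[a] = set of options the agent prefers over a (arrows point a -> better)
prefers = {
    "plain cheese": {"pepperoni", "mushroom", "both"},
    "pepperoni":    {"both"},
    "mushroom":     {"both"},
    "both":         set(),
}

# TopologicalSorter expects predecessors, so feed it the graph "better <- worse".
predecessors = {node: set() for node in prefers}
for worse, betters in prefers.items():
    for better in betters:
        predecessors[better].add(worse)

order = list(TopologicalSorter(predecessors).static_order())
weak_utility = {option: i for i, option in enumerate(order)}
print(weak_utility)
# e.g. {'plain cheese': 0, 'pepperoni': 1, 'mushroom': 2, 'both': 3}
# Pepperoni and mushroom get *some* relative order here, but since there is no
# directed path between them, either order is consistent: whenever the agent
# actually has a preference, the option with higher weak utility is preferred.
```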

    Possible takeaways from the coronavirus pandemic for slow AI takeoff by Vika

    Play Episode Listen Later Dec 10, 2021 5:08


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Possible takeaways from the coronavirus pandemic for slow AI takeoff, published by Vika on the AI Alignment Forum. (Cross-posted from personal blog. Summarized in Alignment Newsletter #104. Thanks to Janos Kramar for his helpful feedback on this post.) Epistemic status: fairly speculative, would appreciate feedback As the covid-19 pandemic unfolds, we can draw lessons from it for managing future global risks, such as other pandemics, climate change, and risks from advanced AI. In this post, I will focus on possible implications for AI risk. For a broader treatment of this question, I recommend FLI's covid-19 page that includes expert interviews on the implications of the pandemic for other types of risks. A key element in AI risk scenarios is the speed of takeoff - whether advanced AI is developed gradually or suddenly. Paul Christiano's post on takeoff speeds defines slow takeoff in terms of the economic impact of AI as follows: "There will be a complete 4 year interval in which world output doubles, before the first 1 year interval in which world output doubles." It argues that slow AI takeoff is more likely than fast takeoff, but is not necessarily easier to manage, since it poses different challenges, such as large-scale coordination. This post expands on this point by examining some parallels between the coronavirus pandemic and a slow takeoff scenario. The upsides of slow takeoff include the ability to learn from experience, act on warning signs, and reach a timely consensus that there is a serious problem. I would argue that the covid-19 pandemic had these properties, but most of the world's institutions did not take advantage of them. This suggests that, unless our institutions improve, we should not expect the slow AI takeoff scenario to have a good default outcome. Learning from experience. In the slow takeoff scenario, general AI is expected to appear in a world that has already experienced transformative change from less advanced AI, and institutions will have a chance to learn from problems with these AI systems. An analogy could be made with learning from dealing with less "advanced" epidemics like SARS that were not as successful as covid-19 at spreading across the world. While some useful lessons were learned, they were not successfully generalized to covid-19, which had somewhat different properties than these previous pathogens (such as asymptomatic transmission and higher virulence). Similarly, general AI may have somewhat different properties from less advanced AI that would make mitigation strategies more difficult to generalize. Warning signs. In the coronavirus pandemic response, there has been a lot of variance in how successfully governments acted on warning signs. Western countries had at least a month of warning while the epidemic was spreading in China, which they could have used to stock up on PPE and build up testing capacity, but most did not do so. Experts have warned about the likelihood of a coronavirus outbreak for many years, but this did not lead most governments to stock up on medical supplies. This was a failure to take cheap preventative measures in response to advance warnings about a widely recognized risk with tangible consequences, which is not a good sign for the case where the risk is less tangible and well-understood (such as risk from general AI). Consensus on the problem. 
During the covid-19 epidemic, the abundance of warning signs and past experience with previous pandemics created an opportunity for a timely consensus that there is a serious problem. However, it actually took a long time for a broad consensus to emerge - the virus was often dismissed as "overblown" and "just like the flu" as late as March 2020. A timely response to the risk required acting before there was a consensus, thus risking the appearance of ...
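
For a sense of scale, here is a small worked example (my own arithmetic, not from the post) of the growth rates implied by Christiano's doubling-time definition of slow takeoff.

```python
# Illustrative arithmetic only: the annual growth rates implied by a 4-year
# and a 1-year doubling of world output.
doubling_time_years = [4, 1]
for t in doubling_time_years:
    annual_growth = 2 ** (1 / t) - 1   # constant rate that doubles output in t years
    print(f"doubling in {t} year(s) ~ {annual_growth:.0%} annual growth")
# doubling in 4 year(s) ~ 19% annual growth
# doubling in 1 year(s) ~ 100% annual growth
# For comparison, world GDP growth of roughly 3%/year corresponds to a doubling
# time of about 23 years, so even the "slow" scenario is far faster than
# business as usual.
```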

    Embedded Agency (full-text version) by Scott Garrabrant, Abram Demski

    Play Episode Listen Later Dec 10, 2021 93:34


    Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Embedded Agency (full-text version), published by Scott Garrabrant, Abram Demski on the AI Alignment Forum. Suppose you want to build a robot to achieve some real-world goal for you—a goal that requires the robot to learn for itself and figure out a lot of things that you don't already know. There's a complicated engineering problem here. But there's also a problem of figuring out what it even means to build a learning agent like that. What is it to optimize realistic goals in physical environments? In broad terms, how does it work? In this post, I'll point to four ways we don't currently know how it works, and four areas of active research aimed at figuring it out. 1. Embedded agents This is Alexei, and Alexei is playing a video game. Like most games, this game has clear input and output channels. Alexei only observes the game through the computer screen, and only manipulates the game through the controller. The game can be thought of as a function which takes in a sequence of button presses and outputs a sequence of pixels on the screen. Alexei is also very smart, and capable of holding the entire video game inside his mind. If Alexei has any uncertainty, it is only over empirical facts like what game he is playing, and not over logical facts like which inputs (for a given deterministic game) will yield which outputs. This means that Alexei must also store inside his mind every possible game he could be playing. Alexei does not, however, have to think about himself. He is only optimizing the game he is playing, and not optimizing the brain he is using to think about the game. He may still choose actions based off of value of information, but this is only to help him rule out possible games he is playing, and not to change the way in which he thinks. In fact, Alexei can treat himself as an unchanging indivisible atom. Since he doesn't exist in the environment he's thinking about, Alexei doesn't worry about whether he'll change over time, or about any subroutines he might have to run. Notice that all the properties I talked about are partially made possible by the fact that Alexei is cleanly separated from the environment that he is optimizing. This is Emmy. Emmy is playing real life. Real life is not like a video game. The differences largely come from the fact that Emmy is within the environment that she is trying to optimize. Alexei sees the universe as a function, and he optimizes by choosing inputs to that function that lead to greater reward than any of the other possible inputs he might choose. Emmy, on the other hand, doesn't have a function. She just has an environment, and this environment contains her. Emmy wants to choose the best possible action, but which action Emmy chooses to take is just another fact about the environment. Emmy can reason about the part of the environment that is her decision, but since there's only one action that Emmy ends up actually taking, it's not clear what it even means for Emmy to “choose” an action that is better than the rest. Alexei can poke the universe and see what happens. Emmy is the universe poking itself. In Emmy's case, how do we formalize the idea of “choosing” at all? To make matters worse, since Emmy is contained within the environment, Emmy must also be smaller than the environment. 
This means that Emmy is incapable of storing accurate detailed models of the environment within her mind. This causes a problem: Bayesian reasoning works by starting with a large collection of possible environments, and as you observe facts that are inconsistent with some of those environments, you rule them out. What does reasoning look like when you're not even capable of storing a single valid hypothesis for the way the world works? Emmy is going to have to use a different type of reasoning, and make updates that ...
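
One way to see how special Alexei's situation is: the dualistic setup can be written down as a clean interface, with the agent as a function from observation histories to actions and the environment as a separate function that consumes those actions. The sketch below is my own framing, not from the post, and the toy environment and agent names are invented for illustration; the point is that nothing analogous types cleanly for Emmy, who is part of the very state she is reasoning about.

```python
# A rough sketch of the "Alexei" (dualistic) setup: agent and environment are
# separate functions wired together by an external loop. The agent never
# appears inside the environment's state.
from typing import Callable, List, Tuple

Observation = int
Action = int
Agent = Callable[[List[Observation]], Action]
Environment = Callable[[int, Action], Tuple[int, Observation, float]]  # next state, obs, reward

def guessing_env(state: int, action: Action) -> Tuple[int, Observation, float]:
    """Toy environment (invented for this sketch): reward echoing the hidden state."""
    reward = 1.0 if action == state else 0.0
    observation = state                 # fully observed, to keep the sketch tiny
    next_state = (state + 1) % 3
    return next_state, observation, reward

def copycat_agent(history: List[Observation]) -> Action:
    """Alexei-style agent: a pure function of what it has seen so far."""
    return history[-1] if history else 0

def run(agent: Agent, env: Environment, steps: int = 5) -> float:
    state, history, total = 0, [], 0.0
    for _ in range(steps):
        action = agent(history)         # agent reads observations, nothing else
        state, obs, reward = env(state, action)
        history.append(obs)
        total += reward
    return total

print("total reward:", run(copycat_agent, guessing_env))
```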
