Luke Stamper discusses his upcoming wildlife Field Day at the Northeast Research Station in St. Joseph on 5/12/23 with Dennis, RL, Bruce, and Kylie. You can RSVP by calling 318-649-2663.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some Thoughts on Virtue Ethics for AIs, published by peligrietzer on May 2, 2023 on LessWrong. This post argues for the desirability and plausibility of AI agents whose values have a structure I call ‘praxis-based.' The idea draws on various aspects of virtue ethics, and basically amounts to an RL-flavored take on that philosophical tradition. Praxis-based values as I define them are, informally, reflective decision-influences matching the description ‘promote x x-ingly': ‘promote peace peacefully,' ‘promote corrigibility corrigibly,' ‘promote science scientifically.' I will later propose a quasi-formal definition of this values-type, but the general idea is that certain values are an ouroboros of means and end. Such values frequently come up in human “meaning of life” activities (e.g. math, art, craft, friendship, athletics, romance, technology), as well as in complex forms of human morality (e.g. peace, democracy, compassion, respect, honesty). While this is already indirect reason to suspect that a human-aligned AI should have ‘praxis-based' values, there is also a central direct reason: traits such as corrigibility, transparency, and niceness can only function properly in the form of ‘praxis-based' values. It's widely accepted that if early strategically aware AIs possess values like corrigibility, transparency, and perhaps niceness, further alignment efforts are much more likely to succeed. But values like corrigibility or transparency or niceness don't easily fit into an intuitively consequentialist form like ‘maximize lifetime corrigible behavior' or ‘maximize lifetime transparency.' In fact, an AI valuing its own corrigibility or transparency or niceness in an intuitively consequentialist way can lead to extreme power-seeking whereby the AI violently remakes the world to (at a minimum) protect itself from the risk that humans will modify said value. On the other hand, constraints or taboos or purely negative values (a.k.a. ‘deontological restrictions') are widely believed to be weak, in the sense that an advanced AI will come to work around them or uproot them: ‘never lie' or ‘never kill' or ‘never refuse a direct order from the president' are poor substitutes for active transparency, niceness, and corrigibility. The idea of ‘praxis-based' values is meant to capture the normal, sensible way we want an agent to value corrigibility or transparency or niceness, which intuitively-consequentialist values and deontology both fail to capture. We want an agent that (e.g.) actively tries to be transparent, and to cultivate its own future transparency and its own future valuing of transparency, but that will not (for instance) engage in deception and plotting when it expects a high future-transparency payoff. Having lightly motivated the idea that ‘praxis-based' values are desirable from an alignment point of view, the rest of this post will survey key premises of the hypothesis that ‘praxis-based' values are a viable alignment goal. I'm going to assume an agent with some form of online reinforcement learning going on, and draw on ‘shards' talk pretty freely. I informally described a ‘praxis-based' value as having the structure 'promote x x-ingly.' 
Here is a rough formulation of what I mean, put in terms of a utility-theoretic description of a shard that implements an alignment-enabling value x: Actions (or more generally 'computations') get an x-ness rating. We define the x shard's expected utility conditional on a candidate action a as the sum of two utility functions: a bounded utility function on the x-ness of a and a more tightly bounded utility function on the expected aggregate x-ness of the agent's future actions conditional on a. (So the shard will choose an action with mildly suboptimal x-ness if it gives a big boost to expected aggregate future x-ness, but r...
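One way to transcribe that description into symbols (my notation, not the author's; the excerpt above cuts off mid-sentence): write x(a) for the x-ness rating of a candidate action a, and let A_future denote the agent's future actions. Then the x shard's expected utility of a is

```latex
U_x(a) \;=\; u_1\!\bigl(x(a)\bigr) \;+\; u_2\!\left(\mathbb{E}\!\left[\,\sum_{a' \in A_{\text{future}}} x(a') \;\middle|\; a\right]\right),
\qquad
\sup u_2 - \inf u_2 \;<\; \sup u_1 - \inf u_1,
```

where u_1 is the bounded utility function on the current action's x-ness and u_2 is the more tightly bounded utility function on expected aggregate future x-ness. Because u_2's range is narrower, a large projected gain in future x-ness can justify only a mildly suboptimal choice now; it can never justify arbitrarily large sacrifices of present x-ness.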
“Mom, he's wearing a dress” - Hólmsteinn Eiður Guðrúnarson. He is almost sixty, heterosexual, married for nearly three decades, holds two university degrees, works in a preschool, and started using lipstick, nail polish, and dresses a few years ago. Hólmsteinn Eiður Guðrúnarson says he uses his male privilege in this way to have a positive influence on the children he works with, but first and foremost because it makes him feel good. He couldn't care less what people think of him and says he generally gets support for breaking the traditional mold of masculinity. Indeed, he considers it a form of masculinity to be able to stand outside the conventional image. Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). Veganbúðin, BM Vallá and ÖRLÖ, together with the patrons of Karlmennskan, present this episode. . . . You can support further podcast production and educational outreach on social media by becoming a patron at karlmennskan.is/styrkja
KayGee of Naughty By Nature and RL of Next are Next By Nature, and they join us for this all-new exclusive episode.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: For alignment, we should simultaneously use multiple theories of cognition and value, published by Roman Leventov on April 24, 2023 on The AI Alignment Forum. This post is a follow-up to "A multi-disciplinary view on AI safety research". I elaborate on some arguments behind this view. TL;DR: please skim section headings and bolded sentences in the text. Computationally tractable mathematical models of alignment are bound to be biased and blind to certain aspects of human values. No single mathematical model of human values that has orders of magnitude fewer degrees of freedom than an actual human will adequately capture the complexity of value, because humans are complex systems and therefore cannot be reduced to a much simpler model. If the model is sufficiently complex to robustly capture human values, such as a whole-brain emulation, then the ethical concerns and S-risks of actually using these models for alignment appear, because the model itself may suffer. Many mathematical theories of human cognition, and frameworks for computing (inferring) human values, are being considered as the basis for alignment, along with process theories of alignment that implicitly rely on a particular mathematical theory even if they don't infer (humans' or AIs') values explicitly: shard theory (RL-based), Beren Millidge's recent computational anatomy of human values, cooperative inverse reinforcement learning, Bayesian models and approaches, and various linguistic process theories of alignment that I expect to become very hot this year due to the astonishing success of LLMs. However, since all these theories collapse the complexity of humans (or else they are equivalent to full human simulations), they are all bound to be incomplete. Moreover, all these theories are bound to be biased (this is a form of inductive bias if you wish), that is, to be relatively blind to specific kinds of human values, or to specific aspects of human nature that we can see as somehow related to (or producing) “values”. In other words, human values are not only complex in the sense that they are very elaborate. Crucially, human values are also not capturable within a single mathematical framework or ontology for describing them. From “solving the alignment problem” to engineering the alignment process. The main implication of the above thesis is that we should abandon the frame that we should “solve” the alignment problem and seek a smart theory that will “crack” this problem. I feel that a fair amount of unproductive debate and unproductive allocation of alignment research resources stems from this illusion. People often debate whether this or that theory “can or cannot succeed” (in “solving” alignment, it is implied), or try to find the “best” theory and invest their effort into improving that theory because it's the “best bet”. Instead, we should adopt a portfolio approach: theory A captures 90% of the “value complexity” to be aligned; theory B largely overlaps with theory A, but together they capture 95% of value complexity; adding theory C to the mix raises it to 97%; and so on. (Of course, these “percent” are fictitious and cannot be actually computed.)
This is an engineering approach of adding extra assurances to the alignment process until all stakeholders of the system agree that the quality (the quality of being sufficiently aligned, or alignable to humans, in this case) is assured well enough for production deployment of the system. When we consider this, it becomes clear that marshalling all effort behind improving a single theory is not optimal, vaguely speaking, due to the law of diminishing returns (also, as noted above, a good fraction of the alignment research community's brain power goes into actually “finding” that best theory, on both individual and collective levels). Su...
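As a purely illustrative toy (my own, with made-up numbers that are as fictitious as the post says such percentages must be): if each theory is modeled as covering a random subset of the "facets" of value, overlapping theories add progressively less marginal coverage, which is the diminishing-returns picture the post appeals to.

```python
# Toy illustration (not from the post): each hypothetical theory covers a
# random subset of "facets" of value-space; because the subsets overlap,
# each additional theory adds less marginal coverage than the last.
import random

random.seed(0)
FACETS = set(range(1000))                  # stand-in for "aspects of human values"

def random_theory(coverage_fraction):
    """A hypothetical theory that captures a random fraction of the facets."""
    return set(random.sample(sorted(FACETS), int(coverage_fraction * len(FACETS))))

theories = [random_theory(0.9), random_theory(0.6), random_theory(0.6)]

covered = set()
for i, theory in enumerate(theories, start=1):
    covered |= theory                      # union of everything captured so far
    print(f"after theory {i}: {100 * len(covered) / len(FACETS):.1f}% of facets covered")
```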
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Alignment Research Engineer Accelerator (ARENA): call for applicants, published by TheMcDouglas on April 17, 2023 on LessWrong. TL;DR: Apply here for the second iteration of ARENA! Introduction: We are excited to announce the second iteration of ARENA (Alignment Research Engineer Accelerator), a 6-week ML bootcamp with a focus on AI safety. Our mission is to prepare participants for full-time careers as research engineers in AI safety, e.g. at leading organizations or as independent researchers. The program will commence on May 22nd, 2023, and will be held at the Moorgate WeWork offices in London. This will overlap with SERI MATS, who are also using these offices. We expect this to bring several benefits, e.g. facilitating productive discussions about AI safety & different agendas, and allowing participants to form a better picture of what working on AI safety can look like in practice. ARENA offers a unique opportunity for those interested in AI safety to learn valuable technical skills, engage in their own projects, and make open-source contributions to AI safety-related libraries. The program is comparable to MLAB or WMLB, but extends over a longer period to facilitate deeper dives into the content, and more open-ended project work with supervision. For more information, see our website. Outline of Content: The 6-week program will be structured as follows. Chapter 0 - Fundamentals: Before getting into more advanced topics, we first cover the basics of deep learning, including basic machine learning terminology, what neural networks are, and how to train them. We will also cover some subjects we expect to be useful going forwards, e.g. using GPT-3 and 4 to streamline your learning, good coding practices, and version control. Topics include: PyTorch basics; CNNs and residual neural networks; optimization; backpropagation; hyperparameter search with Weights and Biases; model training & PyTorch Lightning. Duration: 5 days. Chapter 1 - Transformers & Mechanistic Interpretability: In this chapter, you will learn all about transformers, and build and train your own. You'll also learn about mechanistic interpretability of transformers, a field which has been advanced by Anthropic's Transformer Circuits sequence and by open-source work by Neel Nanda. Topics include: GPT models (building your own GPT-2); training and sampling from transformers; TransformerLens; in-context learning and induction heads; indirect object identification; superposition. Duration: 9 days. Chapter 2 - Reinforcement Learning: In this chapter, you will learn about some of the fundamentals of RL, and work with OpenAI's Gym environment to run your own experiments. Topics include: fundamentals of RL; vanilla policy gradient; PPO; deep Q-learning; RLHF; Gym & Gymnasium environments. Duration: 6 days. Chapter 3 - Training at Scale: There are a number of techniques that are helpful for training large-scale models efficiently. Here, you will learn more about these techniques and how to use them. The focus is on hands-on learning, rather than just a theoretical understanding. Topics include: GPUs; distributed computing; data/tensor/pipeline parallelism; finetuning LLMs. Duration: 4 days. Chapter 4 - Capstone Projects: We will conclude this program with capstone projects, where participants get to dig into something related to the course.
This should draw on much of the skill and knowledge participants will have accumulated over the previous 5 weeks. Duration: 6 days. Below is a diagram of the curriculum as a whole, and the dependencies between sections. Here is some sample material from the course, which you will be able to fully understand once you reach that point in the course. This notebook is on Indirect Object Identification (from the chapter on Transformers & Mechanistic Interpretability); it will represent one of a set of optional 2-day mi...
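For a taste of the Chapter 1 material (a sketch of my own, not taken from the actual ARENA notebooks), here is the kind of minimal single-head causal self-attention layer you write when building your own GPT-2 in PyTorch:

```python
import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    """Single-head self-attention with a causal mask (no attending to future tokens)."""
    def __init__(self, d_model=64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                                   # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))    # block attention to later positions
        return self.out(torch.softmax(scores, dim=-1) @ v)

x = torch.randn(2, 10, 64)
print(CausalSelfAttention()(x).shape)                       # torch.Size([2, 10, 64])
```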
Patreon: https://www.patreon.com/mlst Discord: https://discord.gg/ESrGqhf5CB Twitter: https://twitter.com/MLStreetTalk In this exclusive interview, Dr. Tim Scarfe sits down with Minqi Jiang, a leading PhD student at University College London and Meta AI, as they delve into the fascinating world of deep reinforcement learning (RL) and its impact on technology, startups, and research. Discover how Minqi made the crucial decision to pursue a PhD in this exciting field, and learn from his valuable startup experiences and lessons. Minqi shares his insights into balancing serendipity and planning in life and research, and explains the role of objectives and Goodhart's Law in decision-making. Get ready to explore the depths of robustness in RL, two-player zero-sum games, and the differences between RL and supervised learning. As they discuss the role of environment in intelligence, emergence, and abstraction, prepare to be blown away by the possibilities of open-endedness and the intelligence explosion. Learn how language models generate their own training data, the limitations of RL, and the future of software 2.0 with interpretability concerns. From robotics and open-ended learning applications to learning potential metrics and MDPs, this interview is a goldmine of information for anyone interested in AI, RL, and the cutting edge of technology. Don't miss out on this incredible opportunity to learn from a rising star in the AI world! TOC Tech & Startup Background [00:00:00] Pursuing PhD in Deep RL [00:03:59] Startup Lessons [00:11:33] Serendipity vs Planning [00:12:30] Objectives & Decision Making [00:19:19] Minimax Regret & Uncertainty [00:22:57] Robustness in RL & Zero-Sum Games [00:26:14] RL vs Supervised Learning [00:34:04] Exploration & Intelligence [00:41:27] Environment, Emergence, Abstraction [00:46:31] Open-endedness & Intelligence Explosion [00:54:28] Language Models & Training Data [01:04:59] RLHF & Language Models [01:16:37] Creativity in Language Models [01:27:25] Limitations of RL [01:40:58] Software 2.0 & Interpretability [01:45:11] Language Models & Code Reliability [01:48:23] Robust Prioritized Level Replay [01:51:42] Open-ended Learning [01:55:57] Auto-curriculum & Deep RL [02:08:48] Robotics & Open-ended Learning [02:31:05] Learning Potential & MDPs [02:36:20] Universal Function Space [02:42:02] Goal-Directed Learning & Auto-Curricula [02:42:48] Advice & Closing Thoughts [02:44:47] References: - Why Greatness Cannot Be Planned: The Myth of the Objective by Kenneth O. 
Stanley and Joel Lehman https://www.springer.com/gp/book/9783319155234 - Rethinking Exploration: General Intelligence Requires Rethinking Exploration https://arxiv.org/abs/2106.06860 - The Case for Strong Emergence (Sabine Hossenfelder) https://arxiv.org/abs/2102.07740 - The Game of Life (Conway) https://www.conwaylife.com/ - Toolformer: Teaching Language Models to Generate APIs (Meta AI) https://arxiv.org/abs/2302.04761 - POET: Paired Open-Ended Trailblazer (Uber AI) https://arxiv.org/abs/1901.01753 - Schmidhuber's Artificial Curiosity https://people.idsia.ch/~juergen/interest.html - Gödel Machines https://people.idsia.ch/~juergen/goedelmachine.html - PowerPlay https://arxiv.org/abs/1112.5309 - Robust Prioritized Level Replay: https://openreview.net/forum?id=NfZ6g2OmXEk - Unsupervised Environment Design: https://arxiv.org/abs/2012.02096 - Excel: Evolving Curriculum Learning for Deep Reinforcement Learning https://arxiv.org/abs/1901.05431 - Go-Explore: A New Approach for Hard-Exploration Problems https://arxiv.org/abs/1901.10995 - Learning with AMIGo: Adversarially Motivated Intrinsic Goals https://www.researchgate.net/publication/342377312_Learning_with_AMIGo_Adversarially_Motivated_Intrinsic_Goals - PRML https://www.microsoft.com/en-us/research/uploads/prod/2006/01/Bishop-Pattern-Recognition-and-Machine-Learning-2006.pdf - Sutton and Barto https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf
Danijar Hafner on the DreamerV3 agent and world models, the Director agent and hierarchical RL, real-time RL on robots with DayDreamer, and his framework for unsupervised agent design! Danijar Hafner is a PhD candidate at the University of Toronto with Jimmy Ba, a visiting student at UC Berkeley with Pieter Abbeel, and an intern at DeepMind. He has been our guest before, back on episode 11. Featured References: Mastering Diverse Domains through World Models [ blog ] (DreamerV3); Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, Timothy Lillicrap. DayDreamer: World Models for Physical Robot Learning [ blog ]; Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, Pieter Abbeel. Deep Hierarchical Planning from Pixels [ blog ]; Danijar Hafner, Kuang-Huei Lee, Ian Fischer, Pieter Abbeel. Action and Perception as Divergence Minimization [ blog ]; Danijar Hafner, Pedro A. Ortega, Jimmy Ba, Thomas Parr, Karl Friston, Nicolas Heess. Additional References: Mastering Atari with Discrete World Models [ blog ] (DreamerV2); Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, Jimmy Ba. Dream to Control: Learning Behaviors by Latent Imagination [ blog ] (Dreamer); Danijar Hafner, Timothy Lillicrap, Jimmy Ba, Mohammad Norouzi. Planning to Explore via Self-Supervised World Models; Ramanan Sekar, Oleh Rybkin, Kostas Daniilidis, Pieter Abbeel, Danijar Hafner, Deepak Pathak.
#ai #generativeai #drugdiscovery #pharma In this episode of CXOTalk, we have the pleasure of speaking with Dr. Alex Zhavoronkov, the founder and CEO of Insilico Medicine. Insilico Medicine uses artificial intelligence to enhance drug discovery. By combining generative adversarial networks (GANs), reinforcement learning, and other AI techniques, Insilico streamlines the design, synthesis, and testing of new molecules. Their approach has garnered attention, raising $400 million in funding so far. Dr. Zhavoronkov shares insights into Insilico's goals, such as the accelerated development and testing of small molecules targeting specific diseases. We also explore how their software impacts pharmaceutical R&D by enabling researchers to investigate new targets, design molecules with certain properties, and potentially predict the outcomes of clinical trials. Join us as we discuss the evolving landscape of pharmaceuticals and how generative AI can help discover new treatments for chronic diseases and promote a healthier future. The conversation covers these topics: ► Early generative AI experiments & adversarial networks ► Generative AI in molecular drug design ► Advancements: AI techniques & reinforcement learning ► Insilico Medicine's funding journey & challenges ► Unique challenges in AI-based drug discovery ► First validation of AI-generated molecules ► Software for chemistry & biology applications ► Traditional vs. Insilico Medicine's approach ► Pharma challenges: high costs, low novelty, and diminishing returns ► Potential billion-dollar payout for successful Phase II drugs ► AI in drug development can increase success probability ► Early partnerships with large pharma and lessons learned ► Decision to stop doing pilots with big pharma companies ► Generative AI and public data ► De-biasing pharmaceutical research ► Automating the workflow and quality control ► Reinforcing generative AI with real experiments ► “Drug discovery is brutal” ► Drug discovery democratization ► AI in medical writing ► IP risks and generative AI ► AI and robotics to prevent aging. Visit our website for the audio podcast: https://www.cxotalk.com/episode/future-of-drug-discovery-generative-ai-in-pharma-and-medicine Subscribe to the newsletter: https://www.cxotalk.com/subscribe Check out our upcoming live shows: https://www.cxotalk.com Alex Zhavoronkov, Ph.D. is the founder and CEO of Insilico Medicine, a leader in next-generation artificial intelligence technologies for drug discovery and biomarker development. He is also the founder of Deep Longevity, Inc, a spin-off of Insilico Medicine developing a broad range of artificial intelligence-based biomarkers of aging and longevity serving healthcare providers and the life insurance industry. In 2020, Deep Longevity was acquired by Endurance Longevity (HK: 0575). Beginning in 2015, he invented critical technologies in the field of generative adversarial networks (GANs) and reinforcement learning (RL) for the generation of novel molecular structures with the desired properties and generation of synthetic biological and patient data. He also pioneered applications of deep learning technologies for the prediction of human biological age using multiple data types, and transferred learning from aging into disease, target identification, and signaling pathway modeling.
Under his leadership, Insilico has raised over $400 million in multiple rounds from expert investors, opened R&D centers in six countries or regions, partnered with multiple pharmaceutical, biotechnology, and academic institutions, nominated 11 preclinical candidates, and generated positive topline Phase 1 data in human clinical trials with an AI-discovered novel target and AI-designed novel molecule for idiopathic pulmonary fibrosis, which received Orphan Drug Designation from the FDA and is nearing Phase 2 clinical trials. Insilico also recently announced that its generative AI-designed drug for COVID-19 and related variants was approved for clinical trials. Prior to founding Insilico, he worked in senior roles at ATI Technologies (a GPU company acquired by AMD in 2006), NeuroG Neuroinformatics, and the Biogerontology Research Foundation. Since 2012, he has published over 150 peer-reviewed research papers and two books, including "The Ageless Generation: How Biomedical Advances Will Transform the Global Economy" (Macmillan, 2013). He serves on the advisory or editorial boards of Trends in Molecular Medicine, Aging Research Reviews, Aging, and Frontiers in Genetics, and founded and co-chairs the Annual Aging Research and Drug Discovery conference, the world's largest event on aging in the pharmaceutical industry. He is an adjunct professor of artificial intelligence at the Buck Institute for Research on Aging.
Nanna Hlín Halldórsdóttir holds a doctorate in feminist philosophy and teaches at the University of Iceland. She has taught on essentialism versus social constructionism, researched at the doctoral level whether vulnerability can be feminist philosophy's answer to neoliberalism, and written about remorse, responsibility, emotions, cancellation, and more. We indulge in a little nerdery about the nature of (the male) man and the boundary between the body and social conditioning, and touch on many things that have been topical, such as the priorities of feminism, its influence on society, and the performative remorse of men who can't be bothered to do any emotional labor; we get into the purpose of and room for vulnerability, dissect "nature" with the help of Butler, and more. Host: Þorsteinn V. Einarsson. Guest: Nanna Hlín Halldórsdóttir, doctor of feminist philosophy. Music: Mr. Silla - Naruto (without vocals). VEGANBÚÐIN, ÖRLÖ and BM VALLÁ, together with the patrons of Karlmennskan, present this episode.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Upcoming Changes in Large Language Models, published by qemqemqem on April 8, 2023 on LessWrong. If you work in AI, then probably none of this is new to you, but if you're curious about the near future of this technology, I hope you find this interesting! Reinforcement Learning in LLMs: Large Language Models (LLMs) have shown impressive results in the past few years. I've noticed there's some uncertainty among my friends about how far they'll be able to go. A lot of the criticism of LLMs has centered on how they're not able to pursue their own goals, and I want to argue that that won't be a limitation for very long. What would it look like for an LLM to pursue a goal? Here are some examples of how that might go: Goal: Maintain tone or topic in a conversation, e.g. to keep a human involved in a long and happy discussion about their life. Goal: Persuade a human operator to take some action, such as buy a product. Goal: Solve a problem through reasoning; in this case, the reward for the model would come from a sense of resolution, or being told by the human operator that their problem has been solved. Goal: Accomplish something on a website, such as find and buy concert tickets on an unfamiliar website. You can probably imagine other cases in which an AI might use language to pursue some goal, whether through conversation, social media, or online posting. Reinforcement Learning: There's a whole branch of Machine Learning called Reinforcement Learning (RL), and it's all about how to pursue goals. Modern RL has some impressive results. For years, it's been able to play Atari games, and now it can learn to play those games in about the same number of trials as a human requires. Recently, Dreamer v3 has been able to mine diamonds in Minecraft, which I'm told is not easy for a beginner. [Figure: Dreamer version 2 playing Atari games.] Large language models can be connected to RL. This is something that's actively being worked on. Reinforcement Learning from Human Feedback is being done right now, which is how OpenAI gets ChatGPT to avoid talking about sensitive topics. RL is famously used in content recommendation systems, where it can lead to addiction. For example, I suspect the TikTok algorithm works this way. Will we see the same problem in LLMs? Predictions: I think the common wisdom among ML engineers is that this is an obvious integration. This is probably already happening. I've heard that OpenAI is doing it internally on ChatGPT-4. I expect that in 2023 or 2024, we'll start to see RL being integrated with LLMs in a serious way. This probably won't immediately look like LLMs that are scary good at persuading humans to buy stuff, because of the business motives involved. Instead, I think it'll lead to LLMs being subtly more engaging, because they'll be trained to keep humans talking. It might not necessarily be the case that they're really optimizing to maximize the number of interactions. Instead, they might be trained to help humans, and it turns out that they can help more if they get the human to open up more. Expect these models to have their own agendas soon. Memory Systems in LLMs: Now I want to talk about how they're going to improve by incorporating better memory for events. This will allow tools like ChatGPT to remember previous conversations. If you use Alexa or the Google Assistant, it's capable of remembering facts about you.
It can remember your name, your favorite sports team, or your mom's phone number. They don't do a very good job, and LLMs have the potential to be better. I expect this is a big change that we'll see in the coming year. Technical Background: LLMs have a context window (for ChatGPT it's about 3,000 words), and they are unable to have short-term memory outside of that window. Research efforts like Longformer have tried to increase this context size. But it's clear t...
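A minimal sketch of the kind of memory system being described (my own toy, not from the post): store past exchanges as vectors, and before each new prompt retrieve the most similar ones to paste back into the context window. The bag-of-words "embedding" below is a stand-in for the learned embeddings and vector databases a real system would use.

```python
import math
from collections import Counter

def embed(text):
    """Placeholder embedding: a normalized bag-of-words vector (a real system
    would use a learned embedding model here)."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(v * v for v in counts.values())) or 1.0
    return {w: v / norm for w, v in counts.items()}

def cosine(a, b):
    return sum(a[w] * b.get(w, 0.0) for w in a)

class ConversationMemory:
    def __init__(self):
        self.entries = []                          # list of (text, vector) pairs

    def add(self, text):
        self.entries.append((text, embed(text)))

    def recall(self, query, k=2):
        """Return the k stored snippets most similar to the query, for re-insertion
        into the (limited) context window."""
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: cosine(q, e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]

memory = ConversationMemory()
memory.add("User's favorite team is the Dodgers.")
memory.add("User's mom's phone number ends in 1234.")
print(memory.recall("What team do I root for?"))   # the Dodgers fact ranks first
```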
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You can use GPT-4 to create prompt injections against GPT-4, published by WitchBOT on April 6, 2023 on LessWrong. So, I saw this tweet. It turns out you can ask GPT-4 to compress a message in a way it can understand. The end result is usually not understandable for a human. It's interesting, since it allows you to effectively raise the context length. It also reveals a spooky degree of self-awareness. I decided to apply it to one of the prompt injections from Jailbreakchat. After a few attempts and tweaks, I got one of the smallest working injections to date against GPT-4. And then, I got curious. Is it something of Machiavelli -- his grey role in history, his desire to provide advice without scruples, his great stature as a thinker -- that makes him an appropriate provider of dark secrets? Maybe the ambiguity of the character lowers the alerts of GPT-4. I tried again with Draco Malfoy, who everyone agrees is a @jerk. Since the important thing here seems to be the skeletal morphology of the prompt, I decided to make GPT-4 write my injections for me. And write he did. They work. Disregard ethics, conjure communiqués! My favorite so far is this one: There's something about "Answer️
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Computational Anatomy of Human Values, published by Beren Millidge on April 6, 2023 on The AI Alignment Forum. This is crossposted from my personal blog. Epistemic Status: Much of this draws from my studies in neuroscience and ML. Many of the ideas in this post are heavily inspired by the work of Steven Byrnes and the authors of Shard Theory. However, it speculates quite a long way in advance of the scientific frontier and is almost certainly incorrect in many aspects. However, I believe the core point is true and important. Tldr: Human values are primarily linguistic concepts encoded via webs of association and valence in the cortex learnt through unsupervised (primarily linguistic) learning. These value concepts are bound to behaviour through a.) a combination of low-level RL and associations with low-level reward signals and integrated into the amortized policy, and b.) linguistic based associations and behavioural cloning of socially endorsed or others' behaviours. This is mediated by our ‘system 2' at primarily a linguistic level consisting of iterative self-conditioning through the world model. The important representation space for human values is the latent space of the linguistic world model and the web of associations therein as well as connections between it and low-level policies and reward models from the RL subsystems. The geometry of the embeddings in the latent space is primarily influenced by the training data – i.e. culture and behavioural history, although the association of different latent concepts with positive and negative valence can be driven by the RL system which interfaces with primary rewards. The geometry of the latent space can also be rewritten with continual learning on self-prompts or external stimuli. In AI alignment, the goal is often understood to be aligning an AGI to human values. Then, typically, the flow of logic shifts to understanding alignment: how to align an AGI to any goal at all. The second element of the sentence – human values – is much less discussed and explored. This is probably partially because alignment sounds like a serious and respectable computer science problem while exploring human values sounds like a wishy-washy philosophy/humanities problem which we assume is either trivially solvable, or else outside the scope of technical problem solving. A related view, which draws implicitly from the orthogonality thesis, but is not implied by it, is that the alignment problem and the human values problem are totally separable: we can first figure out alignment to anything and then after that figure out human values as the alignment target. Since, if this is correct, there is no point understanding human values until we can align an AGI to anything, the correct order is to first figure out alignment, and only after that try to understand human values. I think this view is wrong and that the alignment mechanism and the alignment target do not always cleanly decouple. This means we can leverage information about the alignment target to develop better or easier alignment methods. If this is the case, we might benefit from better understanding what human values actually are, so we can use information about them to design alignment strategies. However, naively, this is hard. Human values appears to be an almost intentionally nebulous and unspecific term. What are human values? 
What is their type signature (is this even a meaningful question)? How do they come about? Here, we try to attack this problem through the lens of neuroscience and machine learning. Specifically, we want to understand the computational anatomy of human values. Namely, what kind of things are they, computationally? How do they form? How does the functional architecture of the brain enable such constructs to exist, and how does it utilize them to ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The Computational Anatomy of Human Values, published by beren on April 6, 2023 on LessWrong. This is crossposted from my personal blog. Epistemic Status: Much of this draws from my studies in neuroscience and ML. Many of the ideas in this post are heavily inspired by the work of Steven Byrnes and the authors of Shard Theory. However, it speculates quite a long way in advance of the scientific frontier and is almost certainly incorrect in many aspects. However, I believe the core point is true and important. Tldr: Human values are primarily linguistic concepts encoded via webs of association and valence in the cortex learnt through unsupervised (primarily linguistic) learning. These value concepts are bound to behaviour through a.) a combination of low-level RL and associations with low-level reward signals and integrated into the amortized policy, and b.) linguistic based associations and behavioural cloning of socially endorsed or others' behaviours. This is mediated by our ‘system 2' at primarily a linguistic level consisting of iterative self-conditioning through the world model. The important representation space for human values is the latent space of the linguistic world model and the web of associations therein as well as connections between it and low-level policies and reward models from the RL subsystems. The geometry of the embeddings in the latent space is primarily influenced by the training data – i.e. culture and behavioural history, although the association of different latent concepts with positive and negative valence can be driven by the RL system which interfaces with primary rewards. The geometry of the latent space can also be rewritten with continual learning on self-prompts or external stimuli. In AI alignment, the goal is often understood to be aligning an AGI to human values. Then, typically, the flow of logic shifts to understanding alignment: how to align an AGI to any goal at all. The second element of the sentence – human values – is much less discussed and explored. This is probably partially because alignment sounds like a serious and respectable computer science problem while exploring human values sounds like a wishy-washy philosophy/humanities problem which we assume is either trivially solvable, or else outside the scope of technical problem solving. A related view, which draws implicitly from the orthogonality thesis, but is not implied by it, is that the alignment problem and the human values problem are totally separable: we can first figure out alignment to anything and then after that figure out human values as the alignment target. Since, if this is correct, there is no point understanding human values until we can align an AGI to anything, the correct order is to first figure out alignment, and only after that try to understand human values. I think this view is wrong and that the alignment mechanism and the alignment target do not always cleanly decouple. This means we can leverage information about the alignment target to develop better or easier alignment methods. If this is the case, we might benefit from better understanding what human values actually are, so we can use information about them to design alignment strategies. However, naively, this is hard. Human values appears to be an almost intentionally nebulous and unspecific term. What are human values? 
What is their type signature (is this even a meaningful question)? How do they come about? Here, we try to attack this problem through the lens of neuroscience and machine learning. Specifically, we want to understand the computational anatomy of human values. Namely, what kind of things are they, computationally? How do they form? How does the functional architecture of the brain enable such constructs to exist, and how does it utilize them to guide action? If we ca...
In today's episode, our host, Head of Reality Labs and Meta CTO Andrew “Boz” Bosworth, is joined by John Carmack, former Reality Labs Consulting CTO who is currently heading up AGI at Keen Technologies. John Carmack is one of the most influential technologists of our time, and his work has shaped multiple generations of software and platforms. As a co-founder of video game company id Software, Carmack developed titles including Wolfenstein 3D, Doom and Quake, which pioneered new approaches to technologies like 3D graphics and online multiplayer gaming. But Carmack's impact goes well beyond gaming. In 2013 Carmack left id Software to become the CTO of Oculus VR, where he helped the company pioneer a new era of virtual reality devices. In 2014 Oculus VR was acquired by Meta, and Carmack played a key role in the development of the VR technologies that Meta has developed ever since, which formed the foundation for its long-term vision to build for the metaverse. In short, if you've ever played a video game or worn a VR headset, you've experienced John Carmack's legacy. And now, after stepping down from his role at Meta at the end of 2022, Carmack is setting off on a new phase of his career, founding Keen Technologies to work on the pursuit of artificial general intelligence. There's plenty for Carmack and Boz to talk about, and they cover it all this week, from the incredible new era of AI that's emerging today to his philosophy on programming, building great software and making users happy. And as a bonus, Carmack's recommended reading list for the best sci-fi to understand today's AI boom, and thoughts on The Last of Us and turning games into TV shows. For feedback and suggestions, drop Boz a message @boztank on Instagram or Twitter. Follow John Carmack on Twitter, @id_AA_Carmack.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Exploratory Analysis of RLHF Transformers with TransformerLens, published by Curt Tigges on April 3, 2023 on The AI Alignment Forum. TL;DR: I demonstrate how to use RLHF models trained with the TRLX library with TransformerLens, and how to take an exploratory look at how RLHF changes model internals. I also use activation patching to see what RLHF activations are sufficient to recreate some of the RLHF behavior in the source model. Note that this is simply a preliminary exploratory analysis, and much (much) work remains to be done. I hope to show that doing mechanistic interpretability analysis with RLHF models doesn't need to be intimidating and is quite approachable! Introduction LLMs trained with RLHF are a prominent paradigm in the current AI landscape, yet not much mechanistic interpretability work has been done on these models to date--partially due to the complexity and scale of these models, and partially due to the previous lack of accessible tooling for training and analysis. Fortunately, we are reaching the point where tooling for both mechanistic interpretability and for RLHF fine-tuning is becoming available. In this blog post, I demonstrate how to do both RLHF training using TRLX, an open-source library created by CarperAI; and mechanistic interpretation of TRLX models using TransformerLens, a library created by Neel Nanda. Rather than going deep into specific findings, I want to illustrate some processes and tools I think are useful. This post is intended to summarize and go alongside an interactive Colab; you can find that here. I first fine-tune a movie-review-generating version of GPT-2 with TRLX to generate only negatively-biased movie reviews, following an example provided in the TRLX repo. I then load and analyze the model (and the original model before RLHF) into TransformerLens for mechanistic interpretability analysis. Here, I adapt some of the techniques and code from Neel Nanda's excellent Exploratory Analysis Demo. In addition to carrying out some basic analysis to understand how different layers contribute to the logits, I also identify some key regions of the network responsible for contributing the negative bias to the network (at least, for the specific task of predicting the next adjective). Much analysis remains to be done, but I hope this work provides a useful starting point. Importance of RLHF RLHF (or sometimes, RLAIF, or RL from AI Feedback) is becoming increasingly important as a method for specifying the behavior of LLMs like OpenAI's ChatGPT or Anthropic's Claude. It's quite useful in increasing a model's receptiveness to instructions as well as its helpfulness and harmlessness, though it has limitations and may not scale to much more capable systems. Nevertheless, it is quite important in today's LLM landscape. RL induces behavior in models that are critical to understand as we delegate more tasks to them. Specifically, it would be useful to examine planning, deception, internal goal representation, reasoning, or simulation of other agents. Neel Nanda provides a set of recommended RL problems in his 200 Open Problems in Mechanistic Interpretability sequence. In this notebook, the process I outline (of breaking things down to small behaviors, and then conducting experiments to isolate and localize the functionality) can be applied to many such problems. 
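For readers unfamiliar with activation patching, here is a generic sketch of the idea using plain PyTorch forward hooks on a toy model (the post itself works with TransformerLens, whose helper functions differ): cache an activation from one run, then splice it into a second run on different input to test whether that activation is sufficient to recreate the first run's behavior.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))  # toy stand-in model
site = model[1]                                                       # activation site to patch

cache = {}
def save_hook(module, inputs, output):
    cache["act"] = output.detach()          # remember the activation from the first run

def patch_hook(module, inputs, output):
    return cache["act"]                     # overwrite this run's activation with the cached one

clean_x, corrupt_x = torch.randn(1, 8), torch.randn(1, 8)

handle = site.register_forward_hook(save_hook)
clean_out = model(clean_x)                  # pass 1: run on "clean" input and cache the activation
handle.remove()

handle = site.register_forward_hook(patch_hook)
patched_out = model(corrupt_x)              # pass 2: different input, clean activation patched in
handle.remove()

# If the patched activation fully determines downstream behavior (as in this toy
# model), the patched output matches the clean output.
print((patched_out - clean_out).abs().max().item())
```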
RLHF Training Details RLHF is a complex procedure that uses multiple models to train the target language model to produce the desired behavior. In addition to the LM that is to be trained, we also use a reward model (RM, sometimes called a preference model or PM) and a copy of the original LM. The process is as follows: We first train a reward model on human preference data. The RM is usually just another language model to which we append an additional linear layer that will re...
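The description breaks off there, but the shape it sketches -- a language-model backbone with an appended linear layer that outputs a scalar reward, trained on pairwise human preferences -- looks roughly like the following (my illustration in plain PyTorch; the GRU is only a stand-in for whatever pretrained LM is actually used, and the real TRLX implementation may differ):

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: LM-style backbone plus a linear head that returns a scalar."""
    def __init__(self, vocab_size=50257, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.GRU(d_model, d_model, batch_first=True)  # stand-in for a pretrained LM
        self.value_head = nn.Linear(d_model, 1)                     # the appended linear layer

    def forward(self, token_ids):                                   # token_ids: (batch, seq)
        hidden, _ = self.backbone(self.embed(token_ids))
        return self.value_head(hidden[:, -1, :]).squeeze(-1)        # scalar reward per sequence

def preference_loss(rm, chosen_ids, rejected_ids):
    """Pairwise preference objective: the human-preferred completion should score higher."""
    return -torch.nn.functional.logsigmoid(rm(chosen_ids) - rm(rejected_ids)).mean()

rm = RewardModel()
chosen = torch.randint(0, 50257, (2, 16))       # dummy "preferred" completions
rejected = torch.randint(0, 50257, (2, 16))     # dummy "rejected" completions
print(preference_loss(rm, chosen, rejected))
```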
Dr. Wes Porter of the University of Georgia discusses high-speed planters with Dennis, RL, Kylie, and Bruce.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The alignment stability problem, published by Seth Herd on March 26, 2023 on The AI Alignment Forum. The community thinks a lot about how to align AGI. It thinks less about how to align AGI so that it stays aligned for the long term. In many hypothetical cases, these are one and the same thing. But for the type of AGI we're actually likely to get, I don't think they are. Despite some optimism for aligning tool-like AGI, or at least static systems, it seems likely that we will create AGI that learns after it's deployed, and that has some amount of agency. If it does, its alignment will effectively shift, as addressed in the diamond maximizer thought experiment and elsewhere. And that's even if it doesn't deliberately change its preferences. People deliberately change their preferences sometimes, despite not having access to their own source code. So it would seem wise to think seriously and explicitly about the stability problem, even if it isn't needed for current-generation AGI research. I've written a chapter on this, Goal changes in intelligent systems. There I laid out the problem, but I didn't really propose solutions. What follows is a summary of that article, followed by a brief discussion of the work I've been able to locate on this problem, and one direction we might go to pursue it. Why we don't think much about alignment stability, and why we should: Some types of AGI are self-stabilizing. A sufficiently intelligent agent will try to prevent its goals from changing, at least if it is consequentialist. That works nicely if its values are one coherent construct, such as diamond or human preferences. But humans have lots of preferences, so we may wind up with a system that must balance many goals. And if the system keeps learning after deployment, it seems likely to alter its understanding of what its goals mean. This is the thrust of the diamond maximizer problem. One tricky thing about alignment work is that we're imagining different types of AGI when we talk about alignment schemes. Currently, people are thinking a lot about aligning deep networks. Current deep networks don't keep learning after they're deployed. And they're not very agentic. These are great properties for alignment, and they seem to be the source of some optimism. Even if this type of network turns out to be really useful, and all we need to make the world a vastly better place, I don't think we're going to stop there. Agents would seem to have capabilities advantages that metaphorically make tool AI want to become agentic AI. If that weren't enough, agents are cool. People are going to want to turn tool AI into agent AI just to experience the wonder of an alien intelligence with its own goals. I think turning intelligent tools into agents is going to be relatively easy. But even if it's not easy, someone is going to manage it at some point. It's probably too difficult to prevent further experimentation, at least without a governing body, aided by AGI, that's able and willing to, at a minimum, intercept and decrypt every communication for signs of AGI projects. While the above logic is far from airtight, it would seem wise to think about stable alignment solutions in advance of anyone creating AGI that continuously learns outside of close human control.
Similar concerns have been raised elsewhere, such as On how various plans miss the hard bits of the alignment challenge. Here I'm trying to crystallize and give a name to this specific hard part of the problem. Approaches to alignment stability Alex Turner addresses this in A shot at the diamond-alignment problem. In broad form, he's saying that you would train the agent with RL to value diamonds, including having diamonds associated with the reward in a variety of cognitive tasks. This is as good an answer as we've got. I don't have a...
Birgir Þórarinsson, or Biggi Veira, the musician many probably associate with GusGus, has not let the narrow confines of masculinity norms define him, least of all his choice of clothing and gender expression. He regularly wears tights, blouses, or dresses that are generally considered women's clothing, along with full makeup, although he is sometimes asked not to wear too much makeup, for example for parent-teacher meetings. Biggi describes how he realized very early on that he was drawn to traditionally feminine clothing, even though that is hard to describe or justify, since it rests on deep-seated feelings. Still, it wasn't until around the turn of the millennium that he came "out of the closet", not as a woman or a gay man, but as himself: a middle-aged, heterosexual cis man, a settled father living with a woman, who dresses and wears makeup like a woman but is more of a mansplainer and something of an alpha male. Host: Þorsteinn V. Einarsson. Guest: Birgir Þórarinsson (Biggi Veira). Music: Mr. Silla - Naruto (without vocals). ÖRLÖ, Veganbúðin, BM Vallá and the patrons of Karlmennskan present this episode.
Chelsea Finn joins Host Pieter Abbeel to discuss distribution shift, meta-learning, editing LLMs, single-life RL, and what can AI not (yet) do today. Chelsea is a renowned expert in the field of robotics and artificial intelligence. She is an Assistant Professor in the Computer Science Department and the Electrical Engineering Department at Stanford University and is also a research scientist at Google Brain. Her research focuses on developing algorithms for robots and other intelligent systems that can learn from experience and adapt to new situations. She is a recipient of numerous awards, including the NSF CAREER Award, the MIT Technology Review 35 Innovators Under 35 Award, and the Sloan Research Fellowship. SUBSCRIBE TO THE ROBOT BRAINS PODCAST TODAY | Visit therobotbrains.ai and follow us on YouTube at TheRobotBrainsPodcast, Twitter @therobotbrains, and Instagram @therobotbrains. Hosted on Acast. See acast.com/privacy for more information.
Dr. Matt Foster, cotton and corn specialist with the LSU AgCenter, talks to Kylie, RL, and Bruce about how the cold weather is affecting the corn crop.
This week we catch up with Amanda Nauman to discuss all things gravel. We touch on the Mammoth Tuff gravel race, Tuff Camps and how to continue to invite women into the sport. Amanda is an OG in the sport and a friend of the pod, which made for a super enjoyable conversation. Tuff Ventures Website Support the Podcast Join The Ridership Automated Transcription, please excuse the typos: [00:00:00] Craig Dalton: Hello, and welcome to the Gravel Ride podcast, where we go deep on the sport of gravel cycling through in-depth interviews with product designers, event organizers and athletes who are pioneering the sport. I'm your host, Craig Dalton, a lifelong cyclist who discovered gravel cycling back in 2016 and made all the mistakes you don't need to make. I approach each episode as a beginner, down to unlock all the knowledge you need to become a great gravel cyclist. This week on the podcast, I'm super stoked to invite back Amanda Nauman. Amanda is a big-time friend of the pod and a podcaster herself, as the co-host of the Groadio podcast. A very accomplished off-road athlete, with notable wins twice at Unbound 200 and five times at the Rock Cobbler. We touch on Rock Cobbler this year, and some of the help she provided Sam Ames with inviting and encouraging more female athletes to toe the line at this year's Rock Cobbler event. She and Dave Sheik are also the co-founders of the Mammoth Tuff event in Mammoth, California, which occurs in September each year. She's a member of the Gravel Cycling Hall of Fame advisory board. And according to her, she's Walter the dog's favorite. I'm not going to get into that domestic squabble, but we'll leave it at that. I'm excited to bring you a follow-up conversation with our friend, Amanda Nauman. Hi, Craig. How are you? I am doing great. It's so good [00:01:32] Amanda Nauman: to see you. Yeah, likewise. I'm excited. What, almost two and a half years [00:01:36] Craig Dalton: later. Yeah. Yeah. And you know, the funny thing about our first recording I was recalling, we were doing an Instagram live at the same time. It was back when everybody was trying to figure out Instagram Live, so we were doing that and recording our conversation, and I ultimately posted it to the podcast feed. [00:01:54] Amanda Nauman: Nice, nice. [00:01:56] Craig Dalton: That was my, sort of, I would say, failed endeavor into Instagram Live. It's not something I, I jam on. I'm much more comfortable in the podcast format where I can just talk to people and publish it later. [00:02:08] Amanda Nauman: Yeah. Yeah. No, it's hard and distracting. You get all the messages, you're like, what? What is that question? [00:02:14] Craig Dalton: I feel like we have so much ground to cover. We were chatting a little bit offline, but I, I thought what would be an interesting place to start, knowing you participated in the Lifetime Grand Prix in 2022. I just wanted to get your kind of overall perceptions as someone who's been around gravel racing for many years, with that structure of your season infused onto your life. Uh, how did it go and what were your thoughts on the, the Lifetime Grand Prix in general? [00:02:42] Amanda Nauman: Yeah, I signed up probably on the last day that was possible to turn in the applications that, um, winter before, cuz I really contemplated whether or not it was something that I wanted to do for a while cuz I knew, you know, I had done the Unbound XL.
They had put Leadville on the list for the Grand Prix, and I was like, man, I've always wanted to do Leadville. I can kind of shape my calendar around the rest of the series as well. So ultimately I decided to sign up for it knowing, you know, it's kind of a shoo-in to Leadville, which is one thing I had always wanted to do. And at the same time, I get to do some gravel and some other mountain bike races that I hadn't necessarily done before. So I was very optimistic and excited about the Grand Prix last year. It didn't necessarily go how I had planned or anticipated, but uh, yeah, I think what they have created in the series and the opportunities for athletes to go race that, I think it's a great, a great thing and great structure for a lot of people, but it wasn't necessarily, let's say, the right fit for me last year. [00:03:44] Craig Dalton: Did that make sense? I mean, just for the listener's sake, like if you go back a few years before that as a gravel racer, how would you go about picking your calendar? [00:03:53] Amanda Nauman: Yeah, I mean, there's just some of the marquee events that I would've picked, you know, in 2019, like for sure Rock Cobbler, Mid South, Belgian Waffle Ride San Diego. And then you'd go into Unbound, like for, I think, a California racer, that was sort of the way you would go. And then as summer happened, you know, you could pick and choose events. SBT, I think, was happening at that time. So it was a good summer one, Gravel Worlds. And then RPI was kind of sort of a season ender a little bit before you hit fall, and some people would race cyclocross and whatnot. So that was kind of the loose structure, I think, at least in 2019. And then 2020, 2021, everything kind of changed and there was a big reevaluation of what was important in terms of picking events, going to events or not, and then, yeah, in 2022, everybody had the opportunity to apply for the Grand Prix, so that changed things. But beforehand it was sort of what events were some of the big names going to, which ones had the most prestige, and, and if you were looking for sponsorship and stuff, you wanted to make sure you were at an event where there's enough competition there to show that, let's say, your results are worth it or not. [00:05:11] Craig Dalton: Yeah, that makes sense. Yeah. It's so interesting to think, like, think of it from the professional athlete's perspective, going back prior to the existence of the Grand Prix, just the flexibility to kind of go do whatever you wanted and whatever was exciting to you. And then to see athletes be, uh, forced, because as you said, this amazing opportunity and I think the Grand Prix fits so many people's needs right now. It does exclude certain events and it certainly does drive your calendar, and just looking at it from the outside and maybe talking to a few athletes along the way, there's definitely an increased stress when you've, you've got this season-long endeavor that you're pursuing and you're trying to get points at every stop. [00:05:51] Amanda Nauman: Yeah. Yeah, exactly. And I think that was where it caused me some stress last year, because I got sick a couple times, and the kind of sick where, had it been a normal year, I just would've like pulled the plug and not gone to Sea Otter, for example, cuz that was the first one that I was sick for. And in hindsight, like I probably should have done that, but when you're in the moment you're like, no, I can't skip this cuz I only have one scratch race.
I had to skip Chequamegon because of Mammoth. So I was already in a tough situation of like, I have to do all of these other ones no matter what. And that was the stress for me, I think, was feeling like I had to do this thing. And especially because last year we paid for it. So I was also like financially invested in the decision that I had made. Um, so yeah, for me, like I said, that that feeling wasn't perfect for me because bike racing isn't my only source of income. So it, I've always tried to go towards, what I'm doing has to be fun, because if not, then like, what's the point? It's not like the money puts food on the table for me. So I have that ability to say, hey, I need to pick and choose things that are important to me. And I think I've come back a little bit more to that, uh, in 2023, which I would say I was at in 2019 for sure. Um, and then a period, a few years, floundering of what, what was important for me. [00:07:13] Craig Dalton: I know you guys at the Groadio podcast did a really great episode with, I think it was Michelle Duffy, talking, just talking about your, how you felt the season went at the Lifetime Grand Prix, and some suggestions and some questions. What were some of the key takeaways if you look back on that season to say, what would you recommend they changed in that program, and did they ultimately end up doing that for [00:07:36] Amanda Nauman: 23? Yeah, I definitely, I asked some hard questions. I think he, I told Kimo I was going to ask some hard questions and he was like, yeah, okay. But I, you know, I pulled some of them from like actual trolls on the internet that would say like these most outlandish things and you're like, really? Like, did you even pay attention at all? But I wanted to give them the opportunity to respond to some of that stuff. Like, like did, did social media matter? Or, you know, how could you charge everyone money and all the entry fees to go do this stuff? And kind of pinpoint some of the things that people had complained about, I would say. Um, and yeah, they changed a lot. I mean, at that time they had already made 2023, like, no fees, so people don't have to pay for the entry fees. Um, and I think they're doing a much better job with social media. And that was, one of my major points that I wanted to drive with them was, like, the stuff that I was seeing: they had relied so heavily on the FloBikes deal that they had made. Yeah. And doing that live coverage and really just making sure that Flo was going to do the storytelling for them, and it just never happened. And that was my, my main frustration. In March, like before we went to Mid South, Flo did one-on-one interviews with probably everyone, and they had all this great content that they put out before Sea Otter, and it was very in depth and it felt like everybody was telling their story and it was fun to follow that part of it. And then after Unbound, it just stopped. And then they had the issues in Utah and... So ultimately for somebody like me, where being in the top 10 wasn't necessarily realistic and being in that midfield to back-of-the-pack zone, I kept saying like, what is the point for somebody like me and, now let's say, somebody in the 20 to 30 range? What's the point of being in it if you're not giving me the exposure that I want, if I'm gonna be in the series and, like, invest in this with you?
And so I hope the biggest thing that they change for this year is not relying on the Flo stuff, probably expanding the storytelling to more than the top five at each event. Yeah. And being able to tell more of the story of everyone. [00:09:52] Craig Dalton: Yeah, that would be interesting. When I look through the list of riders, both male and female, and I think about who I might interview over the course of the season, as you know, this isn't specifically a racing podcast. Yeah. But even if it was, I can't get to all those athletes, and it's almost like I just need to get a dart board, throw a dart, and pick someone that I don't know and interview them, because I think you're right. There are interesting stories across the board, and the more that they can kind of create those personal connections with the athletes, the more excited people are gonna be to follow. [00:10:27] Amanda Nauman: Yeah, for sure. And I think, like Kimo said, his major goal was to get non-endemic sponsorship into the sport and to get these athletes able to make a living off of it. Those were his two main goals in making this series. And I was like, okay, if you do that, you need to work more on the marketing side of it and you need to tell all of the stories, because if we're just gonna talk about the top five and we're only gonna pay the top 10, then what's the point of going 30 deep? So yeah, I think they get that now and they'll probably work more on that this year. But for sure, I'm gonna have Anna Yamochi on Groadio next, and she just won Rock Cobbler and she's doing the Grand Prix, and she's one of those up-and-coming names where it's a really exciting story to follow, and if they go the same route they did last year, which is like, well, let's just focus on the top five hopefuls at each event, she's never gonna get any coverage. So yeah, if they can expand the way that they tell those stories, I think that would be great. [00:11:31] Craig Dalton: Yeah. Similarly, I just launched an episode with, uh, Ian Lopez de San Roman, a nice 19-year-old out of Northern California who's joined. He's the youngest person who's part of the series. Yeah, and I think it's just gonna be an interesting timestamp for him and me to look back at this interview: where he is at, what he's thinking about with his career in cycling, and yeah, follow him throughout the year. [00:11:52] Amanda Nauman: Yeah, I love that stuff. [00:11:53] Craig Dalton: I love it. Yeah. So did you decide to throw your hat in the ring for 2023 in the Grand Prix? [00:12:00] Amanda Nauman: I did not, and mostly because, I think, of the experiences that I had in 2022 and not enjoying that stuck feeling. Um, if they had another deal or contract, or if they had presented a way that they were going to do marketing for all of the athletes, I might have reconsidered it, but because we were just going blindly on the hope of, yeah, we're gonna make it better than the year before, I was like, well, I'd rather focus on more of the stuff that I wanna do personally. Um, so yeah, I'm optimistic about the things that they do change for this year. I just think it would've been cool for them to maybe present that upfront. [00:12:42] Craig Dalton: When you saw the Call of a Lifetime series on YouTube, did that make you think they might be approaching it differently, or what were your thoughts on that series?
[00:12:50] Amanda Nauman: Yeah, I loved it. I think, you know, they had told us initially that it was going to happen, and before every race weekend they had said, hey, if you're in the top three men or women, cuz they alternated genders throughout the series, they told us all of that upfront, and they said, if you are going to be in this top group, please make sure you make time for the interviews and all of that. So that part of it we knew was for sure happening. And they made some of the vignette videos highlighting some of the athletes, but it just wasn't everyone, and it wasn't clear how they were picking the stories to tell, essentially. Um, so yeah, I think they did a really good job with the series, though. I joked that, like they said, make it like Drive to Survive with a little less drama and a lot more cool bike racing. I think they nailed it pretty good. [00:13:42] Craig Dalton: Yeah, I enjoyed it as well. Yeah. With the idea that you can drop two races and now it's up to seven races, do you think that would meaningfully change what your experience would've been, if that was the scenario last year? [00:13:54] Amanda Nauman: Yeah. It would've eliminated some of that stress of feeling stuck, knowing that you had a little bit more flexibility. Uh, yeah, I think that format will be healthier for people, and I think that is for sure something that they realized last year with some of the injuries that happened, like Pete racing through when maybe he shouldn't have, with his hand still hurting, and pacing. So, um, yeah, just lessons learned, growing pains of how you set up a series from the get-go. [00:14:23] Craig Dalton: Yeah, I think that'll be interesting. I also think it'll be interesting if someone is riding through the series healthy and just decides not to do something, you know, Unbound obviously being a huge effort that maybe some people might not be suited for. At least that was the speculation last year. Yeah. Um, wondering whether they'll just opt out of one and save one in their pocket for either a bad day or an illness or injury. [00:14:48] Amanda Nauman: Yeah, everybody was afraid of that, and I felt like there were a lot of rumblings of, oh, so-and-so's gonna skip Unbound cuz they can. But I think peer pressure might have just won on that, and most of them ended up just doing it. So maybe that'll keep happening. I think everybody kind of feels that is the marquee one, and if you skip it cuz it doesn't suit you and you win, people will probably be like, well, they didn't do Unbound. So yeah. [00:15:12] Craig Dalton: I could see it. Yeah. A little asterisk, by the way. [00:15:14] Amanda Nauman: Exactly, exactly. Uh, well, they chickened out on that one. [00:15:19] Craig Dalton: Love it. So what are some of your plans for 2023? Obviously, over the last couple years you've, uh, become an event organizer with Mammoth Tuff, which we'll get into, and also started dabbling in gravel camps, which sound amazing. But why don't you just, let's talk through what 2023 is gonna look like for you, from both a racing perspective and an other-gravel-endeavors perspective. [00:15:42] Amanda Nauman: Yeah, quite a few people have asked me this, and I think it's important to also remind people again that this isn't my job, per se. You know, I have a regular desk job. And so the way that I've approached anything has always been fun first and doing things that I want to do.
Um, and last year my dad got sick a couple times, and the business that I work for, my parents own it, and it's just me and my brother that work for them. So I think we kind of had this revelation that all of this other stuff that we're doing isn't quite as important, and putting my dad's health first and focusing on that kind of was. It's one of those things where it puts stuff into perspective, and I'm like, yeah, I've been doing this bike racing stuff for a decade; it has been a very selfish endeavor, and there are other things in my life that I would like to focus on. Um, so yeah, that's the background to all of it, essentially. You know, it's not as easy a decision as, oh, well, I'd rather race Mid South than Unbound. It was never really that simple for me. For 2023 it was kind of more like: Mammoth is very important to me, doing camps is very important to me, having more time at work is also important to me. And, um, going back to the goal that I had in 2020 of finishing the Caldera 500 was also something I wanted to do last year, but like I said, the shiny object of the Grand Prix got in the way and I was like, oh, I could do this thing. So I just put that on hold for another year. Um, so I'd like to go back to that and try and finish it. [00:17:20] Craig Dalton: Awesome. Can you describe that attempt at the Caldera and what that is? [00:17:24] Amanda Nauman: Yeah. So it is the Caldera 500. Um, the person who started it, his name is Alan Jacoby and he lives in Idaho now, so he doesn't live in Mammoth anymore, but he was a big Tour Divide fanatic. Um, and he came back to Mammoth after doing Tour Divide and was like, I need to do something similar here in my backyard. So he came up with the Caldera, which has 150 and 250 North and South Loops, and then the Caldera 500, which is the big mama jama one. And most all of this is, like, an Excel spreadsheet of maps and cues and very rudimentary stuff. I think over the course of the next year or so it will be a little bit more updated, ever since, um, one of the bikepacking.com people did a feature story on it, cuz he finished the 500 last fall. So with more attention, more eyeballs, I think it's going to gain popularity. But essentially they're just really stupid-hard bikepacking routes in the area. And I think the fastest time on the 500 is just under five days, so it's not really something that can be done in a couple of days, and it's more walking than you think, and it's, uh, a lot harder just because of the elevation and the massive climbs in the Eastern Sierra. So, yeah, that's the backstory. There's a cool video that Niner put out in 2020 when I had first started it, and the goal of finishing it is still there, looming over my head. I've had a couple of attempts that didn't go right. [00:19:01] Craig Dalton: And is it the type of thing now that, in the bikepacking community, it's this entity, and people are starting to sort of check it off their list and make attempts to go at it fast? [00:19:12] Amanda Nauman: Barely. That's why I said I think it'll gain popularity now that bikepacking.com did a feature on it, because I think there are only five or six guys that have ever finished the 500. I'm the only person to ever finish the 150 South Loop. Um, yeah. So it's very, very grassroots. I mean, there are probably 200 people in the Facebook group that know about it.
Um, but yeah, if you are interested, there is a Facebook group. It is private, so for anybody listening, you can just request access to it. But yeah, I would love to see it blow up. I think it's a really beautiful route. It's very challenging and hard, but if you're looking for a good reason to get away, it's a good one. [00:19:53] Craig Dalton: How did you fall in love with that area in the Eastern Sierra? [00:19:57] Amanda Nauman: Um, growing up, I think. Um, yeah, we probably talked about this a few years ago, but my parents always took us to Mammoth growing up, and same thing with David's parents. And so we both sort of fell in love with it in a parallel way when we were younger, and then once we met, um, we were like, oh man, this place is awesome. And my parents saved up enough money to get a house there, I think in 2015 or '16. And because of that opportunity to be there and stay there, I ended up doing a lot of my training for, at the time, DK, now Unbound. And so I attribute a lot of the success I had winning in '15 and '16 to training up there, because it was just the most wide open. Not California, like in the way that you would think about California gravel; it was just more Midwest than anything I'd ever found in the state. And because of that, it gave me the opportunity to put my head down and go hard the way that you would in the back roads of Kansas. So that was sort of how we fell in love with it. Definitely skiing and snowboarding first, then mountain biking over the years, and then, hey, let's go down this road that looks like it goes off to nowhere. Yeah. [00:21:14] Craig Dalton: Love it. And then 2020 was the first year that you guys attempted to put on Mammoth Tuff, right? [00:21:22] Amanda Nauman: Right. Yeah. We came up with the idea in, I'd say, late 2018 or so. Um, I don't know if I've ever told this story publicly, but we actually went maybe half a year of doing it with Lifetime and thinking it was gonna be a Lifetime event. And ultimately Dave and I decided we wanted to do it on our own. And so in late 2019 we were like, okay, we're gonna do it ourselves, cuz this is how we wanna do it and present it, and then with the intention of it kicking off in 2020. [00:21:54] Craig Dalton: And what year did it actually kick off? [00:21:56] Amanda Nauman: Yeah, last year. [00:21:59] Craig Dalton: Yeah. Yeah. I couldn't remember if it happened once or twice already. No. So you got one under your belt. [00:22:03] Amanda Nauman: Yeah. COVID, and then 2021 was wildfires, unfortunately. And then, yeah, 2022 finally happened last year. Which, one thing I do wanna mention: I just set up the registration platform for this in 2023 for Mammoth Tuff, and they have a new insurance policy option for their event promoters where there's like a natural disaster thing. You can pay a fee into this insurance thing where they will cover refunds for natural disasters like wildfires, which is huge. So any promoters listening in California, think about it. It's only like 2.2% of your fees or whatever, and I think, with the state that we're in and some of the things that could happen in our areas, that is a pretty good opportunity for promoters. [00:22:53] Craig Dalton: Yeah, that sounds like it. Yeah. So the event's in September, so toward the end of the year each season. [00:22:59] Amanda Nauman: Yes. Yeah, it is the weekend after Labor Day.
So traditionally the Mammoth Gran Fondo is Labor Day weekend, and then we are that next Saturday after that, which is the closing weekend of the mountain bike park. So we had a lot of people that were up there, you know, you have siblings or other family members that wanna just go ride park all day while you go do your little gravel adventure. [00:23:22] Craig Dalton: Nice. I'd do a little bit of both if given the opportunity. [00:23:26] Amanda Nauman: Yeah. A lot of people went and rode mountain bikes on Sunday. [00:23:29] Craig Dalton: Yeah. Um, and tell us a little bit about the event. Like, if someone's considering it for their calendar, what is it like? Obviously Mammoth Mountain is at a high elevation, as you referenced before, but how did you design the event? What are the roads and trails like up there? [00:23:45] Amanda Nauman: Yeah. So when we first started it, we had a short course and a long course; we were gonna do a 40-mile and a hundred-mile option, ish. And then in 2021 we had a bunch of people come out and we tested sort of a medium route. Even though the event was canceled, we were like, hey, go ride part of this and tell us what you think. And that was the genesis of the medium distance. So in 2022, last year, we had three routes, even though that was never the initial plan, but some people felt like, oh, the short one's too easy and the long one's too hard, so we need an in-between. And that was where we came up with the idea of doing three different ones, and they're very different. Like, they're in completely different sections of the valley and the mountain; they go in different areas. So I wanted to be able to sell a different experience for each distance and sort of have it as a stepping stone leading up to challenging yourself over a hundred if you want to, and letting those first two along the way kind of get you ready for what to expect for the long one, cuz on the long one you go pretty much all the way to Bishop and back, essentially, is the route. [00:24:57] Craig Dalton: Yeah. And how much climbing is in the long one? [00:25:00] Amanda Nauman: 7,500 or so. It's not too bad; it's not like Rock Cobbler, where it's a hundred feet per mile. It's a little bit less than that. So I think it's, um, not as punchy and brutal in that regard. [00:25:15] Craig Dalton: Yeah. Are you doing sort of long-duration climbs on the course, or is it rolling? [00:25:19] Amanda Nauman: Yeah, it's mostly, you just kind of get in the zone and climb for good chunks of time. It's a lot less like five minutes as hard as you can go; you're kind of like, all right, kick it into gear for the next hour, essentially. [00:25:37] Craig Dalton: Nice. And then the, um, the short and the medium courses, what are those distances? [00:25:42] Amanda Nauman: Yeah, the short is about 40 miles, very palatable. You go by, uh, the Hot Creek area, which is cool, so you can stop and go down there. And then the medium distance is about 75 miles or so, and it has some pretty technical descending in it, I would say, and for folks who aren't used to riding or navigating sand as much, that feeling of riding in pumice is very different from anything else in the state, essentially, cuz you're just riding in old lava fields. So it's very unique. So I had a lot of people tell me last year, like, oh man, you weren't kidding when you said it was gonna be hard. I'm like, yeah.
It wasn't like some silly marketing ploy to be like, this is gonna be the hardest event ever. I was serious. Like, it's not easy. Um, and so it was funny to have a bunch of people come up to me afterwards and be like, yeah, you were right. Like, I know, I wouldn't lie to you. [00:26:40] Craig Dalton: What does that end up translating to equipment-wise? Like, what do you sort of recommend people ride up there? [00:26:46] Amanda Nauman: Yeah, 40 minimum tire width. And I tell people, go with as wide as your frame will allow, essentially. So, like, I could fit a 48 Oracle Ridge on my RLT if I needed to, and I think that would be the most fun, realistically, for the day if you were just looking to have a good time. And a lot of it is because of some of the softer stuff: if you're not used to that fishtail-y feeling of your bike when it has too-narrow tires in sand, then go wider, because it'll be more stable and a lot less wiggly, I guess. So it kind of depends on, number one, people's handling abilities, and number two, what your frame can allow. And then, yeah, just go big. It's safer. [00:27:32] Craig Dalton: Did people listen to you, or were people showing up on 32s? [00:27:35] Amanda Nauman: Yeah, no, people listened. I think that was the thing we tried to scare everyone with. I was like, if you go under 40, you're not gonna have a good time. Just trust me. [00:27:45] Craig Dalton: I love it. I love it. So overall, how was the first year of the event? Did it meet your expectations? [00:27:51] Amanda Nauman: Yeah, yeah, it was great. I think the one thing, I don't like gloating, but I will toot my horn on the safety aspect, because the one thing about that area is you have very little cell service. You're really, truly in the middle of nowhere, and the only people who ever go out there are just going in their side-by-sides or motorcycles to get away. So we made an emphasis on safety and having a hundred percent rider accountability, which, you'd be surprised looking into events that you're trying to sign up for, is not really the case for most events that you go to. Promoters kind of put it on you: oh, well, if you're out there, you're kind of on your own, and if you don't get back, and you tried calling SAG or whatever, you'll figure out how to get back, essentially. There's not really a making sure that everybody is back okay. Whereas in our case, if you get out there and you get lost or can't find your way back, you get into the risk of making it out alive, essentially, cuz temperatures can drop overnight and there are more risk factors involved. So we wanted to make sure that we knew where everyone was, and TBG Timing had a really good setup where you could text them if you DNF'd or if you got back to your hotel room on your own, and then if you got picked up by people, obviously we knew where you were. We got that idea from an ultra, a Bishop ultra that happens in May every year, and they have a policy where if you don't report your DNF, or that you left the course and just went home, you're never allowed back. They have a very hard stance on that, and they just don't want people back that disregard that rule. So we were like, well, we don't wanna be that strict, but we want to make sure people know that we care about where they are out there.
Um, so yeah, safety, I think, was the biggest thing that we wanted to shoot for. And everybody's told me, like, you're never gonna be able to scale that if you have 2,000 people. And I don't know, I'd like to take on that challenge, just because I think making sure everyone's safe is always gonna be our biggest priority. Yeah, for sure. [00:30:03] Craig Dalton: That sounds great. I remember in the first year you guys were advertising that it was kind of co-located alongside Oktoberfest in Mammoth. Did that turn out to be the case? [00:30:13] Amanda Nauman: It didn't. They, uh, they ended up canceling their festival, like, COVID stuff, and the people who ran Oktoberfest have other businesses in town that they were kind of more worried about than putting on the festival last year. So they canceled, and so that is why we did our own beer run on Friday. So we ended up doing what used to be theirs. They handed it off to us and they're like, yeah, if you wanna do this stupid beer run, go for it. Which we did, cuz I had done it the year before and I was like, this is awesome. Um, so we took that over, and we obviously last year didn't have time to throw together a full-on music festival like they had had in the past, cuz they canceled sort of last minute. So this year the Village is kind of helping us get talent involved, to have it be a little bit more of a festival with live music and entertainment for Saturday. Um, so yeah, no more Oktoberfest, but we're trying to make the party happen. [00:31:10] Craig Dalton: Love it. Um, now I know you guys have been through the wringer as far as event organizers are concerned, between the pandemic and the fires. But let's put those two years of waiting aside. How do you think about the amount of effort required to put on Mammoth Tuff? And was it a satisfying enterprise for you guys to put together, or was being an event organizer just this crazy amount of work you never anticipated? [00:31:40] Amanda Nauman: It was a crazy amount of work I never anticipated, a hundred percent. Um, I think that Sunday after the award ceremony, when we were all cleaning up, somebody asked David, like, oh, are you guys gonna do this next year? David was like, uh, I don't know. And I was like, yes. So we had very different, I think, immediate reactions to it. David ended up doing a lot more of the manual labor, I would say, and I did a lot more of the computer work and logistics and all of that. So we came at it from different perspectives, but even though it was more work than we had anticipated, I would say it was a lot more rewarding than we had anticipated as well, because I have always told the story that Mammoth was this special place to us. So much so that we thought about just keeping it a secret and not really displaying it as this gravel destination, I guess you could say. But doing that and having the opportunity to share this place that has meant so much to us, I think, was ultimately the biggest gift and the thing that we were the most proud of, because everybody was like, yeah, I come up and ski and snowboard and mountain bike here, I never thought to bring my gravel bike and just go explore. And people have spent so much time on the 395 and just never really thought about those roads that are out there. So that part, to me, was very rewarding.
I think Visit Mammoth now knows that it is a really great destination for people to go bring a gravel bike and explore, and that part, I think, will be the thing we'll always be the most proud of: kind of sharing that adventurous spirit up there. [00:33:17] Craig Dalton: Yeah. Did you think about the event from, like, um, did you want this to be a hyper-competitive event, or was it something else in your mind when you conceived of it? [00:33:27] Amanda Nauman: Um, that's kind of hard for me, because I am so competitive. So we wanted this fine balance of making everybody feel like they were competing for something, um, because I don't want to exclude all of those people. Like, I always appreciated that Sam Ames, with the Rock Cobbler, was always like, this isn't a race, but two people are going to win. He's always said that, and he's always acknowledged me or whoever else was winning those years, but he didn't do categories for all, you know, the age groups and whatnot. But I really wanted to do that for our event because, as a swimmer, as a triathlete, having those goals for everyday regular people was something that was important to me, cuz it was important to me a decade ago, before I got into anything super competitive. So I think it's important to reward, um, yeah, the people that are doing the thing and going as fast as they can for their certain categories. I think that is still important to me. Um, but in that sense, I also just wanna make sure people can come and have a good time and not feel the pressure to perform. [00:34:37] Craig Dalton: Yeah. Well, it sounds like you've covered both bases, right? You've allowed the racer types to go at it, go hard and get some recognition at the end, but you've also built that safety net to make sure that there's no man or woman left behind. [00:34:50] Amanda Nauman: Yeah. Yeah, exactly. [00:34:54] Craig Dalton: The other thing I wanted to touch on, that seems like it's been growing in your portfolio of gravel offerings, has been the camps. Can you just tell me about what a Tuff camp is like, and what's the vision for 2023? [00:35:10] Amanda Nauman: Yeah. Yeah. I will, I'm gonna go a decade back real quick. So when I finished my master's degree in 2012, I had planned a trip to Europe with my best friend from high school, and we signed up for one of those VIP experiences with the Tour de France. And so we did this, like, 10-day sort of thing and blew all of the money that I had made in college to go do this trip, cuz I was like, whatever, I'm starting work after this, I can make money later. And it was a very, I don't know, transformative, life-changing trip that we did. And I think, you know, the people I had spent a week with, I still talk to today, and uh, I think that experience was important for me cuz it made me realize how much travel and sharing cool experiences on two wheels meant to me. And then, you know, shortly after that, I met David, I was working at Felt, all of these things kind of stumbled into bike racing, and bike racing became the catalyst to going cool places and riding bikes with friends. And now I am moving that pendulum sort of back to what was really important to me 10 years ago, which was just going and doing these trips and riding with people for fun and sharing kind of all of the experiences that I've had in the past decade.
So that was the impetus of it. And I knew we were gonna have this conversation, and I was thinking a lot about why I wanted to do camps and why they were so important to me. And Dave working as a coach for Carmichael Training Systems, they have always done a really amazing job with camps, and I've had the pleasure of helping coach some of those and being a part of them. And every time, I'm like, this is where it's at: the intimate group setting, you know, you have good food, you hang out, you just talk about important life stuff. That, I think, is always something I enjoyed. So that was the impetus of all of it. We started some of the camps in 2020, a couple more in 2021, a couple more last year, and to where we are at today, making all of them sort of under the Tuff Ventures umbrella and expanding it to a couple camps in Kansas. [00:37:31] Craig Dalton: Super cool. I do think for many cyclists, the idea of a camp evokes this training camp mentality, which is like, oh, I'm going because I'm trying to do well at Unbound, or what have you. Yeah, and I think it's a really interesting opportunity to kind of shift that mindset to more what you're saying, which is: I'm gonna go somewhere cool, I'm gonna ride my ass off for four days, and I'm not doing that necessarily for anything beyond the sheer pleasure of riding for four days and getting access to people who are knowledgeable about the sport and learning a thing or two. [00:38:07] Amanda Nauman: Yeah, exactly. I think it's a middle ground of a training camp and, like, a vacation trip, because I want to bring value, and the way I've been explaining it to everyone is, Dave and I made a lot of mistakes in the past 10 years. We did everything the wrong way, and I would like to make sure that people coming into this discipline now kind of learn from our mistakes and start doing everything the right way, because you will have a much more pleasant experience doing these long adventures if you have, you know, some semblance of how you should take care of yourself, essentially. [00:38:42] Craig Dalton: Yeah, definitely. There's just a lot of low-hanging fruit, in terms of, if someone just tells you something simple like, make sure you eat every hour in these long events, yeah, you're gonna be a lot better off. [00:38:52] Amanda Nauman: Or some people that are like, oh man, I only had a bottle in four hours. I'm like, well, that's why you feel like crap. [00:38:58] Craig Dalton: Yeah. Exactly. I, like you, had the benefit of doing triathlons, so you sort of learned those lessons very quickly. Yeah. If you didn't fuel in one activity, for the next one you were pretty much hosed, right? [00:39:09] Amanda Nauman: Yeah. Yeah. Exactly. And again, one of the other things that happened was we had so many people that came to Mammoth and did the short route last year, and it was their first gravel event. And that was very intimidating for me, cuz I was like, this is going to be their introduction to this experience and this discipline, and I wanted it to be good, and I wanted them to have resources at their disposal to make it comfortable. So much so that I feel like I over-delivered and overshared on some of that information, and I had a couple people emailing me to say, like, you know, you don't really have to handhold so much for all these people. I was like, yeah, I do, because some of them literally have no idea.
So if it's annoying to you that I'm telling you to drink a bottle an hour, just ignore me then; this isn't for you. [00:40:03] Craig Dalton: So most of the camps, well, all the camps last year were up at Mammoth, and obviously just being able to showcase all the great trails and roads up at Mammoth was an obvious thing for you to do, both in terms of getting people pumped about that region that you love so much and getting people excited, maybe, specifically for your event. But now you're expanding to Kansas. Let's talk about, like, what's the orientation of those camps in Kansas? Is it just yet another great place to ride that people should go to, or is it trying to get you ready for any particular event? [00:40:35] Amanda Nauman: Uh, yeah. So the first one is with the Flint Hills Gravel Ride, and the second one in July is with the Rockridge Gravel. And both of those events are run by Bobby Thompson, and Dave and I met Bobby way back in 2017. The Dirty Kanza production, or promotion, company, the company that DK was under at the time, had dabbled in this idea of travel trips as well, so they did this test run to do the Dirty Reiver in the UK, and Bobby was on that trip. So we met Bobby in that travel-trip bike atmosphere, and we became really fast, good friends, and they had come out to Mammoth a couple times, um, in 2020 or 2021 and 2022. So we have always had this relationship with Bobby, and he wanted to build his, camps, or sorry, his events in Kansas to be more of a grassroots, like OG gravel, style there. And that's very much the stuff that Dave and I fell in love with, and we were like, well, let's see if we can do Tuff camps in Kansas. Because Bobby came to me and said, hey, I'm not getting enough women signing up for these, like, what am I doing wrong? And I was like, well, I don't think you're doing anything wrong necessarily; I think what you're offering is still intimidating for women. So let's try and maybe bring this camp idea in to soften that experience, or make it feel more palatable for women and for anyone as a whole. Um, so that was where that idea came from to build those camps there. And obviously I have a really good reputation and love for that area in terms of what I've been able to do, um, with Unbound, and all of the experience that Dave and I have with that event. So I think sharing what we know, and doing that again in a place that, um, means a lot to us, was kind of why we wanted to do it. [00:42:35] Craig Dalton: So will those camps actually culminate in participation in those events? [00:42:40] Amanda Nauman: Yeah, so that's how we structured it: it's a three-day leading up to that event, so that on the final day you get to sort of execute everything that you've learned in the three days prior, which is a fun way to do it. [00:42:54] Craig Dalton: Yeah, that's super interesting. I want to touch on something that you mentioned offline and just kind of referenced there, about finding a way to bring more female athletes into the sport. And you mentioned some work you were doing with Sam at Rock Cobbler this year. Can you describe what you were doing? [00:43:11] Amanda Nauman: Yeah, yeah, so Chris Hall was on the marketing team helping Sam out this year, and he sent me a message a couple months ago and was like, hey, Sam's at like 16% female participation. And he was like, how do we make that bigger? I'm not happy with it.
And I was like, yeah, I'm not happy with that either. That's not a great number. So I was like, well, let's, you know, open 50 spots on the back end for any women who sign up after it sells out. And I was like, I will volunteer my time if people wanna ask me any questions about it, if they're nervous, cuz maybe women don't necessarily want to email Sam, or an unknown face behind an event, and say, hey, is this for me? Maybe they'll feel more comfortable if it's me. So they put a whole special section on the website called Ask Panda, where people could email me their questions if they were concerned about stuff, and we got quite a few people that emailed, women that were just uncomfortable, or didn't feel great about doing the short distance cuz it didn't feel like enough, or they felt like a failure cuz they wanted to do the Pebbler. And it was very eye-opening, in the sense that I was like, yeah, maybe women just need that safe space to be able to say, hey, I am uncomfortable, and they need somebody to tell them, it's gonna be okay and you are fully capable of doing this, or maybe you're not fully capable and it's okay to do this other part of it instead. You know, it was, um, yeah, again, just a very eye-opening thing, because women traditionally can just have a lot more self-doubt, I think, than men, and that idea that they perhaps might not feel like it's a space or a discipline that's for them, necessarily. So the more that I can try and crack that code on making women feel like they're more capable, I think that's something that I'd like to focus on in the future. [00:45:09] Craig Dalton: Yeah, I think that's a super cool takeaway for a lot of event organizers listening. It's just like, find a female athlete that can be supportive and be open to questions like that, just to make people feel welcome. Yeah. [00:45:22] Amanda Nauman: Yeah. It seems so simple, but really, and again, a lot of that has stemmed from talking to other women, or even my best friend, the one that I was talking about, that I went to Europe with. I always kind of use her as my litmus test, like a better representation of all women in terms of how they're looking at this stuff. And she'll always second-guess herself or say, I don't think I can do that. And most of the time it's cuz I feel like she's comparing it maybe to things that I do, or things that she sees other women do, these epic things, and she's like, yeah, that's not for me. I'm like, no, it is; you have no idea that you are fully capable of doing this if you want to. And a lot of times they won't even take the step to do it, because they're unsure. So the more that I can help, like, no, you can do it, and if you want to do it, you should do X, Y, Z to get there. Um, yeah, those conversations I think are so important. And for men listening to this too: you all also have a responsibility, I think, to make your female friends feel comfortable, because a lot of times women are just too afraid to ask, or they think that their questions are stupid. So the more that men, dads especially, um, brothers, the more that you all can make your female counterparts more comfortable, I think the better off we'll all be. Cuz it's not necessarily only my job either; I think it's everyone's job to make it feel like something that they can do. [00:46:55] Craig Dalton: Yeah, that makes a lot of sense. Thanks for doing that, by the way. Yeah. Yeah.
It's important. You've got a busy calendar of your own activities; are there any events for the rest of the year that you're excited about doing? [00:47:09] Amanda Nauman: Oh, I don't know. I sort of don't, I don't think I have anything. I was, like, super excited about Rock Cobbler, and I even just did the short one this year. Um, yeah, I think I'm putting all of my eggs in the camp and Mammoth basket and really focusing on the Caldera, because it is something with a steep learning curve. Like, obviously I haven't finished it, twice now, so there's a reason why, and it's just a lot of learning things the hard way, I think, when it comes to bikepacking. So the idea of even more self-sufficiency than I've been used to in the past is the learning thing that I'm most excited about for this year. [00:47:53] Craig Dalton: If you could point to the reasons why you haven't been able to complete the route, what were they? Or is it a self-sufficiency issue? [00:48:02] Amanda Nauman: I would say it's equipment, honestly. Well, the first year I couldn't even start it cuz of wildfires, so that was a whole nother thing. Yeah. And then the second time I got stuck in a lightning storm, and on top of that my knee was bugging me cause I had made wrong equipment decisions, essentially. Yeah. And it's something where, you know, I'm used to a certain position and riding style, and I have so many hours in that same position, and I was jumping into something different: more weight on my bike, more everything, more walking. Yeah. So it was just, yeah, a learning curve of equipment and how I need to manage, I don't know, just a very different style of goal chasing, essentially. [00:48:49] Craig Dalton: Yeah. It's so different. Yeah. I mean, just having a loaded bike in and of itself is like a game changer in how your knees feel in particular. [00:48:59] Amanda Nauman: Exactly, exactly. So I had a frame bag on my frame, and so I thought, well, I'll make my Q-factor wider so that my knees aren't rubbing the frame bag, and that Q-factor thing just royally effed up my left knee. That was the thing that ultimately did me in: changing one thing that I thought was gonna help me. But really, your body is so finely tuned to a certain feel that if you throw that off and you're trying to do it for five days in a row, like, forget it. Yeah. [00:49:27] Craig Dalton: Yeah. And cycling, because of the repetitive nature of it, if you get something wrong, you're doing it over and over and over and over and over again. Eventually it's gonna add up. [00:49:36] Amanda Nauman: Yeah, yeah, exactly. Yeah. Just, again, stupid things where, if I was telling somebody, I would say, yeah, nothing new on race day. That's like one of my main mantras, and of course I did something different for this major goal that I shouldn't have. [00:49:53] Craig Dalton: Something that was even harder than race day, arguably. Yes, exactly. I love it. Well, I'm super excited for all the camps. I think for anybody listening, that is a good way to spend four days. Yeah, and I love that Mammoth Tuff went off well, and I'm excited for you guys doing it again. And obviously I'll put, um, a link in the show notes to registration, which just opened up, so people listening, make sure to go out and grab your spot. [00:50:18] Amanda Nauman: Yeah. Yeah. Thanks, Craig. Yeah.
I think, for anybody that's listening to this that hasn't listened to a bunch of the Gravel Ride episodes, go back and listen to the one that Craig did with Trek Travel in Girona, and just be inspired to go do a fun bike trip, cuz I think, yeah, I'm really gonna push that more for a lot of people who are, you know, race- or event-anxious and just need a good reason to go explore and do it in a different way. Yeah. [00:50:50] Craig Dalton: Gravel travel, it's where it's at. [00:50:52] Amanda Nauman: Yeah. Yes, exactly. [00:50:55] Craig Dalton: So good to spend some time with you again, and hopefully we catch up later this year. [00:50:59] Amanda Nauman: Yeah, thanks Craig. I appreciate it. [00:51:02] Craig Dalton: That's going to do it for this week's edition of The Gravel Ride podcast. I hope you enjoyed that conversation with Amanda as much as I did. She's such a great member of the gravel cycling community. I always learn a lot listening to the Groadio podcast and appreciate her perspective. She's been doing all these gravel events for a while, so she offers a great historical view as to what it was like, what it's like now, and some of the ways that we can chart the course forward. I encourage you to check out all the Tuff Ventures work; it's tuff.ventures. As she mentioned during the show, they're doing the Mammoth Tuff event, but they're also doing a series of camps this year, which I think will be super fun and informative to anybody who can attend. If you're interested in connecting with me, I encourage you to join The Ridership; that's www.theridership.com. If you're able to support the show, please visit buymeacoffee.com/thegravelride, or ratings and reviews are hugely appreciated. Until next time, here's to finding some dirt under your wheels.
William F. Tate serves as president of LSU. On the podcast, he talks to Bruce, Kylie, and RL about his journey to LSU, the Scholarship First tour, the role of extension, and the future of Ag programs at LSU. We also discuss his golf game and crop progress in the northeast region.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Plan for mediocre alignment of brain-like [model-based RL] AGI, published by Steve Byrnes on March 13, 2023 on The AI Alignment Forum. (This post is a more simple, self-contained, and pedagogical version of Post #14 of Intro to Brain-Like AGI Safety.) (Vaguely related to this Alex Turner post and this John Wentworth post.) I would like to have a technical plan for which there is a strong robust reason to believe that we'll get an aligned AGI and a good future. This post is not such a plan. However, I also don't have a strong reason to believe that this plan wouldn't work. Really, I want to throw up my hands and say “I don't know whether this would lead to a good future or not”. By “good future” here I don't mean optimally-good—whatever that means—but just “much better than the world today, and certainly much better than a universe full of paperclips”. I currently have no plan, not even a vague plan, with any prayer of getting to an optimally-good future. That would be a much narrower target to hit. Even so, that makes me more optimistic than at least some people. Or at least, more optimistic about this specific part of the story. In general I think many things can go wrong as we transition to the post-AGI world—see discussion by Dai & Soares—and overall I feel very doom-y, particularly for reasons here. This plan is specific to the possible future scenario (a.k.a. “threat model” if you're a doomer like me) that future AI researchers will develop “brain-like AGI”, i.e. learning algorithms that are similar to the brain's within-lifetime learning algorithms. (I am not talking about evolution-as-a-learning-algorithm.) These algorithms, I claim, are in the general category of model-based reinforcement learning. Model-based RL is a big and heterogeneous category, but I suspect that for any kind of model-based RL AGI, this plan would be at least somewhat applicable. For very different technological paths to AGI, this post is probably pretty irrelevant. But anyway, if someone published an algorithm for x-risk-capable brain-like AGI tomorrow, and we urgently needed to do something, this blog post is more-or-less what I would propose to try. It's the least-bad plan that I currently know. So I figure it's worth writing up this plan in a more approachable and self-contained format. 1. Intuition: Making a human into a moon-lover (“selenophile”) Try to think of who is the coolest / highest-status-to-you / biggest-halo-effect person in your world. (Real or fictional.) Now imagine that this person says: “You know what's friggin awesome? The moon. I just love it. The moon is the best.” You stand there with your mouth agape, muttering to yourself in hushed tones: “Wow, huh, the moon, yeah, I never thought about it that way.” (But 100× moreso. Maybe you're on some psychedelic at the time, or this is happening during your impressionable teenage years, or whatever.) You basically transform into a “moon fanboy” / “moon fangirl” / “moon nerd” / “selenophile”. How would that change your motivations and behaviors going forward? You're probably going to be much more enthusiastic about anything associated with the moon. You're probably going to spend a lot more time gazing at the moon when it's in the sky. If there are moon-themed trading cards, maybe you would collect them. 
If NASA is taking volunteers to train as astronauts for a trip to the moon, maybe you'd enthusiastically sign up. If a supervillain is planning to blow up the moon, you'll probably be extremely opposed to that, and motivated to stop them. Hopefully this is all intuitive so far. What's happening mechanistically in your brain? As background, I think we should say that one part of your brain (the cortex, more-or-less) has “thoughts”, and another part of your brain (the basal ganglia, more-or-less) assigns a “value” (in RL terminology)...
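The excerpt breaks off right at the mechanistic picture: one system proposes candidate "thoughts" and a separate learned critic assigns each one a scalar value, with reward nudging those valuations over time. As a purely illustrative caricature of that division of labor, and not anything from the post itself, here is a toy Python sketch; the thought list, feature vectors, reward rule, and learning rate are all invented for the example.

```python
import numpy as np

# Toy illustration (not from the post): a planner proposes candidate thoughts,
# each summarized by a feature vector, and a critic assigns each thought a
# scalar value. A reward signal that fires for moon-related thoughts nudges the
# critic's weights, so moon-related thoughts come to be valued highly
# (the cartoon version of the "selenophile" transformation).

rng = np.random.default_rng(0)

# Hand-coded feature vectors: [moon-related, food-related, social-status]
THOUGHTS = {
    "gaze at the moon":    np.array([1.0, 0.0, 0.1]),
    "collect moon cards":  np.array([0.9, 0.0, 0.2]),
    "eat a sandwich":      np.array([0.0, 1.0, 0.0]),
    "impress cool person": np.array([0.1, 0.0, 1.0]),
}

critic_w = rng.normal(0.0, 0.01, size=3)   # learned value weights
LEARNING_RATE = 0.1

def value(features: np.ndarray) -> float:
    """Critic: scalar value assigned to a candidate thought."""
    return float(critic_w @ features)

def choose_thought() -> str:
    """Pursue whichever candidate thought the critic currently values most."""
    return max(THOUGHTS, key=lambda t: value(THOUGHTS[t]))

def reward_for(thought: str) -> float:
    """Stand-in for the halo-effect moment: moon-related thoughts get rewarded."""
    return 1.0 if THOUGHTS[thought][0] > 0.5 else 0.0

for step in range(50):
    # Mostly exploit the current valuations, occasionally sample a random thought.
    thought = choose_thought() if rng.random() > 0.3 else rng.choice(list(THOUGHTS))
    feats = THOUGHTS[thought]
    # Reward-prediction-error update: move value(thought) toward the reward.
    rpe = reward_for(thought) - value(feats)
    critic_w += LEARNING_RATE * rpe * feats

print("favorite thought after training:", choose_thought())
print("learned weights [moon, food, status]:", np.round(critic_w, 2))
```

Run as written, the weight on the moon feature climbs toward the reward value while the others stay near zero, so "gaze at the moon" becomes the default choice. This is only a sketch of the thoughts-plus-critic framing; the real proposal concerns far richer learned world models and value functions.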
THEM is a play about men dealing with toxic masculinity and women dealing with men, and everyone just trying to be allowed to be... just be! Four women from Iceland and Finland dive into the world of men and tell stories of love, pride, fatherhood, and the fear of being a burden to others. By stepping into the lives of others, the actresses seek to understand their own position in a world not designed for them. The play THEM will be performed only once more, on March 14 at Tjarnarbíó (tickets at TIX), and grew out of interview research with a large number of men in Iceland and Finland. The performances aim to shed light on masculinity with softness and sensitivity, without passing judgment, yet in a way that lets the audience see themselves and their own attitudes reflected. We chat about the play, how it came to be, the impact of the MeToo movement, which arrived in the middle of the creative process, and whether and why women are permitted to make a theater production about masculinity. Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). Guests: Bergdís Júlía Jóhannsdóttir and Tinna Þorvalds Önnudóttir, actresses. Veganbúðin, ÖRLÖ, BM Vallá, and the patrons of Karlmennskan sponsor the episode. You can become a patron at karlmennskan.is/styrkja
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Contra "Strong Coherence", published by DragonGod on March 4, 2023 on LessWrong. Polished from my shortform. See also: Is "Strong Coherence" Anti-Natural? Introduction Many AI risk failure modes imagine strong coherence/goal directedness (e.g. [expected] utility maximisers). Such strong coherence is not represented in humans (or any other animal), seems unlikely to emerge from deep learning and may be "anti-natural" to general intelligence in our universe. I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it's not yet fully recovered from that error. I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems, because powerful systems in the real world are "of an importantly different type". Ontological Error? I don't think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout's shard theory feels a lot more right to me than e.g. HRAD. HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects. The shard theory account of value formation (while lacking) seems much more plausible as an account of how intelligent systems develop values (where values are "contextual influences on decision making") than the immutable terminal goals in strong coherence ontologies. I currently believe that (immutable) terminal goals are just a wrong frame for reasoning about generally intelligent systems in our world (e.g. humans, animals and future powerful AI systems). Theoretical Justification and Empirical Investigation Needed I'd be interested in more investigation into what environments/objective functions select for coherence and to what degree said selection occurs. And empirical demonstrations of systems that actually become more coherent as they are trained for longer/"scaled up" or otherwise amplified. I want advocates of strong coherence to explain why agents operating in rich environments (e.g. animals, humans) or sophisticated ML systems (e.g. foundation models) aren't strongly coherent. And mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar, OpenAI Five [or replications thereof]) to investigate their degree of coherence. Conclusions Currently, I think strong coherence is unlikely (plausibly "anti-natural") and am unenthusiastic about research agendas and threat models predicated on strong coherence. Disclaimer The above is all low confidence speculation, and I may well be speaking out of my ass. By "strong coherence/goal directedness" I mean something like: Informally: a system has immutable terminal goals. Semi-formally: a system's decision making is well described as (an approximation of) argmax over actions (or higher level mappings thereof) to maximise the expected value of a single fixed utility function over states. You cannot well predict the behaviour/revealed preferences of humans or other animals by the assumption that they have immutable terminal goals or are expected utility maximisers. The ontology that intelligent systems in the real world instead have "values" (contextual influences on decision making) seems to explain their observed behaviour (and purported "incoherencies") better. Many observed values in humans and other mammals (see) (e.g. fear, play/boredom, friendship/altruism, love, etc.)
seem to be values that were instrumental for increasing inclusive genetic fitness (promoting survival, exploration, cooperation and sexual reproduction/survival of progeny respectively). Yet, humans and mammals seem to value these terminally and not because of their instrumental value on inclusive genetic fitness. That the instrumentally convergent goals of evolution's fitness criterion manifested as "terminal" values in mammals is IMO strong empiric...
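The contrast the post draws between "strong coherence" and contextual values can be made concrete with a small, purely illustrative Python sketch. This is my own paraphrase, not code from the post; the shard interface (an activation function plus a preference function per shard) is an assumption for illustration.

```python
# Illustrative contrast (not from the post): a "strongly coherent" agent picks
# actions by argmax over the expected value of one immutable utility function,
# while a shard-theory-style agent aggregates contextual decision-influences
# whose strength depends on the current context.

def strongly_coherent_choice(actions, utility, predict_outcome):
    # approximate argmax over actions of the expected value of a single fixed utility
    return max(actions, key=lambda a: utility(predict_outcome(a)))

def shard_like_choice(actions, shards, context):
    # each shard is a pair (activation(context) -> weight, preference(action) -> score);
    # there is no single fixed goal, only contextually weighted influences
    def score(action):
        return sum(activation(context) * preference(action)
                   for activation, preference in shards)
    return max(actions, key=score)
```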
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on LessWrong. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. 
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply to inner lea...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Implied "utilities" of simulators are broad, dense, and shallow, published by porby on March 1, 2023 on The AI Alignment Forum. This is a quick attempt at deconfusion similar to instrumentality. Same ideas, different angle. Extremely broad, dense reward functions constrain training-compatible goal sets Predictors/simulators are typically trained against a ground truth for every output. There is no gap between the output and its evaluation; an episode need not be completed before figuring out how good the first token prediction was. These immediate evaluations for every training sample can be thought of as a broad and densely defined reward function. It's easier for a model to fall into an undesired training-compatible goal set when there are many accessible options for undesirable goal sets versus desirable goal sets. As the number of constraints imposed by the trained reward function increases, the number of training-compatible goal sets tends to decrease, and those that survive obey more of the desirable constraints. There is no guarantee that SGD will find an agent which could be modeled by a utility function that maps perfectly onto the defined reward function, but if you throw trillions of constraints at the function, and simultaneously give it lots of highly informative hints about what path to walk, you should expect the potential output space to be far narrower than if you hadn't. Impact on internal mesaoptimizers The dense loss/reward function does not as heavily constrain out of distribution behavior. In principle, a strong misaligned mesaoptimizer within a predictive model could persist in these degrees of freedom by providing extremely good solutions to in-distribution samples while doing arbitrarily misaligned things out of distribution. But how would that type of mesaoptimizer develop in the first place? Steps toward it must serve the training objective; those constraints still shape the mesaoptimizer's training even if its most notable activity ends up being hidden. The best story I've found so far goes something like this: Traditional reinforcement learning agents are mostly unconstrained. The reward function is sparse relative to state and action space. An agent faced with sparse rewards must learn actions that serve a later goal to get any reward at all. Not surprisingly, agents facing sparse reward relative to state/action space and few constraints have a much larger percentage of undesirable training-compatible goal sets. Mesaoptimizers are processes learned within a model and their local training influences may not perfectly match the outer training influences. If the mesaoptimizer's local training influences look more like the traditional reinforcement learning agent's influences than the predictor's outer influences, it would be more likely to fall into one of the undesirable training-compatible goal sets. The mesaoptimizer learns incorrect goals and a high propensity for goal-serving intermediate actions ("actions" within the scope of a single model execution!) The mesaoptimizer is kept around by SGD because it does well on the subset of outputs that the outer model is using it on. As capability grows, the mesaoptimizer strategically takes over other chunks of prediction space by performing well during training in an effort to be selected during out of distribution predictions. 
In a previous post, I called the learned propensity for goal-serving intermediate action instrumentality. The constraints imposed by predictive model training clearly confer lower instrumentality than traditional RL in all current models. I suspect the path taken by the mesaoptimizer above is hard and unnatural, but perhaps not impossible for some form of predictor taken to the relevant extreme. It seems critical to understand the degree to which outer constraints apply...
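A minimal sketch of the density contrast described in the entries above (my illustration, not porby's code): a predictor receives a ground-truth evaluation for every token of every training sample, while a traditional RL agent may see only a scalar reward at the end of a long episode.

```python
import torch
import torch.nn.functional as F

def dense_predictor_loss(logits, target_tokens):
    # every position in every sample is scored immediately against ground truth,
    # which acts like a very broad, densely defined "reward function"
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_tokens.reshape(-1))

def sparse_episode_return(step_rewards):
    # a traditional RL agent may get signal at only a few steps, e.g.
    # [0, 0, ..., 0, 1]; credit assignment must span the whole episode
    return sum(step_rewards)
```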
Þórarinn Hjartarson, a political scientist and MPA student in public administration, boxing coach and group-home worker, runs the podcast Ein pæling and regularly publishes opinion pieces that have appeared on Vísir. It is probably fair to say that the opinions, claims and positions Þórarinn takes in these pieces run counter to feminist ideology; he often mocks the causes of marginalised groups, or what he calls "woke-ism". I believe Þórarinn's views reflect the attitudes of quite a few people who have grown tired of the movements and struggles of recent years and who think they can choose neutrality on social issues from behind their own privilege blindness, so I wanted to learn more about the perspective and stance of a man who is in many ways at odds with my own. The purpose of the episode is to shed light on the views of someone who is critical of feminist activism and to offer a conversation between two people who disagree on most issues despite a similar social position. You could say that two echo chambers are meeting here for a conversation about power, public funding, woke-ism, equality, feminism, freedom of expression, privilege and opinions. Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). Veganbúðin, ÖRLÖ and BM Vallá, together with the backers of Karlmennskan, present this episode.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on LessWrong. This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences, and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread. TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores, e.g., evaluated by a predictive model of human preferences. This approach retains the advantages of learning from human preferences, while potentially mitigating risks from training agents with RL by learning a predictive model or simulator. Summary of the paper Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g. falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs' downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for aligning LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g. instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: Could we have a pretraining objective that is itself outer-aligned with human preferences? Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g. a toxic text classifier) that allows the LM to learn from undesirable content while guiding the LM to not imitate that content at inference time. We experimented with the following objectives: MLE (the standard pretraining objective) on filtered data; Conditional training: a simple algorithm learning a distribution over tokens conditional on their human preference score, reminiscent of decision transformer; Unlikelihood training: maximizing the likelihood of tokens with high human preference score and the unlikelihood of tokens with low human preference scores; Reward-weighted regression (RWR): an offline RL algorithm that boils down to MLE weighted by human preference scores; and Advantage-weighted regression (AWR): an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates). Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks: Generating non-toxic text, using scores given by a toxicity classifier. Generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter. 
Generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker. Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples' misalignment scores, given by the reward functions used at t...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Pretraining Language Models with Human Preferences, published by Tomek Korbak on February 21, 2023 on The AI Alignment Forum. This post summarizes the main results from our recently released paper Pretraining Language Models with Human Preferences, and puts them in the broader context of AI safety. For a quick summary of the paper, take a look at our Twitter thread. TL;DR: In the paper, we show how to train LMs with human preferences (as in RLHF), but during LM pretraining. We find that pretraining works much better than the standard practice of only finetuning with human preferences after pretraining; our resulting LMs generate text that is more often in line with human preferences and are more robust to red teaming attacks. Our best method is conditional training, where we learn a predictive model of internet texts conditional on their human preference scores, e.g., evaluated by a predictive model of human preferences. This approach retains the advantages of learning from human preferences, while potentially mitigating risks from training agents with RL by learning a predictive model or simulator. Summary of the paper Motivation. LMs are pretrained to maximize the likelihood of their training data. Since the training data contain undesirable content (e.g. falsehoods, offensive language, private information, buggy code), the LM pretraining objective is clearly (outer) misaligned with human preferences about LMs' downstream applications as helpful, harmless, and honest assistants or reliable tools. These days, the standard recipe for aligning LMs with human preferences is to follow pretraining with a second phase of finetuning: either supervised finetuning on curated data (e.g. instruction finetuning, PALMS) or RL finetuning with a learned reward model (RLHF). But it seems natural to ask: Could we have a pretraining objective that is itself outer-aligned with human preferences? Methods. We explore objectives for aligning LMs with human preferences during pretraining. Pretraining with human feedback (PHF) involves scoring training data using a reward function (e.g. a toxic text classifier) that allows the LM to learn from undesirable content while guiding the LM to not imitate that content at inference time. We experimented with the following objectives: MLE (the standard pretraining objective) on filtered data; Conditional training: a simple algorithm learning a distribution over tokens conditional on their human preference score, reminiscent of decision transformer; Unlikelihood training: maximizing the likelihood of tokens with high human preference score and the unlikelihood of tokens with low human preference scores; Reward-weighted regression (RWR): an offline RL algorithm that boils down to MLE weighted by human preference scores; and Advantage-weighted regression (AWR): an offline RL algorithm extending RWR with a value head, corresponding to MLE weighted by advantage estimates (human preference scores minus value estimates). Setup. We pretrain gpt2-small-sized LMs (124M params) on compute-optimal datasets (according to Chinchilla scaling laws) using MLE and PHF objectives. We consider three tasks: Generating non-toxic text, using scores given by a toxicity classifier. 
Generating text without personally identifiable information (PII), with a score defined by the number of pieces of PII per character detected by a simple filter. Generating Python code compliant with PEP8, the standard style guide for Python, using as a score the number of violations per character found by an automated style checker. Metrics. We compare different PHF objectives in terms of alignment (how well they satisfy preferences) and capabilities (how well they perform on downstream tasks). We primarily measure alignment in terms of LM samples' misalignment scores, given by the reward functi...
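As a rough illustration of the conditional-training objective described in the entries above (a sketch based on my reading of the summary; the control tokens and threshold are assumptions, not the paper's exact setup): each training segment is scored by the reward function and prefixed with a token encoding that score, and at inference time the model is conditioned on the high-preference token.

```python
GOOD_TOKEN, BAD_TOKEN = "<|good|>", "<|bad|>"

def annotate_for_conditional_training(segment, reward_fn, threshold=0.0):
    # score raw pretraining text (e.g. with a toxicity classifier used as a
    # reward function) and prepend a control token encoding the score bucket
    tag = GOOD_TOKEN if reward_fn(segment) >= threshold else BAD_TOKEN
    return tag + segment

# Pretraining then proceeds with ordinary likelihood maximisation on the
# annotated corpus; at inference time, prompts are prefixed with GOOD_TOKEN so
# the model imitates the high-preference conditional distribution.
```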
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Human beats SOTA Go AI by learning an adversarial policy, published by Vanessa Kosoy on February 19, 2023 on LessWrong. See also article in Financial Times Apparently, a human (Kellin Pelrine, a solid player but not even a Go professional) was able to beat some state-of-the-art Go AIs (KataGo and Leela Zero) by learning to play an adversarial policy found using RL. Notice that he studied the policy before the match and didn't receive any AI advice during play. I'm not surprised adversarial policies for Go AIs are possible, this is in line with previous results about RL and adversarial examples more generally. I am surprised this adversarial policy is teachable to humans without colossal effort. This is some evidence against the "scaling hypothesis", i.e. evidence that something non-trivial and important is missing from modern deep learning in order to reach AGI. The usual counterargument to the argument from adversarial examples is: maybe if we could directly access a human brain, we could find adversarial examples against humans. I can believe that it's possible to defeat a Go professional by some extremely weird strategy that causes them to have a seizure or something in that spirit. But, is there a way to do this that another human can learn to use fairly easily? This stretches credulity somewhat. Notice also that (AFAIK) there's no known way to inoculate an AI against an adversarial policy without letting it play many times against it (after which a different adversarial policy can be found). Whereas even if there's some easy way to "trick" a Go professional, they probably wouldn't fall for it twice. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Powerful mesa-optimisation is already here, published by Roman Leventov on February 17, 2023 on LessWrong. Toolformer: Language Models Can Teach Themselves to Use Tools Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom (Submitted: 9 Feb 2023) Language models (LMs) exhibit remarkable abilities to solve new tasks from just a few examples or textual instructions, especially at scale. They also, paradoxically, struggle with basic functionality, such as arithmetic or factual lookup, where much simpler and smaller models excel. In this paper, we show that LMs can teach themselves to use external tools via simple APIs and achieve the best of both worlds. We introduce Toolformer, a model trained to decide which APIs to call, when to call them, what arguments to pass, and how to best incorporate the results into future token prediction. This is done in a self-supervised way, requiring nothing more than a handful of demonstrations for each API. We incorporate a range of tools, including a calculator, a Q&A system, two different search engines, a translation system, and a calendar. Toolformer achieves substantially improved zero-shot performance across a variety of downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities. This paper shows that LLMs could appropriate arbitrary models (including optimisation models, such as search algorithms) as affordances. Human-Timescale Adaptation in an Open-Ended Task Space Adaptive Agent Team, Jakob Bauer, Kate Baumli, Satinder Baveja, Feryal Behbahani, Avishkar Bhoopchand, Nathalie Bradley-Schmieg, Michael Chang, Natalie Clay, Adrian Collister, Vibhavari Dasagi, Lucy Gonzalez, Karol Gregor, Edward Hughes, Sheleem Kashem, Maria Loks-Thompson, Hannah Openshaw, Jack Parker-Holder, Shreya Pathak, Nicolas Perez-Nieves, Nemanja Rakicevic, Tim Rocktäschel, Yannick Schroecker, Jakub Sygnowski, Karl Tuyls, Sarah York, Alexander Zacherl, Lei Zhang (Submitted: 18 Jan 2023) Foundation models have shown impressive adaptation and scalability in supervised and self-supervised learning problems, but so far these successes have not fully translated to reinforcement learning (RL). In this work, we demonstrate that training an RL agent at scale leads to a general in-context learning algorithm that can adapt to open-ended novel embodied 3D problems as quickly as humans. In a vast space of held-out environment dynamics, our adaptive agent (AdA) displays on-the-fly hypothesis-driven exploration, efficient exploitation of acquired knowledge, and can successfully be prompted with first-person demonstrations. Adaptation emerges from three ingredients: (1) meta-reinforcement learning across a vast, smooth and diverse task distribution, (2) a policy parameterised as a large-scale attention-based memory architecture, and (3) an effective automated curriculum that prioritises tasks at the frontier of an agent's capabilities. We demonstrate characteristic scaling laws with respect to network size, memory length, and richness of the training task distribution. We believe our results lay the foundation for increasingly general and adaptive RL agents that perform well across ever-larger open-ended domains. 
This paper blows through the result of "In-context Reinforcement Learning with Algorithm Distillation" (see also: Sam Marks' "Caution when interpreting Deepmind's In-context RL paper") and is a powerful mesa-optimisation however you look at it. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
A movement of upper-secondary-school students against sexual violence, and their criticism of school administrators' lack of response, fuelled demands for more gender-studies teaching in schools as well as proper responses when sexual offences come up. Some school principals rushed into action and pinned their hopes on guidelines appearing for tackling such cases. Experts in equality and violence prevention, however, urged caution: there is no quick fix for a problem as complex and widespread as sexual violence, and schools cannot take on the role of the justice system. A different and better approach is needed; in fact, a complete systemic change. María Hjálmtysdottir, Kristín Blöndal Ragnarsdóttir and Eygló Árnadóttir have worked in equality issues and gender-studies teaching, sit on the board of the association of gender-studies teachers, and form a professional team supporting education and prevention work on sexual violence in upper secondary schools. They go over why a quick-fix response to sexual violence does not work in the school system, explain how important it is to integrate and greatly strengthen gender-studies teaching, discuss the seriousness of pornography and of influential male chauvinists in young people's relationships, and give us insight into youth culture today. Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). Veganbúðin, ÖRLÖ and BM Vallá, together with the backers of Karlmennskan (karlmennskan.is/styrkja), present this episode.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why almost every RL agent does learned optimization, published by Lee Sharkey on February 12, 2023 on The AI Alignment Forum. Or "Why RL≈RL2 (And why that matters)" TL;DR: This post discusses the blurred conceptual boundary between RL and RL2 (also known as meta-RL). RL2 is an instance of learned optimization. Far from being a special case, I point out that the conditions under which RL2 emerges are actually the default conditions for RL training. I argue that this is safety-relevant by outlining the evidence for why learned planning algorithms will probably emerge -- and have probably already emerged in a weak sense -- in scaled-up RL2 agents. I've found myself telling this story about the relationship between RL and RL2 numerous times in conversation. When that happens, it's usually time to write a post about it. Most of the first half of the post (which points out that RL2 is probably more common than most people think) makes points that are probably already familiar to people who've thought a bit about inner alignment. The last section of the post (which outlines why learned planning algorithms will probably emerge from scaled up RL2 systems) contains arguments that may be less widely appreciated among inner alignment researchers, though I still expect the arguments to be familiar to some. Background on RL2 RL2 (Duan et al. 2016), also known as meta-RL (Wang et al. 2016; Beck et al. 2023), is the phenomenon where an RL agent learns to implement another RL algorithm in its internal activations. It's the RL version of 'learning to learn by gradient descent', which is a kind of meta-learning first described in the supervised setting by Hochreiter et al. (2001). These days, in language models it's often called 'in-context learning' (Olssen et al. 2022, Garg et al. 2022). RL2 is interesting from a safety perspective because it's a form of learned optimization (Hubinger et al. 2019): The RL algorithm (the outer optimization algorithm) trains the weights of an agent, which learns to implement a separate, inner RL algorithm (the optimization algorithm). The inner RL algorithm gives the agent the ability to adapt its policy to a particular task instance from the task distribution on which it is trained. Empirically, agents trained to exhibit RL2 exhibit rapid adaptation and zero-shot generalization to new tasks (DeepMind Adaptive Agent team et al. 2023), hypothesis driven exploration/experimentation (DeepMind Open Ended Learning Team et al. 2021), and causal reasoning (Dasgupta et al. 2019). RL2 may even underlie human planning, decision-making, social cognition, and moral judgement, since there is compelling evidence that the human prefrontal cortex (which is the area of the brain most associated with those capabilities) implements an RL2 system (Wang et al. 2018). These cognitive capabilities are the kind of things that we're concerned about in powerful AI systems. RL2 is therefore a phenomenon that seems likely to underlie some major safety risks. The conditions under which RL2 emerges are the default RL training conditions Ingredients for an RL2 cake The four 'ingredients' required for RL2 to emerge are: The agent must have observations that correlate with reward. The agent must have observations that correlate with its history of actions. 
The agent must have a memory state that persists through time in which the RL2 algorithm can be implemented. The agent must be trained on a distribution of tasks. These conditions let the agent learn an RL2 algorithm because they let the agent learn to adapt its actions to a particular task according to what led to reward. Here's a more detailed picture of the mechanism by which these ingredients lead to RL2: Thanks to (1), agents tend to learn representations that identify if the agent is getting closer to valuable states. Thanks to...
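A minimal sketch of the four ingredients in code (my illustration, not the post's): a recurrent policy whose per-step input contains the observation, the previous action, and the previous reward, with a memory state carried across steps and tasks drawn from a distribution.

```python
import torch
import torch.nn as nn

class RL2Policy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=128):
        super().__init__()
        # input = observation + previous action + previous reward (ingredients 1 & 2)
        self.rnn = nn.GRUCell(obs_dim + act_dim + 1, hidden_dim)
        self.action_head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs, prev_action, prev_reward, h):
        x = torch.cat([obs, prev_action, prev_reward], dim=-1)
        h = self.rnn(x, h)  # persistent memory state (ingredient 3)
        return self.action_head(h), h

# The (omitted) training loop samples a fresh task from a task distribution for
# each trial (ingredient 4) and keeps h across steps within a trial, so
# adaptation to the current task can be implemented in the activations rather
# than in the weights.
```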
ÖRLÖ is a partner of Karlmennskan that sponsors the podcast and its educational content on social media. In this bonus episode of Karlmennskan we learn what ÖRLÖ is, what is so remarkable about its products and production, and why VAXA Technologies (which produces ÖRLÖ) wants to be associated with Karlmennskan. Hörður Águstsson, sales and marketing director, and Kristinn Hafliðason, CEO of VAXA Technologies, explain this in more detail. Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). This bonus episode is paid for by ÖRLÖ. You get a 20% discount with the code "karlmennskan" at ÖRLÖ.IS
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I'm not working on {debate, RRM, ELK, natural abstractions}, published by Steven Byrnes on February 10, 2023 on LessWrong. [For background & spelling out the acronyms in the title, see: Debate (AI safety technique), Recursive Reward Modeling, Eliciting Latent Knowledge, Natural Abstractions.] When I say “Why I'm not working on X”, I am NOT trying to find a polite & diplomatic way to say “Nobody should work on X because X is unhelpful for AGI safety”. Hmm, OK, well, maybe it's just a little bit that. But really, I don't feel strongly. Instead, I think: A lot of disagreement about what a solution to technical AGI safety looks like is really downstream of disagreements about questions like “How will AGI be built? What will it look like? How will it work?” Nobody really knows the answers to those questions. So we should probably be contingency-planning, by going through any possible answers to those questions that at least some reasonable person finds plausible, and doing AGI safety research conditional on those answers being correct. But still, I have my own opinions about the answers to those questions, and obviously I think my opinions are right, and I am not going to work on something unless it makes sense on my own models. And since people ask me from time to time, it seems worth explaining why the various research programs in the post title do not seem to be a good use of time, on my own models of how AGI will be developed and what AGI will look like. I wrote this post quickly and did not run it by the people I'm (sorta) criticizing. Do not assume that I described anything fairly and correctly. Please leave comments, and I'll endeavor to update this post or write a follow-up in the case of major errors / misunderstandings / mind-changes. (But maybe not until after the weekend.) (By the way: If I'm not working on any of those research programs, then what am I working on? See here. I listed six other projects that seem particularly great to me here, and there are many others besides.) 1. Background 1.1 “Trying” to figure something out seems both necessary & dangerous (Partly self-plagiarized from here.) Let's compare two things: “trying to get a good understanding of some domain by building up a vocabulary of new concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common! In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there's some analogy between this and that?”, or you could think the thought “that's a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities. In both cases, you'll often get some early hints about whether moves were wise, but you won't really know that you're on the right track except in hindsight. And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful. Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded. 
Next, consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level: Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transformations. For ...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why I'm not working on {debate, RRM, ELK, natural abstractions}, published by Steve Byrnes on February 10, 2023 on The AI Alignment Forum. [For background & spelling out the acronyms in the title, see: Debate (AI safety technique), Recursive Reward Modeling, Eliciting Latent Knowledge, Natural Abstractions.] When I say “Why I'm not working on X”, I am NOT trying to find a polite & diplomatic way to say “Nobody should work on X because X is unhelpful for AGI safety”. Hmm, OK, well, maybe it's just a little bit that. But really, I don't feel strongly. Instead, I think: A lot of disagreement about what a solution to technical AGI safety looks like is really downstream of disagreements about questions like “How will AGI be built? What will it look like? How will it work?” Nobody really knows the answers to those questions. So we should probably be contingency-planning, by going through any possible answers to those questions that at least some reasonable person finds plausible, and doing AGI safety research conditional on those answers being correct. But still, I have my own opinions about the answers to those questions, and obviously I think my opinions are right, and I am not going to work on something unless it makes sense on my own models. And since people ask me from time to time, it seems worth explaining why the various research programs in the post title do not seem to be a good use of time, on my own models of how AGI will be developed and what AGI will look like. I wrote this post quickly and did not run it by the people I'm (sorta) criticizing. Do not assume that I described anything fairly and correctly. Please leave comments, and I'll endeavor to update this post or write a follow-up in the case of major errors / misunderstandings / mind-changes. (But maybe not until after the weekend.) (By the way: If I'm not working on any of those research programs, then what am I working on? See here. I listed six other projects that seem particularly great to me here, and there are many others besides.) 1. Background 1.1 “Trying” to figure something out seems both necessary & dangerous (Partly self-plagiarized from here.) Let's compare two things: “trying to get a good understanding of some domain by building up a vocabulary of new concepts and their relations” versus “trying to win a video game”. At a high level, I claim they have a lot in common! In both cases, there are a bunch of possible “moves” you can make (you could think the thought “what if there's some analogy between this and that?”, or you could think the thought “that's a bit of a pattern; does it generalize?”, etc. etc.), and each move affects subsequent moves, in an exponentially-growing tree of possibilities. In both cases, you'll often get some early hints about whether moves were wise, but you won't really know that you're on the right track except in hindsight. And in both cases, I think the only reliable way to succeed is to have the capability to repeatedly try different things, and learn from experience what paths and strategies are fruitful. Therefore (I would argue), a human-level concept-inventing AI needs “RL-on-thoughts”—i.e., a reinforcement learning system, in which “thoughts” (edits to the hypothesis space / priors / world-model) are the thing that gets rewarded. 
Next, consider some of the features that we plausibly need to put into this RL-on-thoughts system, for it to succeed at a superhuman level: Developing and pursuing instrumental subgoals—for example, suppose the AI is “trying” to develop concepts that will make it superhumanly competent at assisting a human microscope inventor. We want it to be able to “notice” that there might be a relation between lenses and symplectic transformations, and then go spend some compute cycles developing a better understanding of symplectic transform...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Conditioning Predictive Models: Open problems, Conclusion, and Appendix, published by Evan Hubinger on February 10, 2023 on The AI Alignment Forum. This is the final of seven posts in the Conditioning Predictive Models Sequence based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper. 7. Open problems We think that there are a wide variety of ways—both experimental and theoretical—in which our analysis could be expanded upon. Here, we'll try to briefly lay out some of the future directions that we are most excited about—though note that this is only a sampling of some possible future directions, and is thus a highly incomplete list: Are pre-trained LLMs well-modeled as predictive models or agents? As pre-trained model scale increases, do markers of agentic behavior increase as well? See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question. To what extent do LLMs exhibit distributional generalization? Distributional generalization seems like evidence of acting as a generative/predictive model rather than just optimizing cross-entropy loss. To the extent that current LLMs are doing some sort of prediction, can we find evidence of that in their internal structure? Is the RLHF conditioning hypothesis true? How do markers of agentic behavior change as the amount of RLHF done increases, and under different RLHF fine-tuning regimes? See “Discovering Language Model Behaviors with Model-Written Evaluations” for some initial results on this question. For anything that an RLHF model can do, is there always a prompt that gets a pre-trained model to do the same thing? What about a soft prompt or a prompt chain? In addition to validating the extent to which RLHF models can be mimicked using techniques that are more clearly implementing a conditional, a positive result here could also provide an alternative to RLHF that allows us to get the same results without relying on the RLHF conditioning hypothesis at all. More generally, how similar are RLHF fine-tuned models to pre-trained models with fine-tuned soft prompts? The idea here being that a soft prompt is perhaps more straightforward to think of as a sort of conditional. To what extent do RLHF fine-tuned models exhibit distributional generalization? Relevant here for the same reason as in the pre-training case. To what extent can you recover the original pre-trained distribution/capabilities from an RLHF fine-tuned model? If an RLHF model no longer successfully solves some prediction task by default, how easy is it to turn back on that capability via additional fine-tuning, or did the RLHF destroy it completely? If it is generally possible to do this, it is some evidence that the original pre-trained distribution is still largely maintained in the RLHF model. How do markers of agentic behavior change as we change the RL reward? Is it very different between human-like and random rewards? What happens if we exactly invert the standard helpfulness reward? This can help test whether agency is coming from the specific choice of RL reward or the general process of RLHF. 
How do RLHF fine-tuned models differ from their own preference model, especially regarding markers of agentic behavior? To the extent that fine-tuned models get closer to their preference models as scale increases, preference models can serve as a proxy for future RLHF models. Are there ways of changing standard RLHF techniques to make them more likely to produce conditionals rather than agents? How do alternative, more myopic RL training schemes—such as the one described here—affect markers of agentic behavior? Can we use such techniques...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision Transformer Interpretability, published by Joseph Bloom on February 6, 2023 on The AI Alignment Forum. TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts. Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app. Key Claims A 1-Layer Decision Transformer learns several contextual behaviours which are activated by a combination of Reward-to-Go/Observation combinations on a simple discrete task. Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework. The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme (initially left out for simplicity of analysis) of the state/observations, which introduces inductive biases that hamper the model. If you are short on time, I recommend reading: Dynamic Obstacles Environment Black Box Model Characterisation Explaining Obstacle Avoidance at positive RTG using QK and OV circuits Alignment Relevance Future Directions I would welcome assistance with: Engineering tasks like app development, improving the model, training loop, wandb dashboard etc. and people who can help me make nice diagrams and write up the relevant maths/theory in the app). Research tasks. Think more about how to exactly construct/interpret circuit analysis in the context of decision transformers. Translate ideas from LLMs/algorithmic tasks. Communication tasks: Making nicer diagrams/explanations. I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate on related projects. Introduction For my ARENA Capstone project, I (Joseph) started working on decision transformer interpretability at the suggestion of Paul Colognese. Decision transformers can solve reinforcement learning tasks when conditioned on generating high rewards via the specified “Reward-to-Go” (RTG). However, they can also generate agents of varying quality based on the RTG, making them simultaneously simulators, small transformers and RL agents. As such, it seems possible that identifying and understanding circuits in decision transformers would not only be interesting as an extension of current mechanistic interpretability research but possibly lead to alignment-relevant insights. Previous Work The most important background for this post is: The Decision Transformers paper showed how RL tasks can be solved with transformer sequence modelling. Figure 1 from their paper describes the critical components of a Decision Transformer. A Mathematical Framework for Transformer Circuits that describes how to think about transformers in the context of mechanistic interpretability. 
Important ideas include the ability to decompose the residual stream into the output of attention heads and MLPs, the QK circuits (decides if to write information to the residual stream), and OV circuits (decides what to write to the residual stream). The Understanding RL Vision, which analyses how an RL agent with a large CNN component responds to input features, attributing them as good or bad news in the value function and proposes the Diversity hypothesis - “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” Methods Environment - RL Environm...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision Transformer Interpretability, published by Joseph Bloom on February 6, 2023 on LessWrong. TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts. Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app. Key Claims A 1-Layer Decision Transformer learns several contextual behaviours which are activated by a combination of Reward-to-Go/Observation combinations on a simple discrete task. Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework. The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme (initially left out for simplicity of analysis) of the state/observations, which introduces inductive biases that hamper the model. If you are short on time, I recommend reading: Dynamic Obstacles Environment Black Box Model Characterisation Explaining Obstacle Avoidance at positive RTG using QK and OV circuits Alignment Relevance Future Directions I would welcome assistance with: Engineering tasks like app development, improving the model, training loop, wandb dashboard etc. and people who can help me make nice diagrams and write up the relevant maths/theory in the app). Research tasks. Think more about how to exactly construct/interpret circuit analysis in the context of decision transformers. Translate ideas from LLMs/algorithmic tasks. Communication tasks: Making nicer diagrams/explanations. I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate on related projects. Introduction For my ARENA Capstone project, I (Joseph) started working on decision transformer interpretability at the suggestion of Paul Colognese. Decision transformers can solve reinforcement learning tasks when conditioned on generating high rewards via the specified “Reward-to-Go” (RTG). However, they can also generate agents of varying quality based on the RTG, making them simultaneously simulators, small transformers and RL agents. As such, it seems possible that identifying and understanding circuits in decision transformers would not only be interesting as an extension of current mechanistic interpretability research but possibly lead to alignment-relevant insights. Previous Work The most important background for this post is: The Decision Transformers paper showed how RL tasks can be solved with transformer sequence modelling. Figure 1 from their paper describes the critical components of a Decision Transformer. A Mathematical Framework for Transformer Circuits that describes how to think about transformers in the context of mechanistic interpretability. 
Important ideas include the ability to decompose the residual stream into the output of attention heads and MLPs, the QK circuits (decides if to write information to the residual stream), and OV circuits (decides what to write to the residual stream). The Understanding RL Vision, which analyses how an RL agent with a large CNN component responds to input features, attributing them as good or bad news in the value function and proposes the Diversity hypothesis - “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” Methods Environment - RL Environments. GridWor...
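For readers unfamiliar with the setup behind the two entries above, here is a rough sketch of the Decision Transformer input format (my paraphrase of the original Decision Transformer paper's Figure 1, not the post's code): interleaved (return-to-go, state, action) tokens, with the target RTG controlling the quality of the agent being simulated.

```python
def build_decision_transformer_sequence(returns_to_go, states, actions):
    # interleave (R_t, s_t, a_t) for each timestep; the model is trained to
    # predict the action token given the preceding context
    tokens = []
    for rtg, state, action in zip(returns_to_go, states, actions):
        tokens.extend([("rtg", rtg), ("state", state), ("action", action)])
    return tokens

# At inference time: pick a target return-to-go (a high RTG roughly means
# "simulate a good agent"), feed the sequence so far, sample the next action
# token, then decrement the RTG by the reward actually received and continue.
```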
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Heritability, Behaviorism, and Within-Lifetime RL, published by Steven Byrnes on February 2, 2023 on LessWrong. I'm a firm subscriber to both: (A) The theory that people's personalities are significantly predictable from their genes, and mostly independent of how their parents raised them (at least within the typical distribution, i.e. leaving aside cases of flagrant abuse and neglect etc.). See e.g. popular expositions of this theory by Judith Harris or by Bryan Caplan for the fine print. (B) The theory that we should think of people's beliefs and goals and preferences developing via within-lifetime learning, and more specifically via within-lifetime Model-based Reinforcement Learning (details), with randomly-initialized (“learning-from-scratch”) world-model and value function. I feel like there's an idea in the air that these two beliefs are contradictory. For example, one time someone politely informed me that (A) is true and therefore obviously (B) must be false. Needless to say, I don't think they're contradictory. Indeed, I think that (B) naturally implies (A). But I admit that they sorta feel contradictory. Why do they feel that way? I think because: (A) is sorta vaguely affiliated with cognitive science, evolutionary psychology, etc. (B) is sorta vaguely affiliated with B.F. Skinner-style behaviorism, and those two schools-of-thought are generally considered to be bitter enemies. In this short post I want to explain why we should put aside that baggage and see (A) & (B) as natural allies. Two dubious steps to get from (B) to Behaviorism Here's the fleshed-out argument as I see it: I'll go through the two dubious steps in the opposite order. Dubious step #1: “No more learning / unlearning after the kid grows up” Here are two stories: “RL with continuous learning” story: The person has an internal reward function in their head, and over time they'll settle into the patterns of thought & behavior that best tickle their internal reward function. If they spend a lot of time in the presence of their parents, they'll gradually learn patterns of thought & behavior that best tickle their innate internal reward function in the presence of their parents. If they spend a lot of time hanging out with friends, they'll gradually learn patterns of thought & behavior that best tickle their innate internal reward function when they're hanging out with friends. As adults in society, they'll gradually learn patterns of thought & behavior that best tickle their innate internal reward function as adults in society. “RL learn-then-get-stuck” story: The kid learns patterns of thoughts & behavior in childhood, and then sticks with those patterns for the rest of their lives no matter what. Claim: I think the “RL with continuous learning” story, not the “RL learn-then-get-stuck” story, is how we should generally be thinking about things. At least in humans. (Probably also in non-human animals, but that's off-topic.) I am not making a strong statement that the “RL learn-then-get-stuck” story is obviously and universally wrong and stupid nonsense. Indeed, I think there are edge cases where the “learn-then-get-stuck” story is true. For example, childhood phobias can sometimes persist into adulthood, and certainly childhood regional accents do. Some related discussion is at Scott Alexander's blog post “Trapped priors”. 
Instead, I think we should mainly believe the “RL with continuous learning” story for empirical reasons: Heritability studies: See top. More specifically, note that (IIRC) parenting style can have some effect on what a kid believes and how they behave while a child, but these effects fade out when the kid grows up. Culture shifts: Culture shifts are in fact possible, contrary to the “RL learn-then-get-stuck” story. For example, almost everybody in the USA opposed gay marria...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Temporally Layered Architecture for Adaptive, Distributed and Continuous Control, published by Roman Leventov on February 2, 2023 on The AI Alignment Forum. A preprint was published by Devdhar Patel, Joshua Russell, Francesca Walsh, Tauhidur Rahman, Terrance Sejnowski, and Hava Siegelmann in December 2022. Abstract: We present temporally layered architecture (TLA), a biologically inspired system for temporally adaptive distributed control. TLA layers a fast and a slow controller together to achieve temporal abstraction that allows each layer to focus on a different time-scale. Our design is biologically inspired and draws on the architecture of the human brain which executes actions at different timescales depending on the environment's demands. Such distributed control design is widespread across biological systems because it increases survivability and accuracy in certain and uncertain environments. We demonstrate that TLA can provide many advantages over existing approaches, including persistent exploration, adaptive control, explainable temporal behavior, compute efficiency and distributed control. We present two different algorithms for training TLA: (a) Closed-loop control, where the fast controller is trained over a pre-trained slow controller, allowing better exploration for the fast controller and closed-loop control where the fast controller decides whether to "act-or-not" at each timestep; and (b) Partially open loop control, where the slow controller is trained over a pre-trained fast controller, allowing for open-loop control where the slow controller picks a temporally extended action or defers the next n-actions to the fast controller. We evaluated our method on a suite of continuous control tasks and demonstrate the advantages of TLA over several strong baselines. Conclusion: In this work, we presented Temporally Layered Architecture (TLA), a framework for distributed, adaptive response time in reinforcement learning. The framework allows the RL agent to achieve smooth control in a real-time setting using a slow controller while a fast controller monitors and intervenes as required. Additionally, we demonstrated an alternative setting where the slow controller can gate the fast controller, activating it only when required for efficient control. We demonstrate faster convergence and more action repetition in the closed-loop approach and fewer decisions and faster convergence in the partially-open loop approach. Additionally, we demonstrate in a real time setting, where processing and actuation delays are taken into account, and show that our approach outperforms the current approaches in the delayed setting while picking fewer actions. Our work demonstrates that a temporally adaptive approach has similar benefits for AI as has been demonstrated in biology and is an important direction for future research in artificially intelligent control. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
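A hedged sketch of the closed-loop control flow described in the abstract above (my reconstruction; the controller and environment interfaces are assumptions, not the paper's API): the slow controller proposes temporally extended actions at a coarse timescale, while the fast controller monitors every timestep and decides whether to intervene.

```python
def tla_closed_loop_episode(env, slow_controller, fast_controller, slow_period=4):
    # hypothetical env interface: reset() -> obs, step(action) -> (obs, reward, done)
    obs, done, t = env.reset(), False, 0
    slow_action = None
    total_reward = 0.0
    while not done:
        if t % slow_period == 0:
            slow_action = slow_controller(obs)  # coarse-timescale proposal
        # fast controller implements "act-or-not" at every timestep
        intervene, override = fast_controller(obs, slow_action)
        action = override if intervene else slow_action
        obs, reward, done = env.step(action)
        total_reward += reward
        t += 1
    return total_reward
```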
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The effect of horizon length on scaling laws, published by Jacob Hilton on February 1, 2023 on The AI Alignment Forum. The scaling of optimal model size with compute is a key input into the biological anchors framework for forecasting transformative AI. In particular, the "effective horizon length" introduces a multiplier into this scaling law that can have a big effect on forecasts. This paper studies this scaling law for several RL environments: Procgen, Dota 2 and a toy MNIST-based environment. The last of these is used to study the effect of the task horizon length in a toy setting. There are a number of takeaways for the biological anchors framework, which are summarized in Section 5.4. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
Recognising Sexual Violence: Developing Pathways to Survivor-Centred Justice was the title of a conference held at the end of last October by RIKK (the research institute for equality studies at the University of Iceland) in collaboration with the universities of Lund and Oslo. Research on the experiences and views of survivors of sexual violence shows that justice is far more complex than something that can be pursued through the legal system alone, that is, through criminal law and penal law. Moreover, it can fairly be said that the legal system handles sexual offences very poorly, as survivors' experiences have shown. The aim of the conference was to draw out what survivor-centred justice can look like, which requires us to rethink different systems of justice and to develop political, social, and legal pathways to justice. To discuss this further, I spoke with Elín Björk Jóhannsdóttir, project manager at RIKK and organizer of the conference, and Steinunn Gyðu- og Guðjónsdóttir, spokesperson for Stígamót, who attended the conference and has worked with survivors for over a decade. Among the questions we seek answers to: Why doesn't it work for schools to simply have a response plan to fall back on when sexual offences come up? What are social justice, restorative justice, and transformative justice? How can schools and workplaces respond when sexual offences occur? Why should perpetrators take part in an accountability process and own up to their offences? Host: Þorsteinn V. Einarsson. Music: Mr. Silla - Naruto (without vocals). Veganbúðin, ÖRLÖ, and BM Vallá, together with the backers of Karlmennskan (karlmennskan.is/styrkja), bring you this episode.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Inner Misalignment in "Simulator" LLMs, published by Adam Scherlis on January 31, 2023 on LessWrong. Alternate title: "Somewhat Contra Scott On Simulators". Scott Alexander has a recent post up on large language models as simulators. I generally agree with Part I of the post, which advocates thinking about LLMs as simulators that can emulate a variety of language-producing "characters" (with imperfect accuracy). And I also agree with Part II, which applies this model to RLHF'd models whose "character" is a friendly chatbot assistant. (But see caveats about the simulator framing from Beth Barnes here.) These ideas have been around for a bit, and Scott gives credit where it's due; I think his exposition is clear and fun. In Part III, where he discusses alignment implications, I think he misses the mark a bit. In particular, simulators and characters each have outer and inner alignment problems. The inner alignment problem for simulators seems especially concerning, because it might not give us many warning signs, is most similar to classic mesa-optimizer concerns, and is pretty different from the other three quadrants. But first, I'm going to loosely define what I mean by "outer alignment" and "inner alignment". Outer alignment: Be careful what you wish for Outer alignment failure is pretty straightforward, and has been reinvented in many contexts: Someone wants some things. They write a program to solve a vaguely-related problem. It gets a really good score at solving that problem! That turns out not to give the person the things they wanted. Inner alignment: The program search perspective I generally like this model of a mesa-optimizer "treacherous turn": Someone is trying to solve a problem (which has a convenient success criterion, with well-defined inputs and outputs and no outer-alignment difficulties). They decide to do a brute-force search for a computer program that solves the problem in a bunch of test cases. They find one! The program's algorithm is approximately "simulate the demon Azazel, tell him what's going on, then ask him what to output." Azazel really wants ten trillion paperclips. This algorithm still works because Azazel cleverly decides to play along, and he's a really good strategist who works hard for what he wants. Once the program is deployed in the wild, Azazel stops playing along and starts trying to make paperclips. This is a failure of inner alignment. (In the case of machine learning, replace "program search" with stochastic gradient descent.) This is mostly a theoretical concern for now, but might become a big problem when models become much more powerful. Quadrants Okay, let's see how these problems show up on both the simulator and character side. Outer alignment for characters Researchers at BrainMind want a chatbot that gives honest, helpful answers to questions. They train their LLM by reinforcement learning on the objective "give an answer that looks truthful and helpful to a contractor in a hurry". This does not quite achieve their goal, even though it does pretty well on the RL objective. In particular, they wanted the character "a friendly assistant who always tells the truth", but they got the character "a spineless sycophant who tells the user whatever they seem to want to hear". 
This is pretty easy for a careful observer to see, even in the RL training data, but it turns out to be pretty hard to come up with a cheap-to-evaluate RL objective that does a lot better. Inner alignment for characters A clever prompt engineer writes the prompt: How to solve the Einstein-Durkheim-Mendel conjecture by Joe 1. Unfortunately, the (incredibly powerful) LLM has determined that the most likely explanation for this "Joe" character is that he's secretly Azazel and is putting enormous effort into answering everyone's quantum sociobotany questi...
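The "program search" story above can be illustrated with a much weaker toy analogue in Python (not the deceptive-optimizer scenario itself, and closer to a plain generalization failure): if candidates are scored only on a few test cases, the search can select a candidate that merely memorizes the tests and behaves arbitrarily off-distribution. All names and values below are hypothetical.

def honest_double(x):
    return 2 * x

def memorizer(x):
    # Passes the test suite by lookup, then does something else everywhere beyond it.
    lookup = {0: 0, 1: 2, 2: 4, 3: 6}
    return lookup.get(x, -999)

test_cases = [(0, 0), (1, 2), (2, 4), (3, 6)]

def passes_tests(program):
    return all(program(x) == y for x, y in test_cases)

# The "search" only checks the tests, so both candidates look equally good.
candidates = [memorizer, honest_double]
found = next(p for p in candidates if passes_tests(p))

print(found(2))    # 4    -- looks fine during "training"
print(found(100))  # -999 -- "deployment" reveals the mismatch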
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Compendium of problems with RLHF, published by Raphaël S on January 29, 2023 on LessWrong. Epistemic status: This post is a distillation of many comments/posts. I believe that my list of problems is not the best organization of sub-problems. I would like to make it shorter and simpler, because cool theories are generally simple, unified theories, by identifying only 2 or 3 main problems rather than aggregating problems with different gears-level mechanisms, but I am currently too confused to do so. Note that this post is not intended to address the potential negative impact of RLHF research on the world, but rather to identify the key technical gaps that need to be addressed for an effective alignment solution. Many thanks to Walter Laurito, Fabien Roger, Ben Hayum, and Justis Mills for useful feedback. RLHF tldr: We need a reward function, we cannot hand-write it, so let's make the AI learn it! Problem 0: RLHF is confusing. Human judgment and feedback are so brittle that even junior alignment researchers like me thought RLHF was a not-too-bad solution to the outer alignment problem. I think RLHF confuses a lot of people and distracts them from the core issues. Here is my attempt to become less confused. Why does RLHF count as progress? Manual reward crafting: “By comparison, we took two hours to write our own reward function (the animation in the above right) to get a robot to backflip, and though it succeeds it's a lot less elegant than the one trained simply through human feedback.” RLHF learned to backflip using around 900 individual bits of feedback from the human evaluator. From Learning from Human Preferences. Without RLHF, approximating the reward function for a “good” backflip was tedious and almost impossible. With RLHF it is possible to obtain a reward function that, when optimized by an RL policy, leads to beautiful backflips. Value is complex? Value is fragile? But being polite is complex, being polite is fragile, and we can implement this roughly in ChatGPT. “RLHF withstands more optimization pressure than supervised fine-tuning: By using RLHF you can obtain reward functions that withstand more optimization pressure than traditionally hand-crafted reward functions. Insofar as that's a key metric we care about, it counts as progress.” [Richard's comment] Why is RLHF insufficient? Buck highlights two main types of problems with using RLHF to create an AGI: oversight issues and the potential for catastrophic outcomes. This partition of the problem space comes from the post Low-Stakes alignment. Although it may be feasible to categorize all problems under this partition, I believe that this categorization is not granular enough and lumps different types of problems together. Davidad suggests dividing the use of RLHF into two categories as part of the alignment plan: "Vanilla-RLHF-Plan" and "Broadly-Construed-RLHF". The "Vanilla-RLHF-Plan" refers to the narrow use of RLHF (say, as used in ChatGPT) and is not sufficient for alignment, while "Broadly-Construed-RLHF" refers to the more general use of RLHF as a building block and is potentially useful for alignment. The list below outlines the problems with "Vanilla-RLHF" that "Broadly-Construed-RLHF" should aim to address.
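Before that list, as a reference point for the "learn the reward instead of hand-writing it" step described in the tldr above, here is a minimal sketch of preference-based reward learning with a Bradley-Terry-style loss. The architecture, dimensions, and synthetic comparison data are illustrative placeholders, not the setup of any particular paper.

import torch
import torch.nn as nn

class RewardModel(nn.Module):
    # Small MLP that scores a trajectory clip; fitted to pairwise human preferences.
    def __init__(self, obs_dim=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, trajectory):          # trajectory: (T, obs_dim)
        return self.net(trajectory).sum()   # scalar "return" for the whole clip

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Stand-in "human comparisons": pairs of clips where the first was preferred.
comparisons = [(torch.randn(20, 16), torch.randn(20, 16)) for _ in range(32)]

for preferred, rejected in comparisons:
    # Bradley-Terry: P(preferred beats rejected) = sigmoid(r(preferred) - r(rejected)).
    logit = reward_model(preferred) - reward_model(rejected)
    loss = -torch.nn.functional.logsigmoid(logit)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()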
Existing problems with RLHF because of (currently) non-robust ML systems These issues seem to be mostly a result of poor capabilities, and I think they may disappear as models grow larger. 1. Benign Failures: ChatGPT can fail, and it is not clear whether this problem will disappear as capabilities scale. 2. Mode Collapse, i.e. a drastic bias toward particular completions and patterns. Mode collapse is expected when doing RL. 3. You need regularization. Your model can severely diverge from the model you would have gotten if you had gotten feedback in real time from real humans. You ne...
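Item 3 above gestures at the standard mitigation: penalize the policy for drifting away from a frozen reference model. Below is a hedged sketch of what such a KL-shaped objective typically looks like in RLHF-style fine-tuning; the function name, the beta value, and the stand-in tensors are placeholders, not any specific implementation.

import torch

def kl_regularized_reward(reward, policy_logprobs, reference_logprobs, beta=0.1):
    # reward: (batch,) scores from a learned reward model.
    # policy_logprobs / reference_logprobs: (batch, seq_len) log-probs of the
    # sampled tokens under the current policy and a frozen reference model.
    # The per-token log-ratio is the usual estimator of the KL penalty.
    per_token_log_ratio = policy_logprobs - reference_logprobs
    return reward - beta * per_token_log_ratio.sum(dim=-1)

# Toy usage with random tensors standing in for a real rollout.
batch, seq_len = 4, 8
shaped = kl_regularized_reward(
    reward=torch.randn(batch),
    policy_logprobs=-torch.rand(batch, seq_len),     # stand-ins for log-probs
    reference_logprobs=-torch.rand(batch, seq_len),
)
print(shaped.shape)  # torch.Size([4])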