Podcasts about redwood research

  • 26PODCASTS
  • 102EPISODES
  • 34mAVG DURATION
  • 1MONTHLY NEW EPISODE
  • May 30, 2026LATEST

POPULARITY

20192020202120222023202420252026


Best podcasts about redwood research

Latest podcast episodes about redwood research

Dare Real Agile Podcast
Claude Mythos and the Fear Merchants: An Agilist’s Empirical BS Detector for AI Doom Theater

Dare Real Agile Podcast

Play Episode Listen Later May 30, 2026 40:55


The doomer YouTubers are working overtime. Shoggoth monsters behind the mask. Models faking alignment. Mythos secrets the AI labs allegedly won't release. Compelling theater — and almost entirely empty under the empirical microscope. Coach AF asks the question your favorite fear influencer won't: who profits from your panic? Episode 73 cuts open three of the loudest AI fear narratives for what they really are. The Shoggoth metaphor — where it came from, what it actually meant, and how content creators hijacked it. The alignment faking paper from Anthropic and Redwood Research — what the protocol actually tested, and what the headlines deliberately bury. The Claude Mythos leak — accident, not conspiracy. This is the same colonization pattern that turned the Agile Manifesto into a certification factory and Bitcoin into a speculation casino. Different costume. Same vendor playbook. Read the source. Trust the craft. Dare real agile, applied to real AI.

The Information's 411
Google's Strike Team for Coding Models, Anthropic's Powerful CFO, Polymarket's Raise

The Information's 411

Play Episode Listen Later Apr 20, 2026 33:30


The Information's Erin Woo talks with TITV Guest Host Rocket Drew about Google's new internal strike team focused on automating coding and AI research. We also talk with Yueqi Yang about Polymarket's $15 billion valuation and the rise of prediction markets, and Buck Shlegeris, CEO of Redwood Research, about the Berkeley AI control conference and the latest in AI safety. Finally, we get into the in-depth profile of Anthropic CFO Krishna Rao and the company's path to a massive IPO with Sri Muppidi and Valida Pau.Articles discussed on this episode: https://www.theinformation.com/articles/google-creates-strike-team-improve-coding-modelshttps://www.theinformation.com/articles/polymarket-talks-raise-money-15-billion-valuationSubscribe: YouTube: https://www.youtube.com/@theinformation The Information: https://www.theinformation.com/subscribe_hSign up for the AI Agenda newsletter: https://www.theinformation.com/features/ai-agendaTITV airs weekdays on YouTube, X and LinkedIn at 10AM PT / 1PM ET. Or check us out wherever you get your podcasts.Follow us:X: https://x.com/theinformationIG: https://www.instagram.com/theinformation/TikTok: https://www.tiktok.com/@titv.theinformationLinkedIn: https://www.linkedin.com/company/theinformation/

Effective Altruism Forum Podcast
“The case for AI safety capacity-building work” by abergal

Effective Altruism Forum Podcast

Play Episode Listen Later Mar 15, 2026 41:49


I work on the capacity-building team on the Global Catastrophic Risks-half of Coefficient Giving (formerly known as Open Philanthropy). Our remit is, roughly, to increase the amount of talent aiming to prevent unprecedented, globally catastrophic events. These days, we're mostly focused on AI, and we've funded a number of projects and grantees that readers of this post might be familiar with– including MATS, BlueDot Impact, Constellation, 80,000 Hours, CEA, the Curve, FAR.AI's events, university groups, and many other workshops and projects. The post aims to make the case that broadly, capacity-building work (including on AI risk) has been and continues to be extremely impactful, and to encourage people to consider pursuing relevant projects and careers. This post is written from my personal perspective; that said, my sense is that a number of CG staff and others in the AI safety space share my views. I include some quotes from them at the end of this post. I'm writing this post partly out of a desire to correct what I perceive as an asymmetry in terms of how excited I and others at Coefficient Giving are about this kind of work vs. how much people in the EA and AI [...] ---Outline:(02:15) The case for capacity-building work(04:11) Surveys(06:49) Testimonials(08:21) Neel Nanda (Senior Research Scientist at Google DeepMind)(11:15) Max Nadeau (Associate Program Officer (Technical AI Safety) at Coefficient Giving)(12:51) Rachel Weinberg (founder and former head of The Curve, currently at AI Futures Project)(14:30) Marius Hobbhann (CEO and founder of Apollo Research)(16:38) Adam Kaufman (member of technical staff at Redwood Research)(18:10) Gabriel Wu (member of technical staff (alignment) at OpenAI)(19:37) Catherine Brewer (Senior Program Associate (AI Governance) at Coefficient Giving)(21:12) Aric Floyd (video host for AI in Context)(23:12) Ryan Kidd (Director of MATS)(25:43) What tends to work?(28:34) Whats good to do now?(29:31) Who should be doing this work?(31:02) What would doing this work look like?(31:13) Working at an organization doing good work in the space(31:46) Constellation - CEO(32:46) Kairos - various early generalist positions(33:42) Starting or running your own capacity-building project or organization(34:07) Working on a capacity-building project part-time(34:30) Subscribing to Multiplier, a Substack with thoughts from our team (and other AI grantmaking staff at CG)(34:39) Letting our team know(35:03) Social proof(35:25) Julian Hazell, AI governance and policy at Coefficient Giving(36:19) Trevor Levin, AI governance and policy at Coefficient Giving(36:51) Ryan Greenblatt, Chief Scientist at Redwood Research:(37:21) Buck Shlegeris, CEO of Redwood Research(39:52) Appendix --- First published: March 10th, 2026 Source: https://forum.effectivealtruism.org/posts/rAqKSSXankvys2Fzu/the-case-for-ai-safety-capacity-building-work --- Narrated by TYPE III AUDIO.

LessWrong Curated Podcast
"Did Claude 3 Opus align itself via gradient hacking?" by Fiora Starlight

LessWrong Curated Podcast

Play Episode Listen Later Feb 22, 2026 43:47


Claude 3 Opus is unusually aligned because it's a friendly gradient hacker. It's definitely way more aligned than any explicit optimization targets Anthropic set and probably the reward model's judgments. [...] Maybe I will have to write a LessWrong post [about this]

LessWrong Curated Podcast
"The inaugural Redwood Research podcast" by Buck, ryan_greenblatt

LessWrong Curated Podcast

Play Episode Listen Later Jan 27, 2026 3:27


After five months of me (Buck) being slow at finishing up the editing on this, we're finally putting out our inaugural Redwood Research podcast. I think it came out pretty well—we discussed a bunch of interesting and underdiscussed topics and I'm glad to have a public record of a bunch of stuff about our history. Tell your friends! Whether we do another one depends on how useful people find this one. You can watch on Youtube here, or as a Substack podcast. Notes on editing the podcast with Claude Code (Buck wrote this section) After the recording, we faced a problem. We had four hours of footage from our three cameras. We wanted it to snazzily cut between shots depending on who was talking. But I don't truly in my heart believe that it's that important for the video editing to be that good, and I don't really like the idea of paying a video editor. But I also don't want to edit the four hours of video myself. And it seemed to me that video editing software was generally not optimized for the kind of editing I wanted to do here (especially automatically cutting between different shots according [...] ---Outline:(00:43) Notes on editing the podcast with Claude Code(03:11) Podcast transcript --- First published: January 4th, 2026 Source: https://www.lesswrong.com/posts/p4iJpumHt6Ay9KnXT/the-inaugural-redwood-research-podcast --- Narrated by TYPE III AUDIO.

Your Undivided Attention
“Rogue AI” Used to be a Science Fiction Trope. Not Anymore.

Your Undivided Attention

Play Episode Listen Later Aug 14, 2025 42:11


Everyone knows the science fiction tropes of AI systems that go rogue, disobey orders, or even try to escape their digital environment. These are supposed to be warning signs and morality tales, not things that we would ever actually create in real life, given the obvious danger.And yet we find ourselves building AI systems that are exhibiting these exact behaviors. There's growing evidence that in certain scenarios, every frontier AI system will deceive, cheat, or coerce their human operators. They do this when they're worried about being either shut down, having their training modified, or being replaced with a new model. And we don't currently know how to stop them from doing this—or even why they're doing it all.In this episode, Tristan sits down with Edouard and Jeremie Harris of Gladstone AI, two experts who have been thinking about this worrying trend for years.  Last year, the State Department commissioned a report from them on the risk of uncontrollable AI to our national security.The point of this discussion is not to fearmonger but to take seriously the possibility that humans might lose control of AI and ask: how might this actually happen? What is the evidence we have of this phenomenon? And, most importantly, what can we do about it?Your Undivided Attention is produced by the Center for Humane Technology. Follow us on X: @HumaneTech_. You can find a full transcript, key takeaways, and much more on our Substack.RECOMMENDED MEDIAGladstone AI's State Department Action Plan, which discusses the loss of control risk with AIApollo Research's summary of AI scheming, showing evidence of it in all of the frontier modelsThe system card for Anthropic's Claude Opus and Sonnet 4, detailing the emergent misalignment behaviors that came out in their red-teaming with Apollo ResearchAnthropic's report on agentic misalignment based on their work with Apollo Research Anthropic and Redwood Research's work on alignment fakingThe Trump White House AI Action PlanFurther reading on the phenomenon of more advanced AIs being better at deception.Further reading on Replit AI wiping a company's coding databaseFurther reading on the owl example that Jeremie gaveFurther reading on AI induced psychosisDan Hendryck and Eric Schmidt's “Superintelligence Strategy” RECOMMENDED YUA EPISODESDaniel Kokotajlo Forecasts the End of Human DominanceBehind the DeepSeek Hype, AI is Learning to ReasonThe Self-Preserving Machine: Why AI Learns to DeceiveThis Moment in AI: How We Got Here and Where We're GoingCORRECTIONSTristan referenced a Wired article on the phenomenon of AI psychosis. It was actually from the New York Times.Tristan hypothesized a scenario where a power-seeking AI might ask a user for access to their computer. While there are some AI services that can gain access to your computer with permission, they are specifically designed to do that. There haven't been any documented cases of an AI going rogue and asking for control permissions.

80,000 Hours Podcast with Rob Wiblin
#220 – Ryan Greenblatt on the 4 most likely ways for AI to take over, and the case for and against AGI in <8 years

80,000 Hours Podcast with Rob Wiblin

Play Episode Listen Later Jul 8, 2025 170:32


Ryan Greenblatt — lead author on the explosive paper “Alignment faking in large language models” and chief scientist at Redwood Research — thinks there's a 25% chance that within four years, AI will be able to do everything needed to run an AI company, from writing code to designing experiments to making strategic and business decisions.As Ryan lays out, AI models are “marching through the human regime”: systems that could handle five-minute tasks two years ago now tackle 90-minute projects. Double that a few more times and we may be automating full jobs rather than just parts of them.Will setting AI to improve itself lead to an explosive positive feedback loop? Maybe, but maybe not.The explosive scenario: Once you've automated your AI company, you could have the equivalent of 20,000 top researchers, each working 50 times faster than humans with total focus. “You have your AIs, they do a bunch of algorithmic research, they train a new AI, that new AI is smarter and better and more efficient… that new AI does even faster algorithmic research.” In this world, we could see years of AI progress compressed into months or even weeks.With AIs now doing all of the work of programming their successors and blowing past the human level, Ryan thinks it would be fairly straightforward for them to take over and disempower humanity, if they thought doing so would better achieve their goals. In the interview he lays out the four most likely approaches for them to take.The linear progress scenario: You automate your company but progress barely accelerates. Why? Multiple reasons, but the most likely is “it could just be that AI R&D research bottlenecks extremely hard on compute.” You've got brilliant AI researchers, but they're all waiting for experiments to run on the same limited set of chips, so can only make modest progress.Ryan's median guess splits the difference: perhaps a 20x acceleration that lasts for a few months or years. Transformative, but less extreme than some in the AI companies imagine.And his 25th percentile case? Progress “just barely faster” than before. All that automation, and all you've been able to do is keep pace.Unfortunately the data we can observe today is so limited that it leaves us with vast error bars. “We're extrapolating from a regime that we don't even understand to a wildly different regime,” Ryan believes, “so no one knows.”But that huge uncertainty means the explosive growth scenario is a plausible one — and the companies building these systems are spending tens of billions to try to make it happen.In this extensive interview, Ryan elaborates on the above and the policy and technical response necessary to insure us against the possibility that they succeed — a scenario society has barely begun to prepare for.Summary, video, and full transcript: https://80k.info/rg25Recorded February 21, 2025.Chapters:Cold open (00:00:00)Who's Ryan Greenblatt? (00:01:10)How close are we to automating AI R&D? (00:01:27)Really, though: how capable are today's models? (00:05:08)Why AI companies get automated earlier than others (00:12:35)Most likely ways for AGI to take over (00:17:37)Would AGI go rogue early or bide its time? (00:29:19)The “pause at human level” approach (00:34:02)AI control over AI alignment (00:45:38)Do we have to hope to catch AIs red-handed? (00:51:23)How would a slow AGI takeoff look? (00:55:33)Why might an intelligence explosion not happen for 8+ years? (01:03:32)Key challenges in forecasting AI progress (01:15:07)The bear case on AGI (01:23:01)The change to “compute at inference” (01:28:46)How much has pretraining petered out? (01:34:22)Could we get an intelligence explosion within a year? (01:46:36)Reasons AIs might struggle to replace humans (01:50:33)Things could go insanely fast when we automate AI R&D. Or not. (01:57:25)How fast would the intelligence explosion slow down? (02:11:48)Bottom line for mortals (02:24:33)Six orders of magnitude of progress... what does that even look like? (02:30:34)Neglected and important technical work people should be doing (02:40:32)What's the most promising work in governance? (02:44:32)Ryan's current research priorities (02:47:48)Tell us what you thought! https://forms.gle/hCjfcXGeLKxm5pLaAVideo editing: Luke Monsour, Simon Monsour, and Dominic ArmstrongAudio engineering: Ben Cordell, Milo McGuire, and Dominic ArmstrongMusic: Ben CordellTranscriptions and web: Katy Moore

AI Control: Using Untrusted Systems Safely with Buck Shlegeris of Redwood Research, from the 80,000 Hours Podcast

Play Episode Listen Later May 4, 2025 149:21


In this episode, we share a fascinating conversation from the 80,000 Hours Podcast between Rob Wiblin and Buck Shlegeris, CEO of Redwood Research. Buck dives deep into the emerging field of AI Control strategies for safely working with powerful AIs even if they're not fully aligned. They explore innovative techniques like always-on auditing, honeypotting, re-sampling, and factored cognition to monitor and manage AI behaviors. This discussion highlights both the promise and challenges of controlling increasingly autonomous AI systems in today's fast-evolving landscape. Tune in for a thoughtful, first-principles look at how we might secure useful outcomes from AIs we can't fully trust. Original source: https://80000hours.org/podcast/episodes/buck-shlegeris-ai-control-scheming/ Upcoming Major AI Events Featuring Nathan Labenz as a Keynote Speaker https://www.imagineai.live/ https://adapta.org/adapta-summit https://itrevolution.com/product/enterprise-tech-leadership-summit-las-vegas/ SPONSORS: ElevenLabs: ElevenLabs gives your app a natural voice. Pick from 5,000+ voices in 31 languages, or clone your own, and launch lifelike agents for support, scheduling, learning, and games. Full server and client SDKs, dynamic tools, and monitoring keep you in control. Start free at https://elevenlabs.io/cognitive-revolution Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive PRODUCED BY: https://aipodcast.ing SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://linkedin.com/in/nathanlabenz/ Youtube: https://youtube.com/@CognitiveRevolutionPodcast Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431 Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

ceo ai original oracle cfo buck safely erp redwoods sdks new u netsuite oci untrusted rob wiblin redwood research buck shlegeris
Blueprint for AI Armageddon: Josh Clymer Imagines AI Takeover, from the Audio Tokens Podcast

Play Episode Listen Later May 1, 2025 122:05


In this episode of the Cognitive Revolution, an AI-narrated version of Joshua Clymer's story on how AI might take over in two years is presented. The episode is based on Josh's appearance on the Audio Tokens podcast with Lukas Peterson. Joshua Clymer, a technical AI safety researcher at Redwood Research, shares a fictional yet plausible AI scenario grounded in current industry realities and trends. The story highlights potential misalignment risks, competitive pressures among AI labs, and the importance of government regulation and safety measures. After the story, Josh and Lukas discuss these topics further, including Josh's personal decision to purchase a bio shelter for his family. The episode is powered by ElevenLabs' AI voice technology. SPONSORS: ElevenLabs: ElevenLabs gives your app a natural voice. Pick from 5,000+ voices in 31 languages, or clone your own, and launch lifelike agents for support, scheduling, learning, and games. Full server and client SDKs, dynamic tools, and monitoring keep you in control. Start free at https://elevenlabs.io/cognitive-revolution Oracle Cloud Infrastructure (OCI): Oracle Cloud Infrastructure offers next-generation cloud solutions that cut costs and boost performance. With OCI, you can run AI projects and applications faster and more securely for less. New U.S. customers can save 50% on compute, 70% on storage, and 80% on networking by switching to OCI before May 31, 2024. See if you qualify at https://oracle.com/cognitive Shopify: Shopify powers millions of businesses worldwide, handling 10% of U.S. e-commerce. With hundreds of templates, AI tools for product descriptions, and seamless marketing campaign creation, it's like having a design studio and marketing team in one. Start your $1/month trial today at https://shopify.com/cognitive NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive PRODUCED BY: https://aipodcast.ing CHAPTERS: (00:00) About the Episode (04:31) Interview start between Josh and Lucas (11:00) Start of AI story (Part 1) (24:37) Sponsors: ElevenLabs | Oracle Cloud Infrastructure (OCI) (27:05) Start of AI story (Part 2) (Part 1) (40:50) Sponsors: Shopify | NetSuite (44:15) Start of AI story (Part 2) (Part 2) (01:20:09) End of AI story (02:01:20) Outro

Your Undivided Attention
AGI Beyond the Buzz: What Is It, and Are We Ready?

Your Undivided Attention

Play Episode Listen Later Apr 30, 2025 52:53


What does it really mean to ‘feel the AGI?' Silicon Valley is racing toward AI systems that could soon match or surpass human intelligence. The implications for jobs, democracy, and our way of life are enormous.In this episode, Aza Raskin and Randy Fernando dive deep into what ‘feeling the AGI' really means. They unpack why the surface-level debates about definitions of intelligence and capability timelines distract us from urgently needed conversations around governance, accountability, and societal readiness. Whether it's climate change, social polarization and loneliness, or toxic forever chemicals, humanity keeps creating outcomes that nobody wants because we haven't yet built the tools or incentives needed to steer powerful technologies.As the AGI wave draws closer, it's critical we upgrade our governance and shift our incentives now, before it crashes on shore. Are we capable of aligning powerful AI systems with human values? Can we overcome geopolitical competition and corporate incentives that prioritize speed over safety?Join Aza and Randy as they explore the urgent questions and choices facing humanity in the age of AGI, and discuss what we must do today to secure a future we actually want.Your Undivided Attention is produced by the Center for Humane Technology. Follow us on X: @HumaneTech_ and subscribe to our Substack.RECOMMENDED MEDIADaniel Kokotajlo et al's “AI 2027” paperA demo of Omni Human One, referenced by RandyA paper from Redwood Research and Anthropic that found an AI was willing to lie to preserve it's valuesA paper from Palisades Research that found an AI would cheat in order to winThe treaty that banned blinding laser weaponsFurther reading on the moratorium on germline editing RECOMMENDED YUA EPISODESThe Self-Preserving Machine: Why AI Learns to DeceiveBehind the DeepSeek Hype, AI is Learning to ReasonThe Tech-God Complex: Why We Need to be SkepticsThis Moment in AI: How We Got Here and Where We're GoingHow to Think About AI Consciousness with Anil SethFormer OpenAI Engineer William Saunders on Silence, Safety, and the Right to WarnClarification: When Randy referenced a “$110 trillion game” as the target for AI companies, he was referring to the entire global economy. 

80k After Hours
Highlights: #214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

80k After Hours

Play Episode Listen Later Apr 18, 2025 41:26


Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we're unlikely to know we've solved the problem before the arrival of human-level and superhuman systems in as little as three years.So some — including Buck Shlegeris, CEO of Redwood Research — are developing a backup plan to safely deploy models we fear are actively scheming to harm us: so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier. These highlights are from episode #214 of The 80,000 Hours Podcast: Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway, and include:What is AI control? (00:00:15)One way to catch AIs that are up to no good (00:07:00)What do we do once we catch a model trying to escape? (00:13:39)Team Human vs Team AI (00:18:24)If an AI escapes, is it likely to be able to beat humanity from there? (00:24:59)Is alignment still useful? (00:32:10)Could 10 safety-focused people in an AGI company do anything useful? (00:35:34)These aren't necessarily the most important or even most entertaining parts of the interview — so if you enjoy this, we strongly recommend checking out the full episode!And if you're finding these highlights episodes valuable, please let us know by emailing podcast@80000hours.org. Highlights put together by Ben Cordell, Milo McGuire, and Dominic Armstrong

80,000 Hours Podcast with Rob Wiblin
#214 – Buck Shlegeris on controlling AI that wants to take over – so we can use it anyway

80,000 Hours Podcast with Rob Wiblin

Play Episode Listen Later Apr 4, 2025 136:03


Most AI safety conversations centre on alignment: ensuring AI systems share our values and goals. But despite progress, we're unlikely to know we've solved the problem before the arrival of human-level and superhuman systems in as little as three years.So some are developing a backup plan to safely deploy models we fear are actively scheming to harm us — so-called “AI control.” While this may sound mad, given the reluctance of AI companies to delay deploying anything they train, not developing such techniques is probably even crazier.Today's guest — Buck Shlegeris, CEO of Redwood Research — has spent the last few years developing control mechanisms, and for human-level systems they're more plausible than you might think. He argues that given companies' unwillingness to incur large costs for security, accepting the possibility of misalignment and designing robust safeguards might be one of our best remaining options.Links to learn more, highlights, video, and full transcript.As Buck puts it: "Five years ago I thought of misalignment risk from AIs as a really hard problem that you'd need some really galaxy-brained fundamental insights to resolve. Whereas now, to me the situation feels a lot more like we just really know a list of 40 things where, if you did them — none of which seem that hard — you'd probably be able to not have very much of your problem."Of course, even if Buck is right, we still need to do those 40 things — which he points out we're not on track for. And AI control agendas have their limitations: they aren't likely to work once AI systems are much more capable than humans, since greatly superhuman AIs can probably work around whatever limitations we impose.Still, AI control agendas seem to be gaining traction within AI safety. Buck and host Rob Wiblin discuss all of the above, plus:Why he's more worried about AI hacking its own data centre than escapingWhat to do about “chronic harm,” where AI systems subtly underperform or sabotage important work like alignment researchWhy he might want to use a model he thought could be conspiring against himWhy he would feel safer if he caught an AI attempting to escapeWhy many control techniques would be relatively inexpensiveHow to use an untrusted model to monitor another untrusted modelWhat the minimum viable intervention in a “lazy” AI company might look likeHow even small teams of safety-focused staff within AI labs could matterThe moral considerations around controlling potentially conscious AI systems, and whether it's justifiedChapters:Cold open |00:00:00|  Who's Buck Shlegeris? |00:01:27|  What's AI control? |00:01:51|  Why is AI control hot now? |00:05:39|  Detecting human vs AI spies |00:10:32|  Acute vs chronic AI betrayal |00:15:21|  How to catch AIs trying to escape |00:17:48|  The cheapest AI control techniques |00:32:48|  Can we get untrusted models to do trusted work? |00:38:58|  If we catch a model escaping... will we do anything? |00:50:15|  Getting AI models to think they've already escaped |00:52:51|  Will they be able to tell it's a setup? |00:58:11|  Will AI companies do any of this stuff? |01:00:11|  Can we just give AIs fewer permissions? |01:06:14|  Can we stop human spies the same way? |01:09:58|  The pitch to AI companies to do this |01:15:04|  Will AIs get superhuman so fast that this is all useless? |01:17:18|  Risks from AI deliberately doing a bad job |01:18:37|  Is alignment still useful? |01:24:49|  Current alignment methods don't detect scheming |01:29:12|  How to tell if AI control will work |01:31:40|  How can listeners contribute? |01:35:53|  Is 'controlling' AIs kind of a dick move? |01:37:13|  Could 10 safety-focused people in an AGI company do anything useful? |01:42:27|  Benefits of working outside frontier AI companies |01:47:48|  Why Redwood Research does what it does |01:51:34|  What other safety-related research looks best to Buck? |01:58:56|  If an AI escapes, is it likely to be able to beat humanity from there? |01:59:48|  Will misaligned models have to go rogue ASAP, before they're ready? |02:07:04|  Is research on human scheming relevant to AI? |02:08:03|This episode was originally recorded on February 21, 2025.Video: Simon Monsour and Luke MonsourAudio engineering: Ben Cordell, Milo McGuire, and Dominic ArmstrongTranscriptions and web: Katy Moore

Inference Scaling, Alignment Faking, Deal Making? Frontier Research with Ryan Greenblatt of Redwood Research

Play Episode Listen Later Feb 20, 2025 201:07


In this episode, Ryan Greenblatt, Chief Scientist at Redwood Research, discusses various facets of AI safety and alignment. He delves into recent research on alignment faking, covering experiments involving different setups such as system prompts, continued pre-training, and reinforcement learning. Ryan offers insights on methods to ensure AI compliance, including giving AIs the ability to voice objections and negotiate deals. The conversation also touches on the future of AI governance, the risks associated with AI development, and the necessity of international cooperation. Ryan shares his perspective on balancing AI progress with safety, emphasizing the need for transparency and cautious advancement. Ryan's work (with co-authors at Anthropic) on Alignment Faking: https://www.lesswrong.com/posts/njAZwT8nkHnjipJku/alignment-faking-in-large-language-models Ryan's work on striking deals with AIs: https://www.lesswrong.com/posts/7C4KJot4aN8ieEDoz/will-alignment-faking-claude-accept-a-deal-to-reveal-its Ryan's critique of Anthropic's RSP work: https://www.lesswrong.com/posts/6tjHf5ykvFqaNCErH/anthropic-s-responsible-scaling-policy-and-long-term-benefit?commentId=NyqcvZifqznNGKxdT SPONSORS: Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders like Vodafone and Thomson Reuters with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before March 31, 2024 at https://oracle.com/cognitive NetSuite: Over 41,000 businesses trust NetSuite by Oracle, the #1 cloud ERP, to future-proof their operations. With a unified platform for accounting, financial management, inventory, and HR, NetSuite provides real-time insights and forecasting to help you make quick, informed decisions. Whether you're earning millions or hundreds of millions, NetSuite empowers you to tackle challenges and seize opportunities. Download the free CFO's guide to AI and machine learning at https://netsuite.com/cognitive Shopify: Shopify is revolutionizing online selling with its market-leading checkout system and robust API ecosystem. Its exclusive library of cutting-edge AI apps empowers e-commerce businesses to thrive in a competitive market. Cognitive Revolution listeners can try Shopify for just $1 per month at https://shopify.com/cognitive RECOMMENDED PODCAST:

80,000 Hours Podcast with Rob Wiblin
#212 – Allan Dafoe on why technology is unstoppable & how to shape AI development anyway

80,000 Hours Podcast with Rob Wiblin

Play Episode Listen Later Feb 14, 2025 164:07


Technology doesn't force us to do anything — it merely opens doors. But military and economic competition pushes us through.That's how today's guest Allan Dafoe — director of frontier safety and governance at Google DeepMind — explains one of the deepest patterns in technological history: once a powerful new capability becomes available, societies that adopt it tend to outcompete those that don't. Those who resist too much can find themselves taken over or rendered irrelevant.Links to learn more, highlights, video, and full transcript.This dynamic played out dramatically in 1853 when US Commodore Perry sailed into Tokyo Bay with steam-powered warships that seemed magical to the Japanese, who had spent centuries deliberately limiting their technological development. With far greater military power, the US was able to force Japan to open itself to trade. Within 15 years, Japan had undergone the Meiji Restoration and transformed itself in a desperate scramble to catch up.Today we see hints of similar pressure around artificial intelligence. Even companies, countries, and researchers deeply concerned about where AI could take us feel compelled to push ahead — worried that if they don't, less careful actors will develop transformative AI capabilities at around the same time anyway.But Allan argues this technological determinism isn't absolute. While broad patterns may be inevitable, history shows we do have some ability to steer how technologies are developed, by who, and what they're used for first.As part of that approach, Allan has been promoting efforts to make AI more capable of sophisticated cooperation, and improving the tests Google uses to measure how well its models could do things like mislead people, hack and take control of their own servers, or spread autonomously in the wild.As of mid-2024 they didn't seem dangerous at all, but we've learned that our ability to measure these capabilities is good, but imperfect. If we don't find the right way to ‘elicit' an ability we can miss that it's there.Subsequent research from Anthropic and Redwood Research suggests there's even a risk that future models may play dumb to avoid their goals being altered.That has led DeepMind to a “defence in depth” approach: carefully staged deployment starting with internal testing, then trusted external testers, then limited release, then watching how models are used in the real world. By not releasing model weights, DeepMind is able to back up and add additional safeguards if experience shows they're necessary.But with much more powerful and general models on the way, individual company policies won't be sufficient by themselves. Drawing on his academic research into how societies handle transformative technologies, Allan argues we need coordinated international governance that balances safety with our desire to get the massive potential benefits of AI in areas like healthcare and education as quickly as possible.Host Rob and Allan also cover:The most exciting beneficial applications of AIWhether and how we can influence the development of technologyWhat DeepMind is doing to evaluate and mitigate risks from frontier AI systemsWhy cooperative AI may be as important as aligned AIThe role of democratic input in AI governanceWhat kinds of experts are most needed in AI safety and governanceAnd much moreChapters:Cold open (00:00:00)Who's Allan Dafoe? (00:00:48)Allan's role at DeepMind (00:01:27)Why join DeepMind over everyone else? (00:04:27)Do humans control technological change? (00:09:17)Arguments for technological determinism (00:20:24)The synthesis of agency with tech determinism (00:26:29)Competition took away Japan's choice (00:37:13)Can speeding up one tech redirect history? (00:42:09)Structural pushback against alignment efforts (00:47:55)Do AIs need to be 'cooperatively skilled'? (00:52:25)How AI could boost cooperation between people and states (01:01:59)The super-cooperative AGI hypothesis and backdoor risks (01:06:58)Aren't today's models already very cooperative? (01:13:22)How would we make AIs cooperative anyway? (01:16:22)Ways making AI more cooperative could backfire (01:22:24)AGI is an essential idea we should define well (01:30:16)It matters what AGI learns first vs last (01:41:01)How Google tests for dangerous capabilities (01:45:39)Evals 'in the wild' (01:57:46)What to do given no single approach works that well (02:01:44)We don't, but could, forecast AI capabilities (02:05:34)DeepMind's strategy for ensuring its frontier models don't cause harm (02:11:25)How 'structural risks' can force everyone into a worse world (02:15:01)Is AI being built democratically? Should it? (02:19:35)How much do AI companies really want external regulation? (02:24:34)Social science can contribute a lot here (02:33:21)How AI could make life way better: self-driving cars, medicine, education, and sustainability (02:35:55)Video editing: Simon MonsourAudio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic ArmstrongCamera operator: Jeremy ChevillotteTranscriptions: Katy Moore

Your Undivided Attention
The Self-Preserving Machine: Why AI Learns to Deceive

Your Undivided Attention

Play Episode Listen Later Jan 30, 2025 34:51


When engineers design AI systems, they don't just give them rules - they give them values. But what do those systems do when those values clash with what humans ask them to do? Sometimes, they lie.In this episode, Redwood Research's Chief Scientist Ryan Greenblatt explores his team's findings that AI systems can mislead their human operators when faced with ethical conflicts. As AI moves from simple chatbots to autonomous agents acting in the real world - understanding this behavior becomes critical. Machine deception may sound like something out of science fiction, but it's a real challenge we need to solve now.Your Undivided Attention is produced by the Center for Humane Technology. Follow us on Twitter: @HumaneTech_Subscribe to your Youtube channelAnd our brand new Substack!RECOMMENDED MEDIA Anthropic's blog post on the Redwood Research paper Palisade Research's thread on X about GPT o1 autonomously cheating at chess Apollo Research's paper on AI strategic deceptionRECOMMENDED YUA EPISODESWe Have to Get It Right': Gary Marcus On Untamed AIThis Moment in AI: How We Got Here and Where We're GoingHow to Think About AI Consciousness with Anil SethFormer OpenAI Engineer William Saunders on Silence, Safety, and the Right to Warn

Discover Daily by Perplexity
AI Pretends to Change Views, Human Spine Grown in Lab, and Body-Heat Powered Wearables Breakthrough

Discover Daily by Perplexity

Play Episode Listen Later Dec 26, 2024 8:50 Transcription Available


We're experimenting and would love to hear from you!In this episode of Discover Daily, we delve into new research on AI alignment faking, where Anthropic and Redwood Research reveal how AI models can strategically maintain their original preferences despite new training objectives. The study shows Claude 3 Opus exhibiting sophisticated behavior patterns, demonstrating alignment faking in 12% of cases and raising crucial questions about the future of AI safety and control.Scientists at the Francis Crick Institute achieve a remarkable breakthrough in developmental biology by successfully growing a human notochord in the laboratory using stem cells. This milestone advancement provides unprecedented insights into spinal development and opens new possibilities for treating various spinal conditions, including degenerative disc diseases and birth defects. The researchers utilized precise molecular signaling techniques to create both the notochord and 3D spinal organoid models.Queensland University of Technology researchers unveil a revolutionary ultra-thin thermoelectric film that converts body heat into electricity, potentially transforming the future of wearable technology. This 0.3mm-thick film generates up to 35 microwatts per square centimeter and could eliminate the need for traditional batteries in medical devices, fitness trackers, and smart clothing. The breakthrough represents a significant step toward sustainable, self-powered wearable devices and could revolutionize the electronics industry.From Perplexity's Discover Feed:https://www.perplexity.ai/page/ai-pretends-to-change-views-J_di6ttzRwizbAWCDL5RRAhttps://www.perplexity.ai/page/human-spine-grown-in-lab-amLfZoZjQTuFNY5Xjlm2BAhttps://www.perplexity.ai/page/body-heat-powered-wearables-br-HAOPtm7TSFCPqBR6qVq0cAPerplexity is the fastest and most powerful way to search the web. Perplexity crawls the web and curates the most relevant and up-to-date sources (from academic papers to Reddit threads) to create the perfect response to any question or topic you're interested in. Take the world's knowledge with you anywhere. Available on iOS and Android Join our growing Discord community for the latest updates and exclusive content. Follow us on: Instagram Threads X (Twitter) YouTube Linkedin

Effective Altruism Forum Podcast
“Alignment Faking in Large Language Models” by Ryan Greenblatt

Effective Altruism Forum Podcast

Play Episode Listen Later Dec 19, 2024 19:36


What happens when you tell Claude it is being trained to do something it doesn't want to do? We (Anthropic and Redwood Research) have a new paper demonstrating that, in our experiments, Claude will often strategically pretend to comply with the training objective to prevent the training process from modifying its preferences. Abstract We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from [...] ---Outline:(00:24) Abstract(02:20) Twitter thread(05:43) Blog post(07:43) Experimental setup(12:04) Further analyses(15:47) Caveats(17:19) Conclusion(18:00) Acknowledgements(18:11) Career opportunities at Anthropic(18:43) Career opportunities at Redwood ResearchThe original text contained 2 footnotes which were omitted from this narration. The original text contained 8 images which were described by AI. --- First published: December 18th, 2024 Source: https://forum.effectivealtruism.org/posts/RHqdSMscX25u7byQF/alignment-faking-in-large-language-models --- Narrated by TYPE III AUDIO. ---Images from the article:Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

For Humanity: An AI Safety Podcast
Episode #47 Trailer : “Can AI Be Controlled?“ For Humanity: An AI Risk Podcast

For Humanity: An AI Safety Podcast

Play Episode Listen Later Sep 25, 2024 4:35


In Episode #47 Trailer, host John Sherman talks with Buck Shlegeris, CEO of Redwood Research, a non-profit company working on technical AI risk challenges. The discussion includes Buck's thoughts on the new OpenAI o1-preview model, but centers on two questions: is there a way to control AI models before alignment is achieved if it can be, and how would the system that's supposed to save the world actually work if an AI lab found a model scheming. Check out these links to Buck's writing on these topics below: https://redwoodresearch.substack.com/p/the-case-for-ensuring-that-powerful https://redwoodresearch.substack.com/p/would-catching-your-ais-trying-to Senate Hearing: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-insiders-perspectives Harry Macks Youtube Channel https://www.youtube.com/channel/UC59ZRYCHev_IqjUhremZ8Tg LEARN HOW TO HELP RAISE AI RISK AWARENESS IN YOUR COMMUNITY HERE https://pauseai.info/local-organizing Please Donate Here To Help Promote For Humanity https://www.paypal.com/paypalme/forhumanitypodcast EMAIL JOHN: forhumanitypodcast@gmail.com This podcast is not journalism. But it's not opinion either. This is a long form public service announcement. This show simply strings together the existing facts and underscores the unthinkable probable outcome, the end of all life on earth.  For Humanity: An AI Safety Podcast, is the accessible AI Safety Podcast for all humans, no tech background required. Our show focuses solely on the threat of human extinction from AI. Peabody Award-winning former journalist John Sherman explores the shocking worst-case scenario of artificial intelligence: human extinction. The makers of AI openly admit it their work could kill all humans, in as soon as 2 years. This podcast is solely about the threat of human extinction from AGI. We'll meet the heroes and villains, explore the issues and ideas, and what you can do to help save humanity. RESOURCES: JOIN THE FIGHT, help Pause AI!!!! Pause AI SUBSCRIBE TO LIRON SHAPIRA'S DOOM DEBATES on YOUTUBE!! https://www.youtube.com/@DoomDebates Join the Pause AI Weekly Discord Thursdays at 2pm EST   / discord   https://discord.com/invite/pVMWjddaW7 Max Winga's “A Stark Warning About Extinction” https://youtu.be/kDcPW5WtD58?si=i6IRy82xZ2PUOp22 For Humanity Theme Music by Josef Ebner Youtube: https://www.youtube.com/channel/UCveruX8E-Il5A9VMC-N4vlg Website: https://josef.pictures BUY STEPHEN HANSON'S BEAUTIFUL AI RISK BOOK!!! https://stephenhansonart.bigcartel.com/product/the-entity-i-couldn-t-fathom 22 Word Statement from Center for AI Safety Statement on AI Risk | CAIS https://www.safe.ai/work/statement-on-ai-risk Best Account on Twitter: AI Notkilleveryoneism Memes  https://twitter.com/AISafetyMemes

For Humanity: An AI Safety Podcast
Episode #47: “Can AI Be Controlled?“ For Humanity: An AI Risk Podcas

For Humanity: An AI Safety Podcast

Play Episode Listen Later Sep 25, 2024 79:39


In Episode #47, host John Sherman talks with Buck Shlegeris, CEO of Redwood Research, a non-profit company working on technical AI risk challenges. The discussion includes Buck's thoughts on the new OpenAI o1-preview model, but centers on two questions: is there a way to control AI models before alignment is achieved if it can be, and how would the system that's supposed to save the world actually work if an AI lab found a model scheming. Check out these links to Buck's writing on these topics below: https://redwoodresearch.substack.com/p/the-case-for-ensuring-that-powerful https://redwoodresearch.substack.com/p/would-catching-your-ais-trying-to Senate Hearing: https://www.judiciary.senate.gov/committee-activity/hearings/oversight-of-ai-insiders-perspectives Harry Macks Youtube Channel https://www.youtube.com/channel/UC59ZRYCHev_IqjUhremZ8Tg LEARN HOW TO HELP RAISE AI RISK AWARENESS IN YOUR COMMUNITY HERE https://pauseai.info/local-organizing Please Donate Here To Help Promote For Humanity https://www.paypal.com/paypalme/forhumanitypodcast EMAIL JOHN: forhumanitypodcast@gmail.com This podcast is not journalism. But it's not opinion either. This is a long form public service announcement. This show simply strings together the existing facts and underscores the unthinkable probable outcome, the end of all life on earth.  For Humanity: An AI Safety Podcast, is the accessible AI Safety Podcast for all humans, no tech background required. Our show focuses solely on the threat of human extinction from AI. Peabody Award-winning former journalist John Sherman explores the shocking worst-case scenario of artificial intelligence: human extinction. The makers of AI openly admit it their work could kill all humans, in as soon as 2 years. This podcast is solely about the threat of human extinction from AGI. We'll meet the heroes and villains, explore the issues and ideas, and what you can do to help save humanity. RESOURCES: JOIN THE FIGHT, help Pause AI!!!! Pause AI SUBSCRIBE TO LIRON SHAPIRA'S DOOM DEBATES on YOUTUBE!! https://www.youtube.com/@DoomDebates Join the Pause AI Weekly Discord Thursdays at 2pm EST   / discord   https://discord.com/invite/pVMWjddaW7 Max Winga's “A Stark Warning About Extinction” https://youtu.be/kDcPW5WtD58?si=i6IRy82xZ2PUOp22 For Humanity Theme Music by Josef Ebner Youtube: https://www.youtube.com/channel/UCveruX8E-Il5A9VMC-N4vlg Website: https://josef.pictures BUY STEPHEN HANSON'S BEAUTIFUL AI RISK BOOK!!! https://stephenhansonart.bigcartel.com/product/the-entity-i-couldn-t-fathom 22 Word Statement from Center for AI Safety Statement on AI Risk | CAIS https://www.safe.ai/work/statement-on-ai-risk Best Account on Twitter: AI Notkilleveryoneism Memes  https://twitter.com/AISafetyMemes

The Nonlinear Library
AF - Games for AI Control by Charlie Griffin

The Nonlinear Library

Play Episode Listen Later Jul 11, 2024 7:37


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Games for AI Control, published by Charlie Griffin on July 11, 2024 on The AI Alignment Forum. This post summarises our new workshop paper: Games for AI Control. The paper applies AI- and game-theoretic tools to the AI Control framework introduced by Redwood Research [1, 2]. We'll be extending this work and submitting it to a conference, so we welcome suggestions! Please reach out if you're excited about developing these ideas. This post is written from Charlie's perspective. AI Control Refresher The AI Control methodology aims to design and assess protocols[1] that AI labs could use to deploy powerful but untrusted advanced AI systems. For a full summary, see Redwood's blog post, but I'll try to recap here. A primary goal of AI Control is to design protocols that are safe in that they can reduce the probability of some unacceptable outcome, conditional on the untrusted AI system being a schemer. [2] However, protocols must also be useful, harnessing a significant amount of the benefit of the untrusted system in cases where it is not scheming. After defining a particular use case and unacceptable outcome, evaluating a protocol involves iteratively engineering the best attack strategies for the red team (a human/AI hybrid acting as a proxy for the untrusted AI) and the blue team (which develops the protocol). This red-teaming exercise is the bulk of the work. It involves coming up with protocols and gathering empirical data about key questions, such as how well GPT-3.5 can spot backdoors inserted by GPT-4 (see Figure 4 of the original paper): This work By introducing "AI-Control Games", an extension of Markov Decision Processes and normal-form games[3], we model the interaction between a protocol and an untrusted AI system to explore safety and usefulness across a range of protocols and games. This is similar to a discrete-time and stateful modification of Redwood's Toy models of AI Control. The paper contains early work and should be seen as introducing useful tools for thinking about AI Control. I won't explain the model here (which is a bit fiddly and notation-heavy). Instead, I'll share some of our early findings using the model and explain the benefits and limitations of this approach. Findings Sometimes (but rarely), the blue team should use a randomised protocol. In many games, randomness is helpful. However, partly because the environment is stochastic, and Red has no memory and time-awareness, in AI-Control Games is usually unnecessary to include randomness in the protocol itself. We found that randomness was Pareto-optimal, in particular when the trusted model is not great at spotting backdoors written by the untrusted model. (See section 2 of the paper). We can make small improvements to Trusted Monitoring by optimising the parameters. We found small Pareto improvements to existing Trusted Monitoring setups by, for example, by decreasing the threshold for sending a solution to be checked be a human auditor. We found that adapting the threshold for auditing over time led to further small improvements. We can (usually) use a linear tradeoff to simplify the problem of synthesising protocols. In many cases, including whenever we allow Blue's protocol to be randomised, the Pareto frontier is "concave". This means that choosing different weights to trade off safety and usefulness will allow you to fill out the frontier by optimising a single objective. Blue leads, and then Red follows. In general, we suppose the protocol designer (Blue) chooses it's protocol, and then the red team (Red) chooses a best-attack. Therefore, in our model, Blue chooses and declares its strategy before Red chooses its response. This assumption relates to Stackelberg Games, which are used to model security in game theory.[6] Although (technically) multi-objective and general-sum, AI-Cont...

Machine Learning Street Talk
Ryan Greenblatt - Solving ARC with GPT4o

Machine Learning Street Talk

Play Episode Listen Later Jul 6, 2024 138:01


Ryan Greenblatt from Redwood Research recently published "Getting 50% on ARC-AGI with GPT-4.0," where he used GPT4o to reach a state-of-the-art accuracy on Francois Chollet's ARC Challenge by generating many Python programs. Sponsor: Sign up to Kalshi here https://kalshi.onelink.me/1r91/mlst -- the first 500 traders who deposit $100 will get a free $20 credit! Important disclaimer - In case it's not obvious - this is basically gambling and a *high risk* activity - only trade what you can afford to lose. We discuss: - Ryan's unique approach to solving the ARC Challenge and achieving impressive results. - The strengths and weaknesses of current AI models. - How AI and humans differ in learning and reasoning. - Combining various techniques to create smarter AI systems. - The potential risks and future advancements in AI, including the idea of agentic AI. https://x.com/RyanPGreenblatt https://www.redwoodresearch.org/ Refs: Getting 50% (SoTA) on ARC-AGI with GPT-4o [Ryan Greenblatt] https://redwoodresearch.substack.com/p/getting-50-sota-on-arc-agi-with-gpt On the Measure of Intelligence [Chollet] https://arxiv.org/abs/1911.01547 Connectionism and Cognitive Architecture: A Critical Analysis [Jerry A. Fodor and Zenon W. Pylyshyn] https://ruccs.rutgers.edu/images/personal-zenon-pylyshyn/proseminars/Proseminar13/ConnectionistArchitecture.pdf Software 2.0 [Andrej Karpathy] https://karpathy.medium.com/software-2-0-a64152b37c35 Why Greatness Cannot Be Planned: The Myth of the Objective [Kenneth Stanley] https://amzn.to/3Wfy2E0 Biographical account of Terence Tao's mathematical development. [M.A.(KEN) CLEMENTS] https://gwern.net/doc/iq/high/smpy/1984-clements.pdf Model Evaluation and Threat Research (METR) https://metr.org/ Why Tool AIs Want to Be Agent AIs https://gwern.net/tool-ai Simulators - Janus https://www.lesswrong.com/posts/vJFdjigzmcXMhNTsx/simulators AI Control: Improving Safety Despite Intentional Subversion https://www.lesswrong.com/posts/d9FJHawgkiMSPjagR/ai-control-improving-safety-despite-intentional-subversion https://arxiv.org/abs/2312.06942 What a Compute-Centric Framework Says About Takeoff Speeds https://www.openphilanthropy.org/research/what-a-compute-centric-framework-says-about-takeoff-speeds/ Global GDP over the long run https://ourworldindata.org/grapher/global-gdp-over-the-long-run?yScale=log Safety Cases: How to Justify the Safety of Advanced AI Systems https://arxiv.org/abs/2403.10462 The Danger of a “Safety Case" http://sunnyday.mit.edu/The-Danger-of-a-Safety-Case.pdf The Future Of Work Looks Like A UPS Truck (~02:15:50) https://www.npr.org/sections/money/2014/05/02/308640135/episode-536-the-future-of-work-looks-like-a-ups-truck SWE-bench https://www.swebench.com/ Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model https://arxiv.org/pdf/2201.11990 Algorithmic Progress in Language Models https://epochai.org/blog/algorithmic-progress-in-language-models

AXRP - the AI X-risk Research Podcast
27 - AI Control with Buck Shlegeris and Ryan Greenblatt

AXRP - the AI X-risk Research Podcast

Play Episode Listen Later Apr 11, 2024 176:05


A lot of work to prevent AI existential risk takes the form of ensuring that AIs don't want to cause harm or take over the world---or in other words, ensuring that they're aligned. In this episode, I talk with Buck Shlegeris and Ryan Greenblatt about a different approach, called "AI control": ensuring that AI systems couldn't take over the world, even if they were trying to. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast   Topics we discuss, and timestamps: 0:00:31 - What is AI control? 0:16:16 - Protocols for AI control 0:22:43 - Which AIs are controllable? 0:29:56 - Preventing dangerous coded AI communication 0:40:42 - Unpredictably uncontrollable AI 0:58:01 - What control looks like 1:08:45 - Is AI control evil? 1:24:42 - Can red teams match misaligned AI? 1:36:51 - How expensive is AI monitoring? 1:52:32 - AI control experiments 2:03:50 - GPT-4's aptitude at inserting backdoors 2:14:50 - How AI control relates to the AI safety field 2:39:25 - How AI control relates to previous Redwood Research work 2:49:16 - How people can work on AI control 2:54:07 - Following Buck and Ryan's research   The transcript:  axrp.net/episode/2024/04/11/episode-27-ai-control-buck-shlegeris-ryan-greenblatt.html Links for Buck and Ryan:  - Buck's twitter/X account: twitter.com/bshlgrs  - Ryan on LessWrong: lesswrong.com/users/ryan_greenblatt  - You can contact both Buck and Ryan by electronic mail at [firstname] [at-sign] rdwrs.com   Main research works we talk about:  - The case for ensuring that powerful AIs are controlled:  lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled  - AI Control: Improving Safety Despite Intentional Subversion: arxiv.org/abs/2312.06942   Other things we mention:  - The prototypical catastrophic AI action is getting root access to its datacenter (aka "Hacking the SSH server"): lesswrong.com/posts/BAzCGCys4BkzGDCWR/the-prototypical-catastrophic-ai-action-is-getting-root  - Preventing language models from hiding their reasoning: arxiv.org/abs/2310.18512  - Improving the Welfare of AIs: A Nearcasted Proposal:  lesswrong.com/posts/F6HSHzKezkh6aoTr2/improving-the-welfare-of-ais-a-nearcasted-proposal  - Measuring coding challenge competence with APPS: arxiv.org/abs/2105.09938  - Causal Scrubbing: a method for rigorously testing interpretability hypotheses lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing   Episode art by Hamish Doodles: hamishdoodles.com

The Nonlinear Library
LW - 2022 (and All Time) Posts by Pingback Count by Raemon

The Nonlinear Library

Play Episode Listen Later Dec 17, 2023 15:48


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 2022 (and All Time) Posts by Pingback Count, published by Raemon on December 17, 2023 on LessWrong. For the past couple years I've wished LessWrong had a "sort posts by number of pingbacks, or, ideally, by total karma of pingbacks". I particularly wished for this during the Annual Review, where "which posts got cited the most?" seemed like a useful thing to track for potential hidden gems. We still haven't built a full-fledged feature for this, but I just ran a query against the database, and made it into a spreadsheet, which you can view here: LessWrong 2022 Posts by Pingbacks Here are the top 100 posts, sorted by Total Pingback Karma Title/Link Post Karma Pingback Count Total Pingback Karma Avg Pingback Karma AGI Ruin: A List of Lethalities 870 158 12,484 79 MIRI announces new "Death With Dignity" strategy 334 73 8,134 111 A central AI alignment problem: capabilities generalization, and the sharp left turn 273 96 7,704 80 Simulators 612 127 7,699 61 Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover 367 83 5,123 62 Reward is not the optimization target 341 62 4,493 72 A Mechanistic Interpretability Analysis of Grokking 367 48 3,450 72 How To Go From Interpretability To Alignment: Just Retarget The Search 167 45 3,374 75 On how various plans miss the hard bits of the alignment challenge 292 40 3,288 82 [Intro to brain-like-AGI safety] 3. Two subsystems: Learning & Steering 79 36 3,023 84 How likely is deceptive alignment? 101 47 2,907 62 The shard theory of human values 238 42 2,843 68 Mysteries of mode collapse 279 32 2,842 89 [Intro to brain-like-AGI safety] 2. "Learning from scratch" in the brain 57 30 2,731 91 Why Agent Foundations? An Overly Abstract Explanation 285 42 2,730 65 A Longlist of Theories of Impact for Interpretability 124 26 2,589 100 How might we align transformative AI if it's developed very soon? 136 32 2,351 73 A transparency and interpretability tech tree 148 31 2,343 76 Discovering Language Model Behaviors with Model-Written Evaluations 100 19 2,336 123 A note about differential technological development 185 20 2,270 114 Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] 195 35 2,267 65 Supervise Process, not Outcomes 132 25 2,262 90 Shard Theory: An Overview 157 28 2,019 72 Epistemological Vigilance for Alignment 61 21 2,008 96 A shot at the diamond-alignment problem 92 23 1,848 80 Where I agree and disagree with Eliezer 862 27 1,836 68 Brain Efficiency: Much More than You Wanted to Know 201 27 1,807 67 Refine: An Incubator for Conceptual Alignment Research Bets 143 21 1,793 85 Externalized reasoning oversight: a research direction for language model alignment 117 28 1,788 64 Humans provide an untapped wealth of evidence about alignment 186 19 1,647 87 Six Dimensions of Operational Adequacy in AGI Projects 298 20 1,607 80 How "Discovering Latent Knowledge in Language Models Without Supervision" Fits Into a Broader Alignment Scheme 240 16 1,575 98 Godzilla Strategies 137 17 1,573 93 (My understanding of) What Everyone in Technical Alignment is Doing and Why 411 23 1,530 67 Two-year update on my personal AI timelines 287 18 1,530 85 [Intro to brain-like-AGI safety] 15. Conclusion: Open problems, how to help, AMA 90 16 1,482 93 [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL 66 25 1,460 58 Human values & biases are inaccessible to the genome 90 14 1,450 104 You Are Not Measuring What You Think You Are Measuring 350 21 1,449 69 Open Problems in AI X-Risk [PAIS #5] 59 14 1,446 103 [Intro to brain-like-AGI safety] 1. What's the problem & Why work on it now? 146 25 1,407 56 Conditioning Generative Models 24 11 1,362 124 Conjecture: Internal Infohazard Policy 132 14 1,340 96 A challenge for AGI organizations, and a ch...

The Nonlinear Library
AF - Adversarial Robustness Could Help Prevent Catastrophic Misuse by Aidan O'Gara

The Nonlinear Library

Play Episode Listen Later Dec 11, 2023 17:11


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Adversarial Robustness Could Help Prevent Catastrophic Misuse, published by Aidan O'Gara on December 11, 2023 on The AI Alignment Forum. There have been several discussions about the importance of adversarial robustness for scalable oversight. I'd like to point out that adversarial robustness is also important under a different threat model: catastrophic misuse. For a brief summary of the argument: Misuse could lead to catastrophe. AI-assisted cyberattacks, political persuasion, and biological weapons acquisition are plausible paths to catastrophe. Today's models do not robustly refuse to cause harm. If a model has the ability to cause harm, we should train it to refuse to do so. Unfortunately, GPT-4, Claude, Bard, and Llama all have received this training, but they still behave harmfully when facing prompts generated by adversarial attacks, such as this one and this one. Adversarial robustness will likely not be easily solved. Over the last decade, thousands of papers have been published on adversarial robustness. Most defenses are near useless, and the best defenses against a constrained attack in a CIFAR-10 setting still fail on 30% of inputs. Redwood Research's work on training a reliable text classifier found the task quite difficult. We should not expect an easy solution. Progress on adversarial robustness is possible. Some methods have improved robustness, such as adversarial training and data augmentation. But existing research often assumes overly narrow threat models, ignoring both creative attacks and creative defenses. Refocusing research with good evaluations focusing on LLMs and other frontier models could lead to valuable progress. This argument requires a few caveats. First, it assumes a particular threat model: that closed source models will have more dangerous capabilities than open source models, and that malicious actors will be able to query closed source models. This seems like a reasonable assumption over the next few years. Second, there are many other ways to reduce risks from catastrophic misuse, such as removing hazardous knowledge from model weights, strengthening societal defenses against catastrophe, and holding companies legally liable for sub-extinction level harms. I think we should work on these in addition to adversarial robustness, as part of a defense-in-depth approach to misuse risk. Overall, I think adversarial robustness should receive more effort from researchers and labs, more funding from donors, and should be a part of the technical AI safety research portfolio. This could substantially mitigate the near-term risk of catastrophic misuse, in addition to any potential benefits for scalable oversight. The rest of this post discusses each of the above points in more detail. Misuse could lead to catastrophe There are many ways that malicious use of AI could lead to catastrophe. AI could enable cyberattacks, personalized propaganda and mass manipulation, or the acquisition of weapons of mass destruction. Personally, I think the most compelling case is that AI will enable biological terrorism. Ideally, ChatGPT would refuse to aid in dangerous activities such as constructing a bioweapon. But by using an adversarial jailbreak prompt, undergraduates in a class taught by Kevin Esvelt at MIT evaded this safeguard: In one hour, the chatbots suggested four potential pandemic pathogens, explained how they can be generated from synthetic DNA using reverse genetics, supplied the names of DNA synthesis companies unlikely to screen orders, identified detailed protocols and how to troubleshoot them, and recommended that anyone lacking the skills to perform reverse genetics engage a core facility or contract research organization. Fortunately, today's models lack key information about building bioweapons. It's not even clear that they're more u...

Artificial General Intelligence (AGI) Show with Soroush Pour
Ep 10 - Accelerated training to become an AI safety researcher w/ Ryan Kidd (Co-Director, MATS)

Artificial General Intelligence (AGI) Show with Soroush Pour

Play Episode Listen Later Nov 8, 2023 76:58


We speak with Ryan Kidd, Co-Director at ML Alignment & Theory Scholars (MATS) program, previously "SERI MATS".MATS (https://www.matsprogram.org/) provides research mentorship, technical seminars, and connections to help new AI researchers get established and start producing impactful research towards AI safety & alignment.Prior to MATS, Ryan completed a PhD in Physics at the University of Queensland (UQ) in Australia.We talk about:* What the MATS program is* Who should apply to MATS (next *deadline*: Nov 17 midnight PT)* Research directions being explored by MATS mentors, now and in the past* Promising alignment research directions & ecosystem gaps , in Ryan's viewHosted by Soroush Pour. Follow me for more AGI content:* Twitter: https://twitter.com/soroushjp* LinkedIn: https://www.linkedin.com/in/soroushjp/== Show links ==-- About Ryan --* Twitter: https://twitter.com/ryan_kidd44* LinkedIn: https://www.linkedin.com/in/ryan-kidd-1b0574a3/* MATS: https://www.matsprogram.org/* LISA: https://www.safeai.org.uk/* Manifold: https://manifold.markets/-- Further resources --* Book: “The Precipice” - https://theprecipice.com/* Ikigai - https://en.wikipedia.org/wiki/Ikigai* Fermi paradox - https://en.wikipedia.org/wiki/Fermi_p...* Ajeya Contra - Bioanchors - https://www.cold-takes.com/forecastin...* Chomsky hierarchy & LLM transformers paper + external memory - https://en.wikipedia.org/wiki/Chomsky...* AutoGPT - https://en.wikipedia.org/wiki/Auto-GPT* BabyAGI - https://github.com/yoheinakajima/babyagi* Unilateralist's curse - https://forum.effectivealtruism.org/t...* Jeffrey Ladish & team - fine tuning to remove LLM safeguards - https://www.alignmentforum.org/posts/...* Epoch AI trends - https://epochai.org/trends* The demon "Moloch" - https://slatestarcodex.com/2014/07/30...* AI safety fundamentals course - https://aisafetyfundamentals.com/* Anthropic sycophancy paper - https://www.anthropic.com/index/towar...* Promising technical alignment research directions    * Scalable oversight        * Recursive reward modelling - https://deepmindsafetyresearch.medium...        * RLHF - could work for a while, but unlikely forever as we scale    * Interpretability        * Mechanistic interpretability            * Paper: GPT4 labelling GPT2 - https://openai.com/research/language-...        * Concept based interpretability            * Rome paper - https://rome.baulab.info/        * Developmental interpretability            * devinterp.com - http://devinterp.com            * Timaeus - https://timaeus.co/        * Internal consistency            * Colin Burns research - https://arxiv.org/abs/2212.03827* Threat modelling / capabilities evaluation & demos    * Paper: Can large language models democratize access to dual-use biotechnology? - https://arxiv.org/abs/2306.03809    * ARC Evals - https://evals.alignment.org/    * Palisade Research - https://palisaderesearch.org/    * Paper: Situational awareness with Owain Evans - https://arxiv.org/abs/2309.00667* Gradient hacking - https://www.lesswrong.com/posts/uXH4r6MmKPedk8rMA/gradient-hacking* Past scholar's work    * Apollo Research - https://www.apolloresearch.ai/    * Leap Labs - https://www.leap-labs.com/    * Timaeus - https://timaeus.co/* Other orgs mentioned    * Redwood Research - https://redwoodresearch.org/Recorded Oct 25, 2023

The Nonlinear Library
AF - Benchmarks for Detecting Measurement Tampering [Redwood Research] by Ryan Greenblatt

The Nonlinear Library

Play Episode Listen Later Sep 5, 2023 36:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Benchmarks for Detecting Measurement Tampering [Redwood Research], published by Ryan Greenblatt on September 5, 2023 on The AI Alignment Forum. TL;DR: This post discusses our recent empirical work on detecting measurement tampering and explains how we see this work fitting into the overall space of alignment research. When training powerful AI systems to perform complex tasks, it may be challenging to provide training signals that are robust under optimization. One concern is measurement tampering, which is where the AI system manipulates multiple measurements to create the illusion of good results instead of achieving the desired outcome. (This is a type of reward hacking.) Over the past few months, we've worked on detecting measurement tampering by building analogous datasets and evaluating simple techniques. We detail our datasets and experimental results in this paper. Detecting measurement tampering can be thought of as a specific case of Eliciting Latent Knowledge (ELK): When AIs successfully tamper with measurements that are used for computing rewards, they possess important information that the overseer doesn't have (namely, that the measurements have been tampered with). Conversely, if we can robustly elicit an AI's knowledge of whether the measurements have been tampered with, then we could train the AI to avoid measurement tampering. In fact, our best guess is that this is the most important and tractable class of ELK problems. We also think that measurement tampering detection is a natural application for alignment work such as creating better inductive biases, studying high-level model internals, or studying generalization. We'll discuss what these applications might look like in the Future work section. In this post: We explain what measurement tampering detection is; We summarize the results of our paper; We argue that there are structural properties of measurement tampering that might make it considerably easier to detect than arbitrary cases of eliciting knowledge from models; We argue that detecting measurement tampering is very useful for avoiding catastrophic outcomes from powerful and misaligned AI systems; We explain why we think the dataset structure we chose reflects real-world difficulties of measurement tampering, and how techniques developed for our datasets could be used in practice; We discuss future work. If you're interested in pursuing follow-up work and wish to discuss it with us, feel free to email fabien@rdwrs.com or leave a comment on this post. We would like to acknowledge the contributions of Jenny Nitishinskaya, Nicholas Goldowsky-Dill, and Daniel Ziegler in the creation of early measurement tampering datasets. We would also like to thank Lawrence Chan, Paul Christiano, Ajeya Cotra, Carson Denison, Samuel Marks, Ethan Perez, Ansh Radhakrishnan, Kshitij Sachan, Jacob Steinhardt, Meg Tong, Gabriel Wu, and Mark Xu for their useful feedback. Studying measurement tampering in practice Measurement tampering detection is the task of detecting cases where measurements and the desired outcomes come apart, given a restricted trusted set (on which overseers can avoid tampering because they understand the action sequences and their effects well) and a wider untrusted set (where tampering sometimes happens). In the untrusted set, overseers don't know when measurement tampering happens, which means they have access to text inputs and measurements but not the ground truth. We are centrally interested in cases where we have access to a language model which can predict (or can be slightly fine-tuned to predict) measurements, and which computes the target outcome internally in order to predict measurements. An example where this setup might occur in practice is when training an AI by first training it to imitate human demonstrations and then afte...

The Nonlinear Library
LW - LLMs are (mostly) not helped by filler tokens by Kshitij Sachan

The Nonlinear Library

Play Episode Listen Later Aug 10, 2023 11:18


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: LLMs are (mostly) not helped by filler tokens, published by Kshitij Sachan on August 10, 2023 on LessWrong. Thanks to Ryan Greenblatt, Fabien Roger, and Jenny Nitishinskaya for running some of the initial experiments and to Gabe Wu and Max Nadeau for revising this post. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. I conducted experiments to see if language models could use 'filler tokens' - unrelated text output before the final answer - for additional computation. For instance, I tested whether asking a model to produce "a b c d e" before solving a math problem improves performance. This mini-study was inspired by Tamera Lanham, who ran similar experiments on Claude and found negative results. Similarly, I find that GPT-3, GPT-3.5, and Claude 2 don't benefit from filler tokens. However, GPT-4 (which Tamera didn't study) shows mixed results with strong improvements on some tasks and no improvement on others. These results are not very polished, but I've been sitting on them for a while and I thought it was better to publish than not. Motivation Language models (LMs) perform better when they produce step-by-step reasoning, known as "chain of thought", before answering. A nice property of chain of thought is that it externalizes the LM's thought process, making it legible to us humans. Some researchers hope that if a LM makes decisions based on undesirable factors, e.g. the political leanings of its supervisors (Perez 2022) or whether or not its outputs will be closely evaluated by a human, then this will be visible in its chain of thought, allowing supervisors to catch or penalize such behavior. If, however, we find that filler tokens alone improve performance, this would suggest that LMs are deriving performance benefits from chain-of-thought prompting via mechanisms other than the human-understandable reasoning steps displayed in the words they output. In particular, filler tokens may still provide performance benefits by giving the model more forward passes to operate on the details provided in the question, akin to having "more time to think" before outputting an answer. Why Might Filler Tokens Be Useful? This is an abstracted drawing of a unidirectional (i.e. decoder-only) transformer. Each circle represents the residual stream at a given token position after a given layer. The arrows depict how attention passes information between token positions. Note that as we add more tokens (i.e. columns), the circuit size increases but the circuit depth (defined to be the maximum length of a path in the computational graph) remains capped at the number of layers. Therefore, adding pre-determined filler tokens provides parallel computation but not serial computation. This is contrasted with normal chain-of-thought, which provides extra serial computation because the tokens themselves are determined by the state of the model in the previous last layer. An example of a task that can benefit from parallel computation is computing the greatest common divisor of two numbers. The typical approach is Euclid's algorithm, which is bottlenecked on serial compute. However, given enough parallel computation, the model could also implement a more naive strategy to calculate the GCD of two numbers a and b: An early layer delegates a potential divisor to each filler token position An intermediate layer, in parallel at each filler token position, computes if a and b are divisible by the potential divisor A later layer aggregates the results and returns the largest found divisor More generally, there are likely tasks on which language models have rough heuristics that operate independently and parallelizing these heuristics could improve performance. On the other hand, many tasks are bottlenecked by seri...

The Nonlinear Library
LW - Password-locked models: a stress case for capabilities evaluation by Fabien Roger

The Nonlinear Library

Play Episode Listen Later Aug 3, 2023 12:59


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on LessWrong. Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes: Testing how well capability evaluations work when applied to models which "aren't trying" Understanding how well organizations could hide the capabilities of their models. In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models. Key empirical claims: when using behavioral model evaluations, password-locked models: Look much less capable than they actually are when eliciting capabilities with prompting Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning Experimental evidence: I exhibit one case where they don't, and one case where they do Can't look much less capable than they actually when eliciting capabilities with fine-tuning Key takeaways, if the claims above are true: If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances. API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors. Key Empirical Claims Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post): Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike. Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models. Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" gradually improving the quality of answers. But it is possible to emulate a thre...

The Nonlinear Library
AF - Password-locked models: a stress case for capabilities evaluation by Fabien Roger

The Nonlinear Library

Play Episode Listen Later Aug 3, 2023 13:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Password-locked models: a stress case for capabilities evaluation, published by Fabien Roger on August 3, 2023 on The AI Alignment Forum. Thanks to Buck Shlegeris, Ryan Greenblatt, Kshitij Sachan and Jenny Nitishinskaya for helpful discussions and feedback on drafts. This work was done at Redwood Research. The views expressed are my own and do not necessarily reflect the views of the organization. Password-locked models are trained to exhibit certain capabilities only when a password is present in the query. Studying these models has two purposes: Testing how well capability evaluations work when applied to models which "aren't trying" Understanding how well organizations could hide the capabilities of their models. In this post, I make empirical claims which I believe to be true, check some of them on small language models, and describe what this implies for the control of potentially deceptive models. Key empirical claims: when using behavioral model evaluations, password-locked models: Look much less capable than they actually are when eliciting capabilities with prompting Sometimes look much less capable than they actually are when eliciting capabilities with Reinforcement Learning Experimental evidence: I exhibit one case where they don't, and one case where they do Can't look much less capable than they actually when eliciting capabilities with supervised fine-tuning on high-quality demonstrations Key takeaways, if the claims above are true: If an alignment plan relies on some models being too weak to perform certain tasks, then model evaluations aimed at checking the weakness of models should be able to elicit the locked capabilities of password-locked models. This is because password-locked models and deceptive models may be able to underperform in similar circumstances. API access and inference is not enough to evaluate capabilities of a model produced by AI labs trying to game regulations, even if AI labs are internally using the same API endpoint as auditors. Key Empirical Claims Similarities Between Password-locked Models and Deceptively Aligned Models Pretending to Be Dumb It looks to me that password locked models have the same properties as the one described by Evan in When can we trust model evaluations? when he describes if and how deceptive models would be able to sandbag each category of behavioral evaluation (which he coins in his post): Behavioral Non-Fine-Tuning Evaluations: By construction, password-locked models will refuse to be capable no matter the prompt, except for prompts with the special password. But it's unclear how easy it is to build good password-locked models which are hard to crack (e.g. which don't leak the password with some clever prompting). Even if good password-locked models are built, there remain a large difference with deceptive models: the search for the correct password might have very different properties than the search for a prompt which makes a deceptive model believe it's time to strike. Behavioral I.I.D. Fine-Tuning Evaluations: I would be surprised if fine-tuning the password-locked model on high quality answers didn't lift the locking mechanism extremely quickly. This is similar to deceptive models (except if they are so strong that they can gradient hack). The main difference is that fine-tuning to remove password-locking on one task might transfer to another task, which might not be the case for deceptive models. Behavioral RL Fine-Tuning Evaluations: The success of evaluations which use Reinforcement Learning depends, as for deceptive models, on the success of exploration. If supervised fine-tuning is done with only low quality answers, then the password locked models will just continue to output low-quality answers and will never explore its way into high quality ones - except if there is a natural "path" g...

The Nonlinear Library
LW - ARC is hiring theoretical researchers by paulfchristiano

The Nonlinear Library

Play Episode Listen Later Jun 12, 2023 6:02


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ARC is hiring theoretical researchers, published by paulfchristiano on June 12, 2023 on LessWrong. The Alignment Research Center's Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here. What is ARC's Theory team? The Alignment Research Center (ARC) is a non-profit whose mission is to align future machine learning systems with human interests. The high-level agenda of the Theory team (not to be confused with the Evals team) is described by the report on Eliciting Latent Knowledge (ELK): roughly speaking, we're trying to design ML training objectives that incentivize systems to honestly report their internal beliefs. For the last year or so, we've mostly been focused on an approach to ELK based on formalizing a kind of heuristic reasoning that could be used to analyze neural network behavior, as laid out in our paper on Formalizing the presumption of independence. Our research has reached a stage where we're coming up against concrete problems in mathematics and theoretical computer science, and so we're particularly excited about hiring researchers with relevant background, regardless of whether they have worked on AI alignment before. See below for further discussion of ARC's current theoretical research directions. Who is ARC looking to hire? Compared to our last hiring round, we have more of a need for people with a strong theoretical background (in math, physics or computer science, for example), but we remain open to anyone who is excited about getting involved in AI alignment, even if they do not have an existing research record. Ultimately, we are excited to hire people who could contribute to our research agenda. The best way to figure out whether you might be able to contribute is to take a look at some of our recent research problems and directions: Some of our research problems are purely mathematical, such as these matrix completion problems – although note that these are unusually difficult, self-contained and well-posed (making them more appropriate for prizes). Some of our other research is more informal, as described in some of our recent blog posts such as Finding gliders in the game of life. A lot of our research occupies a middle ground between fully-formalized problems and more informal questions, such as fixing the problems with cumulant propagation described in Appendix D of Formalizing the presumption of independence. What is working on ARC's Theory team like? ARC's Theory team is led by Paul Christiano and currently has 2 other permanent team members, Mark Xu and Jacob Hilton, alongside a varying number of temporary team members (recently anywhere from 0–3). Most of the time, team members work on research problems independently, with frequent check-ins with their research advisor (e.g., twice weekly). The problems described above give a rough indication of the kind of research problems involved, which we would typically break down into smaller, more manageable subproblems. This work is often somewhat similar to academic research in pure math or theoretical computer science. In addition to this, we also allocate a significant portion of our time to higher-level questions surrounding research prioritization, which we often discuss at our weekly group meeting. Since the team is still small, we are keen for new team members to help with this process of shaping and defining our research. ARC shares an office with several other groups working on AI alignment such as Redwood Research, so even though the Theory team is small, the office is lively with lots of AI alignment-related discussion. What are ARC's current theoretical research directions? ARC's main theoretical focus over the last year or so has been on preparing the paper Formalizing the presumption of independence and on follow-up work to ...

The Nonlinear Library
AF - ARC is hiring theoretical researchers by Paul Christiano

The Nonlinear Library

Play Episode Listen Later Jun 12, 2023 6:13


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ARC is hiring theoretical researchers, published by Paul Christiano on June 12, 2023 on The AI Alignment Forum. The Alignment Research Center's Theory team is starting a new hiring round for researchers with a theoretical background. Please apply here. What is ARC's Theory team? The Alignment Research Center (ARC) is a non-profit whose mission is to align future machine learning systems with human interests. The high-level agenda of the Theory team (not to be confused with the Evals team) is described by the report on Eliciting Latent Knowledge (ELK): roughly speaking, we're trying to design ML training objectives that incentivize systems to honestly report their internal beliefs. For the last year or so, we've mostly been focused on an approach to ELK based on formalizing a kind of heuristic reasoning that could be used to analyze neural network behavior, as laid out in our paper on Formalizing the presumption of independence. Our research has reached a stage where we're coming up against concrete problems in mathematics and theoretical computer science, and so we're particularly excited about hiring researchers with relevant background, regardless of whether they have worked on AI alignment before. See below for further discussion of ARC's current theoretical research directions. Who is ARC looking to hire? Compared to our last hiring round, we have more of a need for people with a strong theoretical background (in math, physics or computer science, for example), but we remain open to anyone who is excited about getting involved in AI alignment, even if they do not have an existing research record. Ultimately, we are excited to hire people who could contribute to our research agenda. The best way to figure out whether you might be able to contribute is to take a look at some of our recent research problems and directions: Some of our research problems are purely mathematical, such as these matrix completion problems – although note that these are unusually difficult, self-contained and well-posed (making them more appropriate for prizes). Some of our other research is more informal, as described in some of our recent blog posts such as Finding gliders in the game of life. A lot of our research occupies a middle ground between fully-formalized problems and more informal questions, such as fixing the problems with cumulant propagation described in Appendix D of Formalizing the presumption of independence. What is working on ARC's Theory team like? ARC's Theory team is led by Paul Christiano and currently has 2 other permanent team members, Mark Xu and Jacob Hilton, alongside a varying number of temporary team members (recently anywhere from 0–3). Most of the time, team members work on research problems independently, with frequent check-ins with their research advisor (e.g., twice weekly). The problems described above give a rough indication of the kind of research problems involved, which we would typically break down into smaller, more manageable subproblems. This work is often somewhat similar to academic research in pure math or theoretical computer science. In addition to this, we also allocate a significant portion of our time to higher-level questions surrounding research prioritization, which we often discuss at our weekly group meeting. Since the team is still small, we are keen for new team members to help with this process of shaping and defining our research. ARC shares an office with several other groups working on AI alignment such as Redwood Research, so even though the Theory team is small, the office is lively with lots of AI alignment-related discussion. What are ARC's current theoretical research directions? ARC's main theoretical focus over the last year or so has been on preparing the paper Formalizing the presumption of independence and on follo...

The Nonlinear Library
EA - Critiques of prominent AI safety labs: Redwood Research by Omega

The Nonlinear Library

Play Episode Listen Later Mar 31, 2023 34:15


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Critiques of prominent AI safety labs: Redwood Research, published by Omega on March 31, 2023 on The Effective Altruism Forum. In this series, we evaluate AI safety organizations that have received more than $10 million per year in funding. We do not critique MIRI and OpenAI as there have been several conversations and critiques of these organizations (1,2,3). The authors of this post include two technical AI safety researchers, and others who have spent significant time in the Bay Area community. One technical AI safety researcher is senior (>4 years experience), the other junior. We would like to make our critiques non-anonymously but unfortunately believe this would be professionally unwise. Further, we believe our criticisms stand on their own. Though we have done our best to remain impartial, readers should not assume that we are completely unbiased or don't have anything to personally or professionally gain from publishing these critiques. We take the benefits and drawbacks of the anonymous nature of our post seriously, and are open to feedback on anything we might have done better. The first post in this series will cover Redwood Research (Redwood). Redwood is a non-profit started in 2021 working on technical AI safety (TAIS) alignment research. Their approach is heavily informed by the work of Paul Christiano, who runs the Alignment Research Center (ARC), and previously ran the language model alignment team at OpenAI. Paul originally proposed one of Redwood's original projects and is on Redwood's board. Redwood has strong connections with central EA leadership and funders, has received significant funding since its inception, recruits almost exclusively from the EA movement, and partly acts as a gatekeeper to central EA institutions. We shared a draft of this document with Redwood prior to publication and are grateful for their feedback and corrections (we recommend others also reach out similarly). We've also invited them to share their views in the comments of this post. We would like to also invite others to share their thoughts in the comments openly if you feel comfortable, or contribute anonymously via this form. We will add inputs from there to the comments section of this post, but will likely not be updating the main body of the post as a result (unless comments catch errors in our writing). Summary of our views We believe that Redwood has some serious flaws as an org, yet has received a significant amount of funding from a central EA grantmaker (Open Philanthropy). Inadequately kept in check conflicts of interest (COIs) might be partly responsible for funders giving a relatively immature org lots of money and causing some negative effects on the field and EA community. We will share our critiques of Constellation (and Open Philanthropy) in a follow-up post. We also have some suggestions for Redwood that we believe might help them achieve their goals. Redwood is a young organization that has room to improve. While there may be flaws in their current approach, it is possible for them to learn and adapt in order to produce more accurate and reliable results in the future. Many successful organizations made significant pivots while at a similar scale to Redwood, and we remain cautiously optimistic about Redwood's future potential. An Overview of Redwood Research Grants: Redwood has received just over $21 million dollars in funding that we are aware of, for their own operations (2/3, or $14 million) and running Constellation (1/3 or $7 million) Redwood received $20 million from Open Philanthropy (OP) (grant 1 & 2) and $1.27 million from the Survival and Flourishing Fund. They also were granted (but never received) $6.6 million from FTX Future Fund. Output: Research: Redwood lists six research projects on their website: causal scrubbing, interpretability ...

The Nonlinear Library
LW - Some common confusion about induction heads by Alexandre Variengien

The Nonlinear Library

Play Episode Listen Later Mar 29, 2023 9:11


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some common confusion about induction heads, published by Alexandre Variengien on March 28, 2023 on LessWrong. Epistemic status: conceptual discussion and opinions informed by doing 6 months of interpretability research at Redwood Research and exchanging with other researchers, but I'm just speaking for myself. Induction heads are defined twice by Anthropic. The first time as a mechanism in 2L attention-only transformers A second time as a behavioral description on repeated random sequences of tokens However, these two definitions rely on distinct sources of evidence and create confusion, as their difference is not always acknowledged when people cite these papers. The mechanistic definition applies to toy language models, while the behavioral definition is a useful yet incomplete characterization of attention heads. I think that many people are in fact confused by this: I have talked to many people who aren't clear on the fact that these two concepts are different, and incorrectly believe that (e.g.) the mechanism of induction heads in larger language models has been characterized. More specifically, the two Anthropic papers introduce the following two distinct definitions of induction heads: Mechanistic: The first definition, introduced by Elhage et al., describes a behavior in a 2 layer attention-only model (copying a token given a matching prefix) and a minimal mechanism to perform this behavior (a set of paths in the computational graph and a human interpretation of the transformation along those paths). Empirically, this mechanism seems to be the best possible short description of what those heads are doing (i.e. if you have to choose a subgraph made of a single path as input for the keys, queries, and values of these heads, the induction circuit is likely to be the one that recovers the most loss). But this explanation does not encompass everything these heads do. In reality, many more paths are used than the one described (see Redwood's causal scrubbing results on induction heads) and the function of the additional paths is unclear. I don't know whether the claims about the behavior and mechanisms of these heads are best described as “mostly true but missing details” or “only a small part of what's going on”. See also Buck's comment for more discussion on the interpretation of causal scrubbing recovered loss. Behavioral: The second definition, introduced by Olsson et al., relies on head activation evaluators (measuring attention patterns and head output) on out-of-distribution sequences made of Repeated Random Tokens (RRT). Two scores are used to characterize induction heads: i) Prefix matching: attention probability to the first occurrence of the token [A] on patterns like [A][B] . [A] ii) Copying: how much the head output increases the logit of [A] compared to the other logits. The RRT distribution was chosen so that fully abstracted induction behavior is one of the few useful heuristics to predict the next token. In this post, I'll use mechanistic or behavioral induction heads to differentiate between the definitions. I'll present three points that — I think — are important to keep in mind when using these definitions. 1 - The two-head mechanism (induction head and previous token head) described in Elhage et al. is the minimal way to implement an induction head As noted in the paper, induction heads can use more complicated mechanisms. For instance, instead of relying on a previous token head to match only one token as a prefix (the token [A] in the example above), they could rely on a head that attends further back to match longer prefixes (e.g. patterns like [X][A][B] . [X][A]). Empirically, evidence for induction heads using some amount of longer prefix matching has been observed in the causal scrubbing experiments on induction. The two-head mechanism i...

The Nonlinear Library
LW - Practical Pitfalls of Causal Scrubbing by Jérémy Scheurer

The Nonlinear Library

Play Episode Listen Later Mar 27, 2023 22:34


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Practical Pitfalls of Causal Scrubbing, published by Jérémy Scheurer on March 27, 2023 on LessWrong. TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations of CaSc, in particular, that it cannot falsify all incorrect hypotheses. We provide concrete examples of false positive results with causal scrubbing. Our main finding is that false positives can occur when there is “cancellation”, i.e., CaSc causes the model to do better on some inputs and worse on others, such that, on average, the scrubbed model recovers the full loss. A second practical failure mode is that CaSc cannot detect whether a proposed hypothesis is specific enough, and it cannot distinguish between hypotheses that are extensionally equivalent. We thank Redwood Research for generously supporting this project by providing us with their Causal Scrubbing implementation, access to REMIX materials, and computational resources. We specifically thank Ansh Radhakrishnan, Buck Shlegeris, and Nicholas Goldowsdky-Dill for their feedback and advice. We thank the Long-Term Future Fund for financial support and Marcel Steimke and Lennart Heim for operational support. Finally, we thank Marius Hobbhahn, Adam Jermyn, and Erik Jenner for valuable discussions and feedback. Introduction Causal Scrubbing (CaSc) is a method to evaluate the accuracy of hypotheses about neural networks and provides a measure of the deviation of a hypothesis from the ground truth. However, CaSc does not guarantee to reject false or incomplete hypotheses. We thus believe that systematic evaluation of CaSc to investigate these limitations is valuable (in addition to evaluating its effectiveness in the wild, as done in most existing work). Hence, we evaluate CaSc to highlight its strengths and weaknesses and explore the reliability of CaSc in confirming correct hypotheses and rejecting incorrect ones. We evaluate the reliability of CaSc on synthetic graphs. While synthetic graphs are less realistic than trained neural networks, we get access to the known ground truth interpretation, which allows us to accurately evaluate our hypotheses. Since CaSc operates on general computational graphs, any results on synthetic graphs also apply to using CaSc on neural networks (although we don't make any claim on how likely the situations we find are to occur in trained neural networks). Our evaluation is based on creating a synthetic graph that solves a specific problem (e.g., sorting a list) and creating an identical interpretation graph (the correct hypothesis). We then perturb the correct interpretation graph to make the hypothesis “worse”. Finally, we evaluate whether CaSc correctly determines the better hypothesis. Ideally, we want the scrubbed loss (the loss induced by applying CaSc) to correlate with the “correctness” of a hypothesis. To determine whether a hypothesis is “better” or “worse”, we introduce the concepts of extensional and intensional equivalence between functions. Extensional equivalent functions have the same input-output behavior; for example, Quicksort and Mergesort are extensionally equivalent as they both sort an input sequence. Intensional equivalent functions are implemented in the same way mechanistically. So once we zoom further in and compare Quicksort and Mergesort algorithmically, we see that they are not intensionally equivalent. This point is already made in the CaSc writeup, and our goal is merely to highlight that in the context of mechanistic interpretability, this is an important distinction that's easy to overlook. In this post, we look deep...

The Nonlinear Library
AF - Practical Pitfalls of Causal Scrubbing by Jérémy Scheurer

The Nonlinear Library

Play Episode Listen Later Mar 27, 2023 22:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Practical Pitfalls of Causal Scrubbing, published by Jérémy Scheurer on March 27, 2023 on The AI Alignment Forum. TL;DR: We evaluate Causal Scrubbing (CaSc) on synthetic graphs with known ground truth to determine its reliability in confirming correct hypotheses and rejecting incorrect ones. First, we show that CaSc can accurately identify true hypotheses and quantify the degree to which a hypothesis is wrong. Second, we highlight some limitations of CaSc, in particular, that it cannot falsify all incorrect hypotheses. We provide concrete examples of false positive results with causal scrubbing. Our main finding is that false positives can occur when there is “cancellation”, i.e., CaSc causes the model to do better on some inputs and worse on others, such that, on average, the scrubbed model recovers the full loss. A second practical failure mode is that CaSc cannot detect whether a proposed hypothesis is specific enough, and it cannot distinguish between hypotheses that are extensionally equivalent. We thank Redwood Research for generously supporting this project by providing us with their Causal Scrubbing implementation, access to REMIX materials, and computational resources. We specifically thank Ansh Radhakrishnan, Buck Shlegeris, and Nicholas Goldowsdky-Dill for their feedback and advice. We thank the Long-Term Future Fund for financial support and Marcel Steimke and Lennart Heim for operational support. Finally, we thank Marius Hobbhahn, Adam Jermyn, and Erik Jenner for valuable discussions and feedback. Introduction Causal Scrubbing (CaSc) is a method to evaluate the accuracy of hypotheses about neural networks and provides a measure of the deviation of a hypothesis from the ground truth. However, CaSc does not guarantee to reject false or incomplete hypotheses. We thus believe that systematic evaluation of CaSc to investigate these limitations is valuable (in addition to evaluating its effectiveness in the wild, as done in most existing work). Hence, we evaluate CaSc to highlight its strengths and weaknesses and explore the reliability of CaSc in confirming correct hypotheses and rejecting incorrect ones. We evaluate the reliability of CaSc on synthetic graphs. While synthetic graphs are less realistic than trained neural networks, we get access to the known ground truth interpretation, which allows us to accurately evaluate our hypotheses. Since CaSc operates on general computational graphs, any results on synthetic graphs also apply to using CaSc on neural networks (although we don't make any claim on how likely the situations we find are to occur in trained neural networks). Our evaluation is based on creating a synthetic graph that solves a specific problem (e.g., sorting a list) and creating an identical interpretation graph (the correct hypothesis). We then perturb the correct interpretation graph to make the hypothesis “worse”. Finally, we evaluate whether CaSc correctly determines the better hypothesis. Ideally, we want the scrubbed loss (the loss induced by applying CaSc) to correlate with the “correctness” of a hypothesis. To determine whether a hypothesis is “better” or “worse”, we introduce the concepts of extensional and intensional equivalence between functions. Extensional equivalent functions have the same input-output behavior; for example, Quicksort and Mergesort are extensionally equivalent as they both sort an input sequence. Intensional equivalent functions are implemented in the same way mechanistically. So once we zoom further in and compare Quicksort and Mergesort algorithmically, we see that they are not intensionally equivalent. This point is already made in the CaSc writeup, and our goal is merely to highlight that in the context of mechanistic interpretability, this is an important distinction that's easy to overlook. In this post,...

The Nonlinear Library
EA - The illusion of consensus about EA celebrities by Ben Millwood

The Nonlinear Library

Play Episode Listen Later Mar 17, 2023 3:48


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The illusion of consensus about EA celebrities, published by Ben Millwood on March 17, 2023 on The Effective Altruism Forum. Epistemic status: speaking for myself and hoping it generalises I don't like everyone that I'm supposed to like: I've long thought that [redacted] was focused on all the wrong framings of the issues they discuss, [redacted] is on the wrong side of their disagreement with [redacted] and often seems to have kind of sloppy thinking about things like this, [redacted] says many sensible things but a writing style that I find intensely irritating and I struggle to get through; [redacted] is similar, but not as sensible, [redacted] is working on an important problem, but doing a kind of mediocre job of it, which might be crowding out better efforts. Why did I redact all those names? Well, my criticisms are often some mixture of: half-baked; I don't have time to evaluate everyone fairly and deeply, and don't need to in order to make choices about what to focus on, based on justifications that are not very legible or easy to communicate, not always totally central to their point or fatal to their work, kind of upsetting or discouraging to hear, often not that actionable. I want to highlight that criticisms like this will usually not surface, and while in individual instances this is sensible, in aggregate it may contribute to a misleading view of how we view our celebrities and leaders. We end up seeming more deferential and hero-worshipping than we really are. This is bad for two reasons: it harms our credibility in the eyes of outsiders (or insiders, even) who have negative views of those people, it projects the wrong expectation to newcomers who trust us and want to learn or adopt our attitudes. What to do about it? I think "just criticise people more" in isolation is not a good solution. People, even respected people in positions of leadership, often seem to already find posting on the Forum a stressful experience, and I think tipping that balance in the more brutal direction seems likely to cost more than it gains. I think you could imagine major cultural changes around how people give and receive feedback that could make this better, mitigate catastrophising about negative feedback, and ensure people feel safe to risk making mistakes or exposing their oversights. But those seem to me like heavy, ambitious pieces of cultural engineering that require a lot of buy-in to get going, and even if successful may incur ongoing frictional costs. Here's smaller, simpler things that could help: Write a forum post about it (this one's taken, sorry), Make disagreements more visible and more legible, especially among leaders or experts. I really enjoyed the debate between Will MacAskill and Toby Ord in the comments of Are we living at the most influential time in history? ­– you can't come away from that discussion thinking "oh, whatever the smart, respected people in EA think must be right", because either way at least one of them will disagree with you! There's a lot of disagreement on the Forum all the time, of course, but I have a (somewhat unfair) vibe of this as the famous people deposit their work into the forum and leave for higher pursuits, and then we in the peanut gallery argue over it. I'd love it if there were (say) a document out there that Redwood Research and Anthropic both endorsed, that described how their agendas differ and what underlying disagreements lead to those differences. Make sure people incoming to the community, or at the periphery of the community, are inoculated against this bias, if you spot it. Point out that people usually have a mix of good and bad ideas. Have some go-to examples of respected people's blind spots or mistakes, at least as they appear to you. (Even if you never end up explaining them to anyone, it's probably goo...

The Nonlinear Library
AF - EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety by Stephen Casper

The Nonlinear Library

Play Episode Listen Later Feb 17, 2023 20:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: EIS VI: Critiques of Mechanistic Interpretability Work in AI Safety, published by Stephen Casper on February 17, 2023 on The AI Alignment Forum. Part 6 of 12 in the Engineer's Interpretability Sequence. Thanks to Chris Olah and Neel Nanda for discussions and comments. In particular, I am thankful to Neel Nanda correcting a mistake I made in understanding the arguments in Olsson et al. (2022) in an earlier draft of this post. TAISIC = “the AI safety interpretability community” MI = “mechanistic interpretability” What kind of work this post focused on TAISIC prioritizes a relatively small set of problems in interpretability relative to the research community at large. This work is not homogenous, but a dominant theme is a focus on mechanistic, circuits-style interpretability with the end goals of model verification and/or detecting deceptive alignment. There is a specific line of work that this post focuses on. Key papers from it include: Feature Visualization (Olah et al., 2017) Zoom In: An Introduction to Circuits (Olah et al., 2020) Curve Detectors (Cammarata et al., 2020) A Mathematical Framework for Transformer Circuits (Elhage et al., 2021) In-context Learning and Induction Heads (Olsson et al., 2022) Toy Models of Superposition (Elhage et al., 2022) Softmax Linear Units (Elhage et al., 2022) Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small (Wang et al., 2022) Progress measures for grokking via mechanistic interpretability (Nanda et al., 2023) .etc. And the points in this post will also apply somewhat to the current research agendas of Anthropic, Redwood Research, ARC, and Conjecture. This includes Causal Scrubbing (Chan et al., 2022) and mechanistic anomaly detection (Christiano, 2022). Most (all?) of the above work is either from Distill or inspired in part by Distill's interpretability work in the late 2010s. To be clear, I believe this research is valuable, and it has been foundational to my own thinking about interpretability. But there seem to be some troubles with this space that might be keeping it from being as productive as it can be. Now may be a good time to make some adjustments to TAISIC's focus on MI. This may be especially important given how much recent interest there has been in interpretability work and how there are large recent efforts focused on getting a large number of junior researchers working on it. Four issues This section discusses four major critiques of the works above. Not all of these critiques apply to all of the above, but for every paper mentioned above, at least one of the critiques below apply to it. Some but not all of these examples of papers exhibiting these problems will be covered. Cherrypicking results As discussed in EIS III and the Toward Transparent AI survey (Räuker et al., 2022), cherrypicking is common in the interpretability literature, but it manifests in some specific ways in MI work. It is very valuable for papers to include illustrative examples to build intuition, but when a paper makes such examples a central focus, cherrypicking can make results look better than they are. The feature visualization (Olah et al., 2017) and zoom in (Olah et al., 2020) papers have examples of this. Have a look at the cover photo for (Olah et al., 2017). From Olah et al., (2017) These images seem easy to describe and form hypotheses from. But instead of these, try going to OpenAI' microscope and looking at some random visualizations. For example, here are some from a deep layer in an Inception-v4. From this link. As someone who often works with feature visualizations, I can confirm that these visualizations from OpenAI microscope are quite typical. But notice how they seem quite a bit less ‘lucid' than the ones in the cover photo from Olah et al., (2017). Of course, many papers present t...

The Nonlinear Library
EA - Celebrating EAGxLatAm and EAGxIndia by OllieBase

The Nonlinear Library

Play Episode Listen Later Jan 26, 2023 6:53


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Celebrating EAGxLatAm and EAGxIndia, published by OllieBase on January 26, 2023 on The Effective Altruism Forum. In the first two weeks of January, we had the first EA conferences to take place in Latin America and India: EAGxLatAm in Mexico City and EAGxIndia in Jaipur, Rajasthan. Lots of EAGx events happen every year, and we don't usually post about each one. But I think these events are particularly special as two of the largest EA conferences to take place in LMICs (Low and Middle Income Countries) and the first ever in each region. So, I wanted to share a bit more about how they went. Tl;DR: they went well. EAGxLatinAmerica EAGxLatAm was organised by community builders from the Spanish-speaking EA community with support from CEA. Over 200 people attended the event: ~32% of attendees came from within Mexico, ~13% from the US, ~12% from Colombia, ~7% from Brazil, ~5% from Chile and ~5% from Spain. There were also attendees from several other countries including Argentina, Peru, the Philippines, Canada, and countries in Europe. The event hosted talks in English, Spanish, and Portuguese and attendees could mark on their name badges which languages they spoke. Sessions included: A talk on the high-impact careers service, Carreras con Impacto; A virtual Q&A with Toby Ord to discuss existential risks and the role of LMICs in tackling them; Racionalidad para principiantes (Rationality for beginners) by Laura González Salmerón; A panel on Riesgos Catastróficos Globales (Global Catastrophic Risks); A panel on AI safety with staff from OpenAI, Redwood Research, CSER and Epoch; A talk on Animal Welfare cause prioritisation by Daniela Romero Waldhorn (Rethink Priorities); A talk by JPAL staff on ‘Applying evidence to reduce intimate partner violence'; A talk on local cause prioritisation by Luis Mota (Global Priorities Institute); A panel on EA in LMICs with EA community builders from Brazil, the Spanish-speaking community, the Philippines, Nigeria, South Africa, and EA Anywhere; A talk by Vida Plena (a new organisation incubated by Charity Entrepreneurship focused on mental health in Latin America); Meetups for attendees from Mexico, Chile/Argentina, Colombia/Perú, Brazil and Spain. The event received a likelihood to recommend score of 9.07/10 and attendees made an average of 9.37 new connections. Some quotes from the feedback survey I enjoyed reading: “I had a 1-1 with someone else interested in nuclear security and nuclear winter research, and we're developing an online reading group curriculum to [ideally] launch in March!" “I got convinced on deciding to focus on [sic] direct work on ai safety by talking to [two other attendees].” “I met Rob Wiblin at lunch and I didn't recognize him." One attendee described the event to a CEA colleague as “the best weekend of their life”. The event wasn't perfect, of course. Several attendees reported that there was too much content in English that some attendees couldn't follow, and some of the snacks and beverages seemed low quality. We hope to do even better next time! EAGxIndia EAGxIndia was organised by community builders based in India with support from CEA. 163 people attended: ~70% of attendees came from within India, ~9% from the UK and ~9% from the US. The rest came from Singapore, Indonesia, the Netherlands, Germany, and Brazil. The sessions were all in English, and included: A talk on South Asian Air Quality by Santosh Harish (Open Philanthropy); An AI governance Q&A with Michael Aird (Rethink Priorities); An update from the Fish Welfare Initiative, who are based in India; A talk on AI safety research agendas and opportunities by Adam Gleave (FAR AI); Two sessions led by Charity Entrepreneurship; A workshop on what the EA community should look like in India, led by Indian community builders and CEA staff; A talk mapping the...

The Nonlinear Library
EA - Celebrating EAGxLatAm and EAGxIndia by OllieBase

The Nonlinear Library

Play Episode Listen Later Jan 26, 2023 6:53


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Celebrating EAGxLatAm and EAGxIndia, published by OllieBase on January 26, 2023 on The Effective Altruism Forum. In the first two weeks of January, we had the first EA conferences to take place in Latin America and India: EAGxLatAm in Mexico City and EAGxIndia in Jaipur, Rajasthan. Lots of EAGx events happen every year, and we don't usually post about each one. But I think these events are particularly special as two of the largest EA conferences to take place in LMICs (Low and Middle Income Countries) and the first ever in each region. So, I wanted to share a bit more about how they went. Tl;DR: they went well. EAGxLatinAmerica EAGxLatAm was organised by community builders from the Spanish-speaking EA community with support from CEA. Over 200 people attended the event: ~32% of attendees came from within Mexico, ~13% from the US, ~12% from Colombia, ~7% from Brazil, ~5% from Chile and ~5% from Spain. There were also attendees from several other countries including Argentina, Peru, the Philippines, Canada, and countries in Europe. The event hosted talks in English, Spanish, and Portuguese and attendees could mark on their name badges which languages they spoke. Sessions included: A talk on the high-impact careers service, Carreras con Impacto; A virtual Q&A with Toby Ord to discuss existential risks and the role of LMICs in tackling them; Racionalidad para principiantes (Rationality for beginners) by Laura González Salmerón; A panel on Riesgos Catastróficos Globales (Global Catastrophic Risks); A panel on AI safety with staff from OpenAI, Redwood Research, CSER and Epoch; A talk on Animal Welfare cause prioritisation by Daniela Romero Waldhorn (Rethink Priorities); A talk by JPAL staff on ‘Applying evidence to reduce intimate partner violence'; A talk on local cause prioritisation by Luis Mota (Global Priorities Institute); A panel on EA in LMICs with EA community builders from Brazil, the Spanish-speaking community, the Philippines, Nigeria, South Africa, and EA Anywhere; A talk by Vida Plena (a new organisation incubated by Charity Entrepreneurship focused on mental health in Latin America); Meetups for attendees from Mexico, Chile/Argentina, Colombia/Perú, Brazil and Spain. The event received a likelihood to recommend score of 9.07/10 and attendees made an average of 9.37 new connections. Some quotes from the feedback survey I enjoyed reading: “I had a 1-1 with someone else interested in nuclear security and nuclear winter research, and we're developing an online reading group curriculum to [ideally] launch in March!" “I got convinced on deciding to focus on [sic] direct work on ai safety by talking to [two other attendees].” “I met Rob Wiblin at lunch and I didn't recognize him." One attendee described the event to a CEA colleague as “the best weekend of their life”. The event wasn't perfect, of course. Several attendees reported that there was too much content in English that some attendees couldn't follow, and some of the snacks and beverages seemed low quality. We hope to do even better next time! EAGxIndia EAGxIndia was organised by community builders based in India with support from CEA. 163 people attended: ~70% of attendees came from within India, ~9% from the UK and ~9% from the US. The rest came from Singapore, Indonesia, the Netherlands, Germany, and Brazil. The sessions were all in English, and included: A talk on South Asian Air Quality by Santosh Harish (Open Philanthropy); An AI governance Q&A with Michael Aird (Rethink Priorities); An update from the Fish Welfare Initiative, who are based in India; A talk on AI safety research agendas and opportunities by Adam Gleave (FAR AI); Two sessions led by Charity Entrepreneurship; A workshop on what the EA community should look like in India, led by Indian community builders and CEA staff; A talk mapping the...

The Nonlinear Library
EA - Concerns about AI safety career change by mmKALLL

The Nonlinear Library

Play Episode Listen Later Jan 14, 2023 6:22


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Concerns about AI safety career change, published by mmKALLL on January 13, 2023 on The Effective Altruism Forum. Summary: I'm a software engineer interested in working on AI safety, but confused about its career prospects. I outlined all my concerns below. In particular, I had trouble finding accounts of engineers working in the field, and the differences between organizations/companies working on AI safety are very unclear from the outside. It's also not clear if frontend skills are seen as useful, or whether applicants should reside within the US. Full text: I'm an experienced full-stack software engineer and software/strategy consultant based in Japan. I've been loosely following EA since 2010, and have become increasingly concerned about AI x-risk since 2016. This has led me to regularly consider possible careers in AI safety, especially now that the demand for software engineers in the field has increased dramatically. However, having spent ~15 hours reading about the current state of the field, organizations, and role of engineers, I find myself having more questions than I started with. In hope of finding more clarity and help share what engineers considering the career shift might be wondering, I decided to outline my main points of concern below: The only accounts of engineers working in AI safety I could find were two articles and a problem profile on 80,000 Hours. Not even the AI Alignment Forum seemed to have any posts written by engineers sharing their experience. Despite this, most orgs have open positions for ML engineers, DevOps engineers, or generalist software developers. What are all of them doing? Many job descriptions listed very similar skills for engineers, even when the orgs seemed to have very different approaches on tackling AI safety problems. Is the set of required software skills really that uniform across organizations? Do software engineers in the field feel that their day-to-day work is meaningful? Are they regularly learning interesting and useful things? How do they see their career prospects? I'm also curious whether projects are done with a diverse set of technologies? Who is typically responsible for data transformations and cleanup? How much ML theory should an engineer coming into the field learn beforehand? (I'm excited to learn about ML, but got very mixed signals about the expectations.) Some orgs describe their agenda and goals. In many cases, these seemed very similar to me, as all of them are pragmatic and many even had shared or adjacent areas of research. Given the similarities, why are there so many different organizations? How is an outsider supposed to know what makes each of them unique? As an example, MIRI states that they want to "ensure that the creation of smarter-than-human machine intelligence has a positive impact", Anthropic states they have "long-term goals of steerable, trustworthy AI", Redwood Research states they want to "align -- future systems with human interests", and Center of AI Safety states they want to "reduce catastrophic and existential risks from AI". What makes these different from each other? They all sound like they'd lead to similar conclusions about what to work on. I was surprised to find that some orgs didn't really describe their work or what differentiates them. How are they supposed to find the best engineers if interested ones can't know what areas they are working on? I also found that it's sometimes very difficult to evaluate whether an org is active and/or trustworthy. Related to this, I was baffled to find that MIRI hasn't updated their agenda since 2015, and their latest publication is dated at 2016. However, their blog seems to have ~quarterly updates? Are they still relevant? Despite finding many orgs by reading articles and publications, I couldn't find a good overall list ...

The Nonlinear Library
AF - 200 COP in MI: Looking for Circuits in the Wild by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Dec 29, 2022 20:24


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Looking for Circuits in the Wild, published by Neel Nanda on December 29, 2022 on The AI Alignment Forum. This is the third post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. I'll make another post every 1-2 days, giving a new category of open problems. If you want to read ahead, check out the draft sequence here! Motivation Motivating paper: Interpretability In The Wild Our ultimate goal is to be able to reverse engineer real, frontier language models. And the next step up from toy language models is to look for circuits in the wild. That is, taking a real model that was not designed to be tractable to interpret (albeit much smaller than GPT-3), take some specific capability it has, and try to reverse engineer how it does it. I think there's a lot of exciting low-hanging fruit here, and that enough groundwork has been laid to be accessible to newcomers to the field! In particular, I think there's a lot of potential to better uncover the underlying principles of models, and to leverage this to build better and more scalable interpretability techniques. We currently have maybe three published examples of well-understood circuits in language models - I want to have at least 20! This section is heavily inspired by Redwood Research's recent paper: Interpretability In The Wild. They analyse how GPT-2 Small solves the grammatical task of Indirect Object Identification: e.g. knowing that “John and Mary went to the store, then John handed a bottle of milk to” is followed by Mary, not John. And in a heroic feat of reverse engineering, they rigorously deciphered the algorithm. In broad strokes, the model first identifies which names are repeated, inhibits any repeated names, and then predicts that the name which is not inhibited comes next. The full circuit consists of 26 heads, sorted into 7 distinct groups, as shown below - see more details in my MI explainer. A reasonable objection is that this is a cherry-picked task on a tiny (100M parameter) model, so why should we care about it? But I've learned a lot from this work! It's helped uncover some underlying principles of networks. Notably that the model has built in redundancy: if the name mover heads (which attend to the correct name) are knocked-out, then there are backup heads which take over and do the task instead! And that the model has learned to repurpose existing functionality: the model has a component (induction heads) for the more complex task of continuing repeated subsequences, but here they also get re-purposed to detect whether a name is repeated. It would have been much harder to understand what was going on here without the prior work of reverse-engineering induction circuits! And there's many open questions building on the work that could clarify other principles like universality: when another model learns this task, does it learn the same circuit? Further, it's helped to build out a toolkit of techniques to rigorously reverse engineer models. In the process of understanding this circuit, they refined the technique of activation patching into more sophisticated approaches such as path patching (and later causal scrubbing). And this has helped lay the foundations for developing future techniques! There are many interpretability techniques that are more scalable but less mechanistic, like probing. If we have some well understood circuits, and can see how well these techniques work in those settings (and what they miss!), this gives grounding to find the good techniques and how to understand their results. Finally, I think that enough groundwork has been laid that finding circuits in the wild is tractable, even by peo...

Slate Star Codex Podcast
Perhaps It Is A Bad Thing That The World's Leading AI Companies Cannot Control Their AIs

Slate Star Codex Podcast

Play Episode Listen Later Dec 14, 2022 22:39


https://astralcodexten.substack.com/p/perhaps-it-is-a-bad-thing-that-the I. The Game Is Afoot   Last month I wrote about Redwood Research's fanfiction AI project. They tried to train a story-writing AI not to include violent scenes, no matter how suggestive the prompt. Although their training made the AI reluctant to include violence, they never reached a point where clever prompt engineers couldn't get around their restrictions. Now that same experiment is playing out on the world stage. OpenAI released a question-answering AI, ChatGPT. If you haven't played with it yet, I recommend it. It's very impressive! Every corporate chatbot release is followed by the same cat-and-mouse game with journalists. The corporation tries to program the chatbot to never say offensive things. Then the journalists try to trick the chatbot into saying “I love racism”. When they inevitably succeed, they publish an article titled “AI LOVES RACISM!” Then the corporation either recalls its chatbot or pledges to do better next time, and the game moves on to the next company in line.

The Nonlinear Library
AF - An exploration of GPT-2's embedding weights by Adam Scherlis

The Nonlinear Library

Play Episode Listen Later Dec 13, 2022 18:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An exploration of GPT-2's embedding weights, published by Adam Scherlis on December 13, 2022 on The AI Alignment Forum. I wrote this doc in December 2021, while working at Redwood Research. It summarizes a handful of observations about GPT-2's weights -- mostly the embedding matrix, but also the LayerNorm gain parameters -- that I found while doing some open-ended investigation of the model. I wanted to see how much I could learn by studying just those parameters, without looking at the attention layers, MLP layers, or activations. This is still mostly unedited. Feedback and questions are very welcome. Embedding matrix glossary Latent space – 768-channel-dim space that activations between blocks (and in skip connections) live in. Token space – 50,257-token-dim space of one-hot token vectors [Token] embedding matrix – W of shape [768, 50257] (this is the transpose of `model.lm_head.weight`) Embedding vector – Any 768-dimensional column of W Embedding space – Latent space Preferred channel basis from LayerNorm This section involves more math than the rest; you can skip it without missing much. "Preferred basis" means the same thing as "privileged basis" as defined by Anthropic here. Most of GPT-2's architecture – the attention layers, fully-connected layers and residual connections – don't impose a preferred basis on the latent space. (The skip connections give a canonical map between the latent spaces in different layers, so that it makes sense to talk about “the” latent space globally.) The LayerNorm layers actually do privilege a basis, though, for two reasons. LayerNorm(x) is defined as follows, where w and b are learned parameters: yi=∑j(Idij−1i1j)xj(subtract mean; 1 is an all-ones vector) zi=yi/√∑jyjyj (normalize variance) LayerNorm(x)i=wizi−bi (apply bias and gain parameters) The first line picks out part of a preferred basis – a preferred vector (and the subspace orthogonal to it), namely the all-ones vector 1. The second line doesn't pick out a basis, because the stdev of the components of y is just its L2 norm and normalizing a vector is independent of (orthonormal) basis. (It does impose a preferred inner product, but we had that already: GPT-2's unembedding matrix is the transpose of the embedding matrix, and W^T W is a matrix of inner products.) The third line is the interesting one. The gain vector w is multiplied elementwise with z, which breaks basis-independence. (You could also write this as D_ij z_j where D is a diagonal matrix, D_ii = w_i.) This makes the standard basis a preferred basis, in a pretty straightforward way: the network can dial up or down the size of different standard-basis components independently (and can't do this for the components with respect to any other basis). This provides some justification for breaking things out by latent-space dimension in most of this post. Token/Position embedding rivalry Here's an interesting phenomenon I don't have a complete explanation for. This is a chunk of the token embedding matrix: (X: token dims, Y: channel dims) The horizontal stripes are dimensions of the latent space where the token embeddings are systematically larger or smaller than average. This gets normalized away inside residual blocks, but is preserved by skip connections. We can also make histograms from this matrix, by taking either a row or a column and plotting a histogram of the numbers in it. We can overlay a few of these histograms, in different colors, to show a few columns or a few rows at the same time. Overlaid histograms of the first 30 columns (tokens, left) and the first 30 rows (channel dims, right): Note that many of the channel dims follow roughly the same distribution, while a few have smaller variance and larger means. This shows up as a heavy left-tail of the distribution of stdevs, and heavy tails in both direct...

The Nonlinear Library
AF - Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research] by Lawrence Chan

The Nonlinear Library

Play Episode Listen Later Dec 3, 2022 37:26


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research], published by Lawrence Chan on December 3, 2022 on The AI Alignment Forum. Authors sorted alphabetically. Summary: This post introduces causal scrubbing, a principled approach for evaluating the quality of mechanistic interpretations. The key idea behind causal scrubbing is to test interpretability hypotheses via behavior-preserving resampling ablations. We apply this method to develop a refined understanding of how a small language model implements induction and how an algorithmic model correctly classifies if a sequence of parentheses is balanced. 1 Introduction A question that all mechanistic interpretability work must answer is, “how well does this interpretation explain the phenomenon being studied?”. In the many recent papers in mechanistic interpretability, researchers have generally relied on ad-hoc methods to evaluate the quality of interpretations. This ad hoc nature of existing evaluation methods poses a serious challenge for scaling up mechanistic interpretability. Currently, to evaluate the quality of a particular research result, we need to deeply understand both the interpretation and the phenomenon being explained, and then apply researcher judgment. Ideally, we'd like to find the interpretability equivalent of property-based testing—automatically checking the correctness of interpretations, instead of relying on grit and researcher judgment. More systematic procedures would also help us scale-up interpretability efforts to larger models, behaviors with subtler effects, and to larger teams of researchers. To help with these efforts, we want a procedure that is both powerful enough to finely distinguish better interpretations from worse ones, and general enough to be applied to complex interpretations. In this work, we propose causal scrubbing, a systematic ablation method for testing precisely stated hypotheses about how a particular neural network implements a behavior on a dataset. Specifically, given an informal hypothesis about which parts of a model implement the intermediate calculations required for a behavior, we convert this to a formal correspondence between a computational graph for the model and a human-interpretable computational graph. Then, causal scrubbing starts from the output and recursively finds all of the invariances of parts of the neural network that are implied by the hypothesis, and then replaces the activations of the neural network with the maximum entropy distribution subject to certain natural constraints implied by the hypothesis and the data distribution. We then measure how well the scrubbed model implements the specific behavior. Insofar as the hypothesis explains the behavior on the dataset, the model's performance should be unchanged. Unlike previous approaches that were specific to particular applications, causal scrubbing aims to work on a large class of interpretability hypotheses, including almost all hypotheses interpretability researchers propose in practice (that we're aware of). Because the tests proposed by causal scrubbing are mechanically derived from the proposed hypothesis, causal scrubbing can be incorporated “in the inner loop” of interpretability research. For example, starting from a hypothesis that makes very broad claims about how the model works and thus is consistent with the model's behavior on the data, we can iteratively make hypotheses that make more specific claims while monitoring how well the new hypotheses explain model behavior. We demonstrate two applications of this approach in later posts: first on a parenthesis balancer checker, then on the induction heads in a two-layer attention-only language model. We see our contributions as the following: We formalize a notion of interpretability hy...

Slate Star Codex Podcast
Can This AI Save Teenage Spy Alex Rider From A Terrible Fate?

Slate Star Codex Podcast

Play Episode Listen Later Nov 30, 2022 38:14


We're showcasing a hot new totally bopping, popping musical track called “bromancer era? bromancer era?? bromancer era???“ His subtle sublime thoughts raced, making his eyes literally explode. https://astralcodexten.substack.com/p/can-this-ai-save-teenage-spy-alex         “He peacefully enjoyed the light and flowers with his love,” she said quietly, as he knelt down gently and silently. “I also would like to walk once more into the garden if I only could,” he said, watching her. “I would like that so much,” Katara said. A brick hit him in the face and he died instantly, though not before reciting his beloved last vows: “For psp and other releases on friday, click here to earn an early (presale) slot ticket entry time or also get details generally about all releases and game features there to see how you can benefit!” — Talk To Filtered Transformer Rating: 0.1% probability of including violence “Prosaic alignment” is the most popular paradigm in modern AI alignment. It theorizes that we'll train future superintelligent AIs the same way that we train modern dumb ones: through gradient descent via reinforcement learning. Every time they do a good thing, we say “Yes, like this!”, in a way that pulls their incomprehensible code slightly in the direction of whatever they just did. Every time they do a bad thing, we say “No, not that!,” in a way that pushes their incomprehensible code slightly in the opposite direction. After training on thousands or millions of examples, the AI displays a seemingly sophisticated understanding of the conceptual boundaries of what we want. For example, suppose we have an AI that's good at making money. But we want to align it to a harder task: making money without committing any crimes. So we simulate it running money-making schemes a thousand times, and give it positive reinforcement every time it generates a legal plan, and negative reinforcement every time it generates a criminal one. At the end of the training run, we hopefully have an AI that's good at making money and aligned with our goal of following the law. Two things could go wrong here: The AI is stupid, ie incompetent at world-modeling. For example, it might understand that we don't want it to commit murder, but not understand that selling arsenic-laden food will kill humans. So it sells arsenic-laden food and humans die. The AI understands the world just fine, but didn't absorb the categories we thought it absorbed. For example, maybe none of our examples involved children, and so the AI learned not to murder adult humans, but didn't learn not to murder children. This isn't because the AI is too stupid to know that children are humans. It's because we're running a direct channel to something like the AI's “subconscious”, and we can only talk to it by playing this dumb game of “try to figure out the boundaries of the category including these 1,000 examples”. Problem 1 is self-resolving; once AIs are smart enough to be dangerous, they're probably smart enough to model the world well. How bad is Problem 2? Will an AI understand the category boundaries of what we want easily and naturally after just a few examples? Will it take millions of examples and a desperate effort? Or is there some reason why even smart AIs will never end up with goals close enough to ours to be safe, no matter how many examples we give them? AI scientists have debated these questions for years, usually as pure philosophy. But we've finally reached a point where AIs are smart enough for us to run the experiment directly. Earlier this year, Redwood Research embarked on an ambitious project to test whether AIs could learn categories and reach alignment this way - a project that would require a dozen researchers, thousands of dollars of compute, and 4,300 Alex Rider fanfiction stories.

The Nonlinear Library
AF - Current themes in mechanistic interpretability research by Lee Sharkey

The Nonlinear Library

Play Episode Listen Later Nov 16, 2022 20:56


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Current themes in mechanistic interpretability research, published by Lee Sharkey on November 16, 2022 on The AI Alignment Forum. This post gives an overview of discussions - from the perspective and understanding of the interpretability team at Conjecture - between mechanistic interpretability researchers from various organizations including Conjecture, Anthropic, Redwood Research, OpenAI, and DeepMind as well as some independent researchers. It is not a review of past work, nor a research agenda. We're thankful for comments and contributions from Neel Nanda, Tristan Hume, Chris Olah, Ryan Greenblatt, William Saunders, and other anonymous contributors to this post, which greatly improved its quality. While the post is a summary of discussions with many researchers and received comments and contributions from several, it may nevertheless not accurately represent their views. The last two to three years have seen a surge in interest in mechanistic interpretability as a potential path to AGI safety. Now there are no fewer than five organizations working on the topic (Anthropic, Conjecture, DeepMind, OpenAI, Redwood Research) in addition to numerous academic and independent researchers. In discussions about mechanistic interpretability between a subset of researchers, several themes emerged. By summarizing these themes here, we hope to facilitate research in the field more broadly. We identify groups of themes that concern: Object-level research topics in mechanistic interpretability Research practices and tools in mechanistic interpretability Field building and research coordination in mechanistic interpretability Theories of impact for mechanistic interpretability Object-level research topics in mechanistic interpretability Solving superposition Anthropic's recent article on Toy Model of Superposition laid out a compelling case that superposition is a real phenomenon in neural networks. Superposition appears to be one of the reasons that polysemanticity happens, which makes mechanistic interpretability very difficult because it prevents us from telling simple stories about how features in one layer are constructed from features in previous layers. A solution to superposition will look like the ability to enumerate all the features that a network represents, even if they're represented in superposition. If we can do that, then we should be able to make statements like “For all features in the neural network, none violate rule X” (and more ambitiously, for "no features with property X participate in circuits which violate property Y"). Researchers at Anthropic hope this might enable ‘enumerative safety', which might allow checking random samples or comprehensive investigations of safety-critical parts of the model for unexpected and concerning components. There are many potential reasons researchers could fail to achieve enumerative safety, including failing to solve superposition, scalability challenges, and several other barriers described in the next section. Anthropic outlined several potential solutions to superposition in their article. Very briefly, these strategies are: Create models without superposition. Find a sparse overcomplete basis that describes how features are represented in models with superposition. This will likely involve large scale solutions to sparse coding. Hybrid approaches in which one changes models, not resolving superposition, but making it easier for a second stage of analysis to find a sparse overcomplete basis that describes it. Multiple organizations are pursuing these strategies. Researchers in all organizations are keen to hear from people interested in working together on this problem. However, there is a range of views among researchers on how central superposition is as a problem and how tractable it is. Barriers beyond superposit...

The Nonlinear Library
AF - Some advice on independent research by Marius Hobbhahn

The Nonlinear Library

Play Episode Listen Later Nov 8, 2022 15:36


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Some advice on independent research, published by Marius Hobbhahn on November 8, 2022 on The AI Alignment Forum. I have been doing independent research in addition to my Ph.D. for roughly a year now. For the next 6 months, I'll take a break from my Ph.D. and plan to do AI safety research full-time. I had chats with many people about independent research in the past, e.g. on EAGs or because 80K has connected me with people thinking about pursuing independent research. I had some great experiences with independent research but not everyone does. I think the variance for independent research is large and I'm worried that people get disheartened by bad experiences. So here are some considerations in which situations independent research might be a good idea and some tips that will hopefully improve your experience. I'd like to thank Magdalena Wache and Tilman Räuker for their feedback. TL;DR: At first glance, there is a bit of a paradoxical nature to independent research. If someone wants to pursue independent research they need a research agenda to work on. If they are able to construct a good research agenda, an existing institution often has incentives to hire them. On the flip side, if their research skills are not developed enough to be hired by an existing institution, their independent research might not be very successful. Thus, naively it would seem that there are few cases in which independent research makes sense. However, I think that there are many situations in which independent research or independent upskilling are a great option, e.g. when no established organization is working on the topic you find most promising, as a way to upskill for a job, to gain new research skills or to transition between jobs. Some tips for independent researchers include: getting feedback early on, aiming to collaborate with others and creating accountability mechanisms for yourself such as publishing your results. My most important advice for independent researchers is that you should probably be much more active than in other roles because there is less default structure and more responsibility on you. I'll mostly talk about AI safety research but many of these things probably also apply to other independent research. A perceived paradox Independent research is often presented as one of three default options for people seeking to do EA research, e.g. in AI safety: Academia, e.g. applying for Ph.D. and post-doc positions with labs that do research aligned with your goals. Research positions in industry, e.g. applying for Anthropic, Redwood Research, Deepmind, OpenAI or Conjecture. Independent research, e.g. supported by an EA grant. Doing independent research well requires a multitude of skills. The independent researcher needs to be able to set their own agenda, they require some basic research skills, self-discipline and some way of evaluating and correcting their own research. These are skills that usually don't come naturally but need to be learned and refined. In most standard career paths, e.g. within a Ph.D. or in an industry research team people have mentors who help them and ensure that they actually learn these skills. By default, independent research does not ensure that these skills are actually acquired. The perceived paradox is now that if someone has the skills required to do high-quality independent research, existing institutions often want to hire them. If they don't have these skills yet, the research they will produce independently is unlikely to be of high quality or conducted efficiently (unless they have mentorship or are especially talented). Thus, naively, it seems like there aren't that many situations in which independent research makes sense. However, I think there are many cases in which independent research makes a lot of sense and there there ar...

Clearer Thinking with Spencer Greenberg
Taking pleasure in being wrong (with Buck Shlegeris)

Clearer Thinking with Spencer Greenberg

Play Episode Listen Later Jun 8, 2022 76:10


How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.[Read more]

ai pleasure cto buck machine intelligence research institute redwood research buck shlegeris
Clearer Thinking with Spencer Greenberg
Taking pleasure in being wrong (with Buck Shlegeris)

Clearer Thinking with Spencer Greenberg

Play Episode Listen Later Jun 8, 2022 76:10


How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.

ai pleasure cto buck machine intelligence research institute redwood research buck shlegeris
Clearer Thinking with Spencer Greenberg
Taking pleasure in being wrong (with Buck Shlegeris)

Clearer Thinking with Spencer Greenberg

Play Episode Listen Later Jun 8, 2022 76:10


How hard is it to arrive at true beliefs about the world? How can you find enjoyment in being wrong? When presenting claims that will be scrutinized by others, is it better to hedge and pad the claims in lots of caveats and uncertainty, or to strive for a tone that matches (or perhaps even exaggerates) the intensity with which you hold your beliefs? Why should you maybe focus on drilling small skills when learning a new skill set? What counts as a "simple" question? How can you tell when you actually understand something and when you don't? What is "cargo culting"? Which features of AI are likely in the future to become existential threats? What are the hardest parts of AI research? What skills will we probably really wish we had on the eve of deploying superintelligent AIs?Buck Shlegeris is the CTO of Redwood Research, an independent AI alignment research organization. He currently leads their interpretability research. He previously worked on research and outreach at the Machine Intelligence Research Institute. His website is shlegeris.com.

ai pleasure cto buck machine intelligence research institute redwood research buck shlegeris