Podcasts about Constitutional AI

  • 37 PODCASTS
  • 47 EPISODES
  • 31m AVG DURATION
  • INFREQUENT EPISODES
  • Jun 10, 2025 LATEST

POPULARITY (trend chart, 2017–2024)


Best podcasts about Constitutional AI

Latest podcast episodes about Constitutional AI

AI Lawyer Talking Tech
June 10, 2025 - AI and the Shifting Sands of Legal Tech

AI Lawyer Talking Tech

Jun 10, 2025 · 21:26


Welcome to 'AI Lawyer Talking Tech'. Today, we dive into the rapidly evolving landscape where artificial intelligence meets the legal and technology sectors. We explore how AI is reshaping legal access and dispute resolution, while also presenting significant challenges, such as concerns about accuracy, bias, and accountability. Recent incidents in the UK courts, where lawyers cited fake case precedents generated by AI, underscore the critical need for human oversight and verification. The US Copyright Office is staking out its position on using copyrighted works for AI training, addressing prima facie infringement, fair use, and potential licensing models. Meanwhile, companies like Definely are securing significant funding to develop AI-powered solutions aimed at transforming legal workflows by moving towards more autonomous systems. We also examine the increasing regulatory focus on AI governance globally, from dedicated government ministries in the Commonwealth Caribbean and new AI supervisory agencies in Spain to legal battles over neural data privacy in states like Montana, California, and Colorado. The financing environment for tech, particularly robotics, faces its own hurdles despite rebounds in investment, highlighting the need for strategic partnerships and protection of intellectual property, especially with the integration of AI. Join us as we discuss these pivotal developments and their implications.
Stories covered in this episode:
• Don't Mind If I Do: Montana Says Hands Off Neural Data (10 Jun 2025, Perkins Coie)
• The Financing Environment and Current Trends in Robotics (10 Jun 2025, Mintz Levin)
• Scaling justice: How AI and ADR are reshaping legal access (09 Jun 2025, Thomson Reuters)
• AI Scandal Rocks British Courts: UK Lawyers Slammed For Citing Fake Cases Generated By Bots (09 Jun 2025, International Business Times UK)
• Dear Editor: Time for Commonweath Caribbean to prioritise AI governance (09 Jun 2025, Wired868)
• How to succeed as a startup General Counsel (09 Jun 2025, Thomson Reuters)
• How Brent Dyer Reframes Biglaw, And What In-House Counsel Should Expect Instead (09 Jun 2025, Above The Law)
• AI Victim Impact Statements Are Here, and They Are Just as Disastrous as You'd Expect (09 Jun 2025, Balls and Strikes)
• UK judge warns lawyers of consequences for misusing AI in court filings (09 Jun 2025, Jurist)
• Getty Images vs. Stability AI: Landmark UK Copyright Lawsuit Begins (09 Jun 2025, Win Buzzer)
• Getty legal action against AI photo firm is ‘day of reckoning', start of trial hears (09 Jun 2025, Press Gazette)
• A firm of one's own (09 Jun 2025, Law Society Gazette)
• How Digital Forensics Drive Smarter Investigations (09 Jun 2025, JD Supra)
• Raising the bar: 2025 innovation in the legal sector (09 Jun 2025, Raconteur)
• Half of Brits would trust AI for legal advice, survey finds – but experts urge caution (09 Jun 2025, Business Matters Magazine)
• High Court warns lawyers over AI use after ‘fake' cases cited in submissions (09 Jun 2025, Legal Cheek)
• LIDW 2025: Arbitration in Emerging Industries (Tech, Fintech and Crypto): Global Risks and Local Differences (09 Jun 2025, Kluwer Arbitration Blog)
• Definely raises the bar: AI startup secures $30M to transform legal workflows (09 Jun 2025, Tech Funding News)
• OpenAI Legal Issues: Data Retention After Chat Deletion (09 Jun 2025, TechBullion)
• Smart Strategies for Gen AI in Law Firms (09 Jun 2025, Patheos)
• Scaling justice: How AI and ADR are reshaping legal access (09 Jun 2025, Thomson Reuters Institute)
• Answer Engine Optimization (AEO): Is Your Law Firm Website AEO-Friendly? (09 Jun 2025, Legal Marketing & Technology Blog)
• Definely Raises $30m + In-depth Interview (09 Jun 2025, Artificial Lawyer)
• Interrogating AI, Legal or Illegal? (09 Jun 2025, Cole Schotz)
• Copyright Office Stakes Out Position on Use of Works for AI Training (09 Jun 2025, Perkins Coie)
• AI Watch: Global regulatory tracker - Spain (09 Jun 2025, White & Case)
• What is Constitutional AI and Why Does it Matter for International Arbitration? (07 Jun 2025, Kluwer Arbitration Blog)

The Cognitive Revolution
Beyond Preference Alignment: Teaching AIs to Play Roles & Respect Norms, with Tan Zhi Xuan

Nov 30, 2024 · 117:12


In this episode of The Cognitive Revolution, Nathan explores groundbreaking perspectives on AI alignment with MIT PhD student Tan Zhi Xuan. We dive deep into Xuan's critique of preference-based AI alignment and their innovative proposal for role-based AI systems guided by social consensus. The conversation extends into their fascinating work on how AI agents can learn social norms through Bayesian rule induction. Join us for an intellectually stimulating discussion that bridges philosophical theory with practical implementation in AI development.
Check out: "Beyond Preferences in AI Alignment" paper: https://arxiv.org/pdf/2408.16984 "Learning and Sustaining Shared Normative Systems via Bayesian Rule Induction in Markov Games" paper: https://arxiv.org/pdf/2402.13399
Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse
SPONSORS: Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitiverevolution Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today. Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive
RECOMMENDED PODCAST: Unpack Pricing - Dive into the dark arts of SaaS pricing with Metronome CEO Scott Woody and tech leaders. Learn how strategic pricing drives explosive revenue growth in today's biggest companies like Snowflake, Cockroach Labs, Dropbox and more. Apple: https://podcasts.apple.com/us/podcast/id1765716600 Spotify: https://open.spotify.com/show/38DK3W1Fq1xxQalhDSueFg
CHAPTERS: (00:00:00) Teaser (00:01:09) About the Episode (00:04:25) Guest Intro (00:06:25) Xuan's Background (00:12:03) AI Near-Term Outlook (00:17:32) Sponsors: Notion | Weights & Biases RAG++ (00:20:18) Alignment Approaches (00:26:11) Critiques of RLHF (00:34:40) Sponsors: Oracle Cloud Infrastructure (OCI) (00:35:50) Beyond Preferences (00:40:27) Roles and AI Systems (00:45:19) What AI Owes Us (00:51:52) Drexler's AI Services (01:01:08) Constitutional AI (01:09:43) Technical Approach (01:22:01) Norms and Deviations (01:32:31) Norm Decay (01:38:06) Self-Other Overlap (01:44:05) Closing Thoughts (01:54:23) Outro
SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://www.linkedin.com/in/nathanlabenz/ Youtube: https://www.youtube.com/@CognitiveRevolutionPodcast Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431 Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk

Lex Fridman Podcast
#452 – Dario Amodei: Anthropic CEO on Claude, AGI & the Future of AI & Humanity

Lex Fridman Podcast

Nov 11, 2024 · 322:14


Dario Amodei is the CEO of Anthropic, the company that created Claude. Amanda Askell is an AI researcher working on Claude's character and personality. Chris Olah is an AI researcher working on mechanistic interpretability. Thank you for listening ❤ Check out our sponsors: https://lexfridman.com/sponsors/ep452-sc See below for timestamps, transcript, and to give feedback, submit questions, contact Lex, etc. Transcript: https://lexfridman.com/dario-amodei-transcript CONTACT LEX: Feedback - give feedback to Lex: https://lexfridman.com/survey AMA - submit questions, videos or call-in: https://lexfridman.com/ama Hiring - join our team: https://lexfridman.com/hiring Other - other ways to get in touch: https://lexfridman.com/contact EPISODE LINKS: Claude: https://claude.ai Anthropic's X: https://x.com/AnthropicAI Anthropic's Website: https://anthropic.com Dario's X: https://x.com/DarioAmodei Dario's Website: https://darioamodei.com Machines of Loving Grace (Essay): https://darioamodei.com/machines-of-loving-grace Chris's X: https://x.com/ch402 Chris's Blog: https://colah.github.io Amanda's X: https://x.com/AmandaAskell Amanda's Website: https://askell.io SPONSORS: To support this podcast, check out our sponsors & get discounts: Encord: AI tooling for annotation & data management. Go to https://encord.com/lex Notion: Note-taking and team collaboration. Go to https://notion.com/lex Shopify: Sell stuff online. Go to https://shopify.com/lex BetterHelp: Online therapy and counseling. Go to https://betterhelp.com/lex LMNT: Zero-sugar electrolyte drink mix. Go to https://drinkLMNT.com/lex OUTLINE: (00:00) - Introduction (10:19) - Scaling laws (19:25) - Limits of LLM scaling (27:51) - Competition with OpenAI, Google, xAI, Meta (33:14) - Claude (36:50) - Opus 3.5 (41:36) - Sonnet 3.5 (44:56) - Claude 4.0 (49:07) - Criticism of Claude (1:01:54) - AI Safety Levels (1:12:42) - ASL-3 and ASL-4 (1:16:46) - Computer use (1:26:41) - Government regulation of AI (1:45:30) - Hiring a great team (1:54:19) - Post-training (1:59:45) - Constitutional AI (2:05:11) - Machines of Loving Grace (2:24:17) - AGI timeline (2:36:52) - Programming (2:43:52) - Meaning of life (2:49:58) - Amanda Askell - Philosophy (2:52:26) - Programming advice for non-technical people (2:56:15) - Talking to Claude (3:12:47) - Prompt engineering (3:21:21) - Post-training (3:26:00) - Constitutional AI (3:30:53) - System prompts (3:37:00) - Is Claude getting dumber? (3:49:02) - Character training (3:50:01) - Nature of truth (3:54:38) - Optimal rate of failure (4:01:49) - AI consciousness (4:16:20) - AGI (4:24:58) - Chris Olah - Mechanistic Interpretability (4:29:49) - Features, Circuits, Universality (4:47:23) - Superposition (4:58:22) - Monosemanticity (5:05:14) - Scaling Monosemanticity (5:14:02) - Macroscopic behavior of neural networks (5:18:56) - Beauty of neural networks

E3 S6 Responsible AI and K12 Implications with Aman Sahota

"The Bulldog Educator" with Kirsten Wilson

Oct 10, 2024 · 35:39


In Episode 3 of Season 6, co-hosts Kirsten Wilson and Matt Caston talk with Factors Edu co-founder and entrepreneur Aman Sahota. Aman shares ideas and concepts around AI: the importance of Constitutional AI, and how, when humans remain at the center of AI use, the future is hopeful and exciting. Information shared in the episode: Perplexity AI. Contact Aman Sahota: LinkedIn, X @FactorsEdu, Website: factors.world, or access the interactive platform (waitlist): app.factors.world. Listeners, please share your thoughts and ideas with us on our social media accounts on X, Instagram or Facebook @thebulldogedu. You can also follow Matt on Instagram @CastIron or X @MatthewCaston, and Kirsten on Instagram or X @teachkiwi, or on Facebook or LinkedIn as Kirsten Wilson. Please subscribe to The Bulldog Educator to continue listening on your favorite podcast platform.

The Nonlinear Library
AF - Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents by Sam Brown

The Nonlinear Library

Jul 22, 2024 · 26:02


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Auto-Enhance: Developing a meta-benchmark to measure LLM agents' ability to improve other agents, published by Sam Brown on July 22, 2024 on The AI Alignment Forum. Summary Scaffolded LLM agents are, in principle, able to execute arbitrary code to achieve the goals they have been set. One such goal could be self-improvement. This post outlines our plans to build a benchmark to measure the ability of LLM agents to modify and improve other LLM agents. This 'Auto-Enhancement benchmark' measures the ability of 'top-level' agents to improve the performance of 'reference' agents on 'component' benchmarks, such as CyberSecEval 2, MLAgentBench, SWE-bench, and WMDP. Results are mostly left for a future post in the coming weeks. Scaffolds such as AutoGPT, ReAct, and SWE-agent can be built around LLMs to build LLM agents, with abilities such as long-term planning and context-window management to enable them to carry out complex general-purpose tasks autonomously. LLM agents can fix issues in large, complex code bases (see SWE-bench), and interact in a general way using web browsers, Linux shells, and Python interpreters. In this post, we outline our plans for a project to measure these LLM agents' ability to modify other LLM agents, undertaken as part of Axiom Futures' Alignment Research Fellowship. Our proposed benchmark consists of "enhancement tasks," which measure the ability of an LLM agent to improve the performance of another LLM agent (which may be a clone of the first agent) on various tasks. Our benchmark uses existing benchmarks as components to measure LLM agent capabilities in various domains, such as software engineering, cybersecurity exploitation, and others. We believe these benchmarks are consequential in the sense that good performance by agents on these tasks should be concerning for us. We plan to write an update post with our results at the end of the Fellowship, and we will link this post to that update. Motivation Agents are capable of complex SWE tasks (see, e.g., Yang et al.). One such task could be the improvement of other scaffolded agents. This capability would be a key component of autonomous replication and adaptation (ARA), and we believe it would be generally recognised as an important step towards extreme capabilities. This post outlines our initial plans for developing a novel benchmark that aims to measure the ability of LLM-based agents to improve other LLM-based agents, including those that are as capable as themselves. Threat model We present two threat models that aim to capture how AI systems may develop super-intelligent capabilities. Expediting AI research: Recent trends show how researchers are leveraging LLMs to expedite academic paper reviews (see Du et al.). ML researchers are beginning to use LLMs to design and train more advanced models (see Cotra's AIs accelerating AI research and Anthropic's work on Constitutional AI). Such LLM-assisted research may expedite progress toward super-intelligent systems. Autonomy: Another way that such capabilities are developed is through LLM agents themselves becoming competent enough to self-modify and further ML research without human assistance (see section Hard Takeoff in this note ), leading to an autonomously replicating and adapting system. 
Our proposed benchmark aims to quantify the ability of LLM agents to bring about such recursive self-improvement, either with or without detailed human supervision. Categories of bottlenecks and overhang risks We posit that there are three distinct categories of bottlenecks to LLM agent capabilities: 1. Architectures-of-thought, such as structured planning, progress-summarisation, hierarchy of agents, self-critique, chain-of-thought, self-consistency, prompt engineering/elicitation, and so on. Broadly speaking, this encompasses everything between the LL...
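The enhancement-task metric described in this summary (a 'top-level' agent's uplift of a 'reference' agent's score on a component benchmark) is easy to picture in code. Below is a minimal illustrative sketch; the type aliases and function names are ours and purely hypothetical, not taken from the Auto-Enhance authors' actual harness.

```python
from typing import Any, Callable

# Hypothetical sketch of scoring one "enhancement task": a top-level agent is
# asked to improve a reference agent, and we measure the change in the
# reference agent's score on a component benchmark (e.g. a SWE-bench-style eval).
Agent = Any
Benchmark = Callable[[Agent], float]        # returns a score, e.g. fraction of tasks solved
TopLevelAgent = Callable[[Agent], Agent]    # takes an agent, returns a modified agent

def enhancement_uplift(top_level: TopLevelAgent,
                       reference: Agent,
                       component: Benchmark) -> float:
    """Uplift = component score after modification minus the score before."""
    baseline = component(reference)
    modified = top_level(reference)  # top-level agent edits prompts, scaffold code, tools, ...
    return component(modified) - baseline
```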

Hashtag Trending
Will the "AI bubble" burst? Hashtag Trending for Wednesday, July 10, 2024

Hashtag Trending

Jul 10, 2024 · 9:58 · Transcription available


Hashtag Trending: Canada's Streaming Regulations, AI Stock Concerns, and Constitutional AI
In this episode of Hashtag Trending, host Jim Love dives into the ongoing clash between global streaming giants and Canadian regulations requiring contributions to local news production. Analysts draw parallels between the current AI stock surge and the dot-com bubble, warning of potential pitfalls. Meanwhile, Anthropic leans on its 'Constitutional AI' approach amid rising public concern about AI safety, potentially reshaping how AI systems are developed and governed. Tune in for insights into these stories and their implications for the tech industry.
00:00 Introduction: Big Tech and AI Concerns
00:26 Canadian Streaming Regulations Clash
01:51 AI Stock Market Warnings
03:10 Sequoia Capital's AI Revenue Concerns
04:45 Anthropic's Competitive Edge with Constitutional AI
08:25 U.S. Regulatory Challenges and AI Safety
09:25 Conclusion and Show Notes

Super Prompt: Generative AI w/ Tony Wan
Power and Responsibility of Large Language Models | Safety & Ethics | OpenAI Model Spec + RLHF | Anthropic Constitutional AI | Episode 27

Super Prompt: Generative AI w/ Tony Wan

Jun 17, 2024 · 16:38


With great power comes great responsibility. How do OpenAI, Anthropic, and Meta implement safety and ethics? As large language models (LLMs) get larger, the potential for using them for nefarious purposes looms larger as well. Anthropic uses Constitutional AI, while OpenAI uses a model spec combined with RLHF (Reinforcement Learning from Human Feedback). Not to be confused with ROFL (Rolling On the Floor Laughing). Tune into this episode to learn how leading AI companies use their Spidey powers to maximize usefulness and harmlessness.
REFERENCES
OpenAI Model Spec: https://cdn.openai.com/spec/model-spec-2024-05-08.html#overview
Anthropic Constitutional AI: https://www.anthropic.com/news/claudes-constitution
For more information, check out https://www.superprompt.fm where you can contact me and/or sign up for our newsletter.

Polyweb
E38: OpenAI Grant Winner: Building AI That Actually Understands What Matters to People

Polyweb

Apr 22, 2024 · 44:59


In this episode of Polyweb, host Sara Landi Tortoli interviews Oliver Klingefjord, the co-founder and technical lead at the Meaning Alignment Institute. The Institute won an OpenAI grant to explore "democratic fine-tuning" as an approach that differs from Constitutional AI. During the conversation, Oliver and Sara discuss the process of creating a "moral graph" based on people's discussions of values, the potential risks of personalized AI models, and the broader implications of their work in aligning technology with human meaning and purpose.
00:00 Intro
04:42 How the Meaning Alignment Institute won the OpenAI grant to explore democratic AI alignment
08:13 Comparing democratic fine-tuning to RLHF and Constitutional AI
12:33 The Institute's process of surfacing underlying values through moral dilemmas
16:00 Creating a "moral graph" to capture wisdom and agreements
22:04 Risks of personalized AI models and the need for shared human values
26:25 Using the moral graph to guide and evaluate AI model outputs
32:12 Potential applications of the moral graph
36:07 Challenges of scaling to superintelligent AI
39:55 Broader implications and the need for people to articulate their values
43:55 Where to find more information about the Meaning Alignment Institute

The Nonlinear Library
LW - Announcing Atlas Computing by miyazono

The Nonlinear Library

Apr 12, 2024 · 8:06


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Atlas Computing, published by miyazono on April 12, 2024 on LessWrong. Atlas Computing is a new nonprofit working to collaboratively advance AI capabilities that are asymmetrically risk-reducing. Our work consists of building scoped prototypes and creating an ecosystem around @davidad's Safeguarded AI programme at ARIA (formerly referred to as the Open Agency Architecture). We formed in Oct 2023, and raised nearly $1M, primarily from the Survival and Flourishing Fund and Protocol Labs. We have no physical office, and are currently only Evan Miyazono (CEO) and Daniel Windham (software lead), but over the coming months and years, we hope to create compelling evidence that: The Safeguarded AI research agenda includes both research and engineering projects where breakthroughs or tools can incrementally reduce AI risks. If Atlas Computing makes only partial progress toward building safeguarded AI, we'll likely have put tools into the world that are useful for accelerating human oversight and review of AI outputs, asymmetrically favoring risk reduction. When davidad's ARIA program concludes, the work of Atlas Computing will have parallelized solving some tech transfer challenges, magnifying the impact of any technologies he develops. Our overall strategy We think that, in addition to encoding human values into AI systems, a very complementary way to dramatically reduce AI risk is to create external safeguards that limit AI outputs. Users (individuals, groups, or institutions) should have tools to create specifications that list baseline safety requirements (if not full desiderata for AI system outputs) and also interrogate those specifications with non-learned tools. A separate system should then use the specification to generate candidate solutions along with evidence that the proposed solution satisfies the spec. This evidence can then be reviewed automatically for adherence to the specified safety properties. This is by comparison to current user interactions with today's generalist ML systems, where all candidate solutions are at best reviewed manually. We hope to facilitate a paradigm where the least safe user's interactions with AI looks like: Specification-based AI vs other AI risk mitigation strategies We consider near-term risk reductions that are possible with this architecture to be highly compatible with existing alignment techniques. In Constitutional AI, humans are legislators but laws are sufficiently nuanced and subjective that they require a language model to act as a scalable executive and judiciary. Using specifications to establish an objective preliminary safety baseline that is automatically validated by a non-learned system could be considered a variation or subset of Constitutional AI. Some work on evaluations focuses on finding metrics that demonstrate safety or alignment of outputs. Our architecture expresses goals in terms of states of a world-model that is used to understand the impact of policies proposed by the AI, and would be excited to see and supportive of evals researchers exploring work in this direction. This approach could also be considered a form of scalable oversight, where a baseline set of safe specifications are automatically enforced via validation and proof generation against a spec. 
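As a rough illustration of the strategy described above (a user-authored specification, a generator that proposes candidate solutions with evidence, and a non-learned checker that reviews that evidence), here is a minimal sketch. All class and function names are hypothetical and ours, not Atlas Computing's actual tooling.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Illustrative sketch only: names are hypothetical and not Atlas Computing's tooling.

@dataclass
class Candidate:
    solution: str   # the AI system's proposed output
    evidence: str   # trace / proof sketch that a non-learned checker can inspect

@dataclass
class Specification:
    # Baseline safety requirements expressed as machine-checkable predicates.
    requirements: List[Callable[[Candidate], bool]] = field(default_factory=list)

def review(spec: Specification, candidate: Candidate) -> bool:
    """Non-learned verifier: accept only if every requirement holds."""
    return all(check(candidate) for check in spec.requirements)

def safeguarded_query(generate: Callable[[Specification], Candidate],
                      spec: Specification,
                      max_tries: int = 3) -> Optional[Candidate]:
    """Ask the generator for candidates, but only return one that passes review."""
    for _ in range(max_tries):
        candidate = generate(spec)
        if review(spec, candidate):
            return candidate
    return None  # nothing met the spec; escalate to human review instead
```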
How this differs from davidad's work at ARIA You may be aware that davidad is funding similar work as a Programme Director at ARIA (watch his 30 minute solicitation presentation here). It's worth clarifying that, while davidad and Evan worked closely at Protocol Labs, davidad is not an employee of Atlas Computing, and Atlas has received no funding from ARIA. That said, we're pursuing highly complementary paths in our hopes to reduce AI risk. His Safeguarded AI research agenda, described here, is focused on using cyberphysical systems, li...


The top AI news from the past week, every ThursdAI

Hello hello everyone, happy spring! Can you believe it? It's already spring! We have tons of AI news for you to cover, starting with the most impactful one: did you already use Claude 3? Anthropic decided to celebrate Claude 1's birthday early (which btw is also ThursdAI's birthday and the GPT-4 release date, March 14th, 2023) and gave us three new Claudes: Opus, Sonnet and Haiku. TL;DR of all topics covered: * Big CO LLMs + APIs*

AWS - Conversations with Leaders
Prioritizing Trust and Safety in Tech: Anthropic's Commitment to Responsible AI

AWS - Conversations with Leaders

Feb 20, 2024 · 27:53


Join host Phil Le-Brun, AWS Director of Enterprise Strategy, and Neerav Kingsland, Head of Global Accounts for Anthropic, in a conversation about Anthropic's journey to create the safest and most capable AI model in the world. Learn about the company's relentless focus on safety, their groundbreaking large language model Claude, and how they ensure ethical outputs through Constitutional AI, which draws from varied sources like the UN Declaration of Human Rights and technology trust and safety documents. Gain insights into the partnership between AWS and Anthropic, and learn exactly what differentiates Claude from other generative AI models. Learn more about Amazon Bedrock, the easiest way to build and scale gen AI applications, and Amazon Q, a gen AI-powered assistant that can be tailored to your business.

The Sam Altman Podcast
Constitutional AI: Anthropic's Approach to Training ChatGPT Rivals

The Sam Altman Podcast

Feb 8, 2024 · 16:09


Explore Anthropic's groundbreaking take on Constitutional AI as the optimal training method for ChatGPT competitors, dissecting its advantages and what it means for the future of AI development. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: https://AIBox.ai/ AI Facebook Community

ChatGPT: OpenAI, Sam Altman, AI, Joe Rogan, Artificial Intelligence, Practical AI
ChatGPT Evolution: Anthropic's Take on Constitutional AI Training

ChatGPT: OpenAI, Sam Altman, AI, Joe Rogan, Artificial Intelligence, Practical AI

Feb 6, 2024 · 15:32


Join Anthropic's discourse on the evolution of ChatGPT through Constitutional AI training methods, exploring the potential for enhanced AI competitiveness. Get on the AI Box Waitlist: AIBox.ai Join our ChatGPT Community: Facebook Group Follow me on Twitter: Twitter

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

In 2023 we did a few Fundamentals episodes covering Benchmarks 101, Datasets 101, FlashAttention, and Transformers Math, and it turns out those were some of your evergreen favorites! So we are experimenting with more educational/survey content in the mix alongside our regular founder and event coverage. Pls request more! We have a new calendar for events; join to be notified of upcoming things in 2024!

Today we visit the shoggoth mask factory: how do transformer models go from trawling a deeply learned latent space for next-token prediction to a helpful, honest, harmless chat assistant? Our guest "lecturer" today is Nathan Lambert; you might know him from his prolific online writing on Interconnects and Twitter, or from his previous work leading RLHF at Hugging Face and now at the Allen Institute for AI (AI2), which recently released the open-source, GPT-3.5-class Tulu 2 model trained with DPO. He's widely considered one of the most knowledgeable people on RLHF and RLAIF. He recently gave an "RLHF 201" lecture at Stanford, so we invited him on the show to re-record it for everyone to enjoy! You can find the full slides here, which you can use as reference through this episode.

Full video with synced slides
For audio-only listeners, this episode comes with a slide presentation alongside our discussion. You can find it on our YouTube (like, subscribe, tell a friend, et al.).

Theoretical foundations of RLHF
The foundations and assumptions that go into RLHF go back all the way to Aristotle (and you can find guidance for further research in the slide below), but there are two key concepts that will be helpful in thinking through this topic and LLMs in general:
* Von Neumann–Morgenstern utility theorem: you can dive into the math here, but the TL;DR is that when humans make decisions there's usually a "maximum utility" function that measures what the best decision would be; the fact that this function exists makes it possible for RLHF to model human preferences and decision making.
* Bradley-Terry model: given two items A and B from a population, you can model the probability that A will be preferred to B (or vice versa). In our world, A and B are usually two outputs from an LLM (or, at the lowest level, the next token). It turns out that from this minimal set of assumptions, you can build up the mathematical foundations supporting the modern RLHF paradigm!

The RLHF loop
One important point Nathan makes is that "for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior". For example, it might be difficult for you to write a poem, but it's really easy to say whether you like or dislike a poem someone else wrote. Going back to the Bradley-Terry model we mentioned, the core idea behind RLHF is that when given two outputs from a model, you will be able to say which of the two you prefer, and we'll then re-encode that preference into the model. An important point that Nathan mentions is that when you use these preferences to change model behavior, "it doesn't mean that the model believes these things. It's just trained to prioritize these things". When you have a preference for a model not to return instructions on how to write a computer virus, for example, you're not erasing the weights that have that knowledge; you're simply making it hard for that information to surface by prioritizing answers that don't return it.
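To make the Bradley-Terry idea above concrete, here is a minimal, self-contained sketch of the pairwise preference probability and the reward-model loss built on it. The scalar rewards stand in for a learned reward model's scores, and the function names are ours, not from the episode or the slides.

```python
import math

def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry: P(A preferred over B) given scalar scores for each completion."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of an observed preference under Bradley-Terry.
    Minimizing this over labeled pairs pushes a reward model to score the
    chosen completion above the rejected one."""
    return -math.log(bradley_terry_prob(reward_chosen, reward_rejected))

# Toy usage: a labeler preferred completion A over completion B for the same prompt.
print(pairwise_loss(reward_chosen=1.3, reward_rejected=0.2))   # ~0.29, model agrees with the label
print(pairwise_loss(reward_chosen=0.2, reward_rejected=1.3))   # ~1.39, model disagrees, larger loss
```

In a full pipeline this loss is averaged over batches of (prompt, chosen, rejected) examples, and the resulting reward model is what the policy is later optimized against, subject to the KL constraint discussed below.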
We'll talk more about this in our future Fine Tuning 101 episode as we break down how information is stored in models and how fine-tuning affects it. At a high level, the loop looks something like this:

For many RLHF use cases today, we can assume the model we're training is already instruction-tuned for chat or whatever behavior the model is looking to achieve. In the "Reward Model & Other Infrastructure" we have multiple pieces:

Reward + Preference Model
The reward model is trying to signal to the model how much it should change its behavior based on the human preference, subject to a KL constraint. The preference model itself scores the pairwise preferences from the same prompt (this worked better than scalar rewards). One way to think about it is that the reward model tells the model how big of a change this new preference should make in the behavior in absolute terms, while the preference model calculates how big of a difference there is between the two outputs in relative terms. A lot of this derives from John Schulman's work on PPO; we recommend watching him talk about it in the video above, and also Nathan's pseudocode distillation of the process.

Feedback Interfaces
Unlike the "thumbs up/down" buttons in ChatGPT, data annotation from labelers is much more thorough and has many axes of judgement. At a simple level, the LLM generates two outputs, A and B, for a given human conversation. It then asks the labeler to use a Likert scale to score which one they preferred, and by how much. Through the labeling process, there are many other ways to judge a generation. We then use all of this data to train a model from the preference pairs we have. We start from the base instruction-tuned model, and then run training in which the loss of our gradient descent is based on the difference between the good and the bad completion.

Constitutional AI (RLAIF, model-as-judge)
As these models have gotten more sophisticated, people started asking the question of whether or not humans are actually a better judge of harmfulness, bias, etc., especially at the current price of data labeling. Anthropic's work on the "Constitutional AI" paper is using models to judge models. This is part of a broader "RLAIF" space: Reinforcement Learning from AI Feedback. By using a "constitution" that the model has to follow, you are able to generate fine-tuning data for a new model that will be RLHF'd on this constitution's principles. The RLHF'd model will then be able to judge outputs of models to make sure that they follow its principles.

Emerging Research
RLHF is still a nascent field, and there are a lot of different research directions teams are taking; some of the newest and most promising / hyped ones:
* Rejection sampling / Best-of-N sampling: the core idea here is that rather than just scoring pairwise generations, you generate a lot more outputs (= more inference cost), score them all with your reward model, and then pick the top N results. Llama 2 used this approach, amongst many others.
* Process reward models: in chain-of-thought generation, scoring each step in the chain and treating it like its own state rather than just scoring the full output. This is most effective in fields like math that inherently require step-by-step reasoning.
* Direct Preference Optimization (DPO): we covered DPO in our NeurIPS Best Papers recap, and Nathan has a whole blog post on this; DPO isn't technically RLHF as it doesn't have the RL part, but it's the "GPU Poor" version of it. Mistral-Instruct was a DPO model, as are Intel's Neural Chat and StableLM Zephyr. Expect to see a lot more variants in 2024 given how "easy" this was.
* Superalignment: OpenAI launched research on weak-to-strong generalization, which we briefly discuss at the 1hr mark.

Note: Nathan also followed up this post with RLHF resources from his and peers' work.

Show Notes
* Full RLHF Slides
* Interconnects
* Retort (podcast)
* von Neumann-Morgenstern utility theorem
* Bradley-Terry model (pairwise preferences model)
* Constitutional AI
* Tamer (2008 paper by Bradley Knox and Peter Stone)
* Paul Christiano et al. RLHF paper
* InstructGPT
* Eureka by Jim Fan
* ByteDance / OpenAI lawsuit
* AlpacaEval
* MTBench
* TruthfulQA (evaluation tool)
* Self-Instruct Paper
* Open Assistant
* Louis Castricato
* Nazneen Rajani
* Tulu (DPO model from the Allen Institute)

Timestamps
* [00:00:00] Introductions and background on the lecture origins
* [00:05:17] History of RL and its applications
* [00:10:09] Intellectual history of RLHF
* [00:13:47] RLHF for decision-making and pre-deep RL vs deep RL
* [00:20:19] Initial papers and intuitions around RLHF
* [00:27:57] The three phases of RLHF
* [00:31:09] Overfitting issues
* [00:34:47] How preferences get defined
* [00:40:35] Ballpark on LLaMA2 costs
* [00:42:50] Synthetic data for training
* [00:47:25] Technical deep dive in the RLHF process
* [00:54:34] Projection / best event sampling
* [00:57:49] Constitutional AI
* [01:04:13] DPO
* [01:08:54] What's the Allen Institute for AI?
* [01:13:43] Benchmarks and models comparisons

Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol AI.Swyx [00:00:15]: Hey, and today we have Dr. Nathan Lambert in the house. Welcome.Nathan [00:00:18]: Thanks guys.Swyx [00:00:19]: You didn't have to come too far. You got your PhD in Berkeley, and it seems like you've lived there most of the time in recent years. You worked on robotics and model-based reinforcement learning on your PhD, and you also interned at FAIR and DeepMind. You bootstrapped the RLHF team at Hugging Face, and you recently joined the Allen Institute as a research scientist. So that's your quick bio. What should people know about you that maybe is not super obvious about you on New LinkedIn?Nathan [00:00:43]: I stay sane in various insane sport and ultra-endurance sport activities that I do.Swyx [00:00:50]: What's an ultra-endurance sport activity?Nathan [00:00:52]: Long-distance trail running or gravel biking. Try to unplug sometimes, although it's harder these days. Yeah.Swyx [00:00:59]: Well, you know, just the Bay Area is just really good for that stuff, right?Nathan [00:01:02]: Oh, yeah. You can't beat it. I have a trailhead like 1.2 miles from my house, which is pretty unmatchable in any other urban area.Swyx [00:01:11]: Pretty excellent. You also have an incredible blog, Interconnects, which I'm a fan of. And I also just recently discovered that you have a new podcast, Retort.Nathan [00:01:20]: Yeah, we do. I've been writing for a while, and I feel like I've finally started to write things that are understandable and fun. After a few years lost in the wilderness, if you ask some of my friends that I made read the earlier blogs, they're like, oh, this is yikes, but it's coming along.
And the podcast is with my friend Tom, and we just kind of like riff on what's actually happening on AI and not really do news recaps, but just what it all means and have a more critical perspective on the things that really are kind of funny, but still very serious happening in the world of machine learning.Swyx [00:01:52]: Yeah. Awesome. So let's talk about your work. What would you highlight as your greatest hits so far on Interconnects, at least?Nathan [00:01:59]: So the ones that are most popular are timely and or opinion pieces. So the first real breakout piece was when April and I also just wrote down the thing that everyone in AI was feeling, which is we're all feeling stressed, that we're going to get scooped, and that we're overworked, which is behind the curtain, what it feels to work in AI. And then a similar one, which we might touch on later in this, was about my recent job search, which wasn't the first time I wrote a job search post. People always love that stuff. It's so open. I mean, it's easy for me to do in a way that it's very on-brand, and it's very helpful. I understand that until you've done it, it's hard to share this information. And then the other popular ones are various model training techniques or fine tuning. There's an early one on RLHF, which is, this stuff is all just like when I figure it out in my brain. So I wrote an article that's like how RLHF actually works, which is just the intuitions that I had put together in the summer about RLHF, and that was pretty well. And then I opportunistically wrote about QSTAR, which I hate that you have to do it, but it is pretty funny. From a literature perspective, I'm like, open AI publishes on work that is very related to mathematical reasoning. So it's like, oh, you just poke a little around what they've already published, and it seems pretty reasonable. But we don't know. They probably just got like a moderate bump on one of their benchmarks, and then everyone lost their minds. It doesn't really matter.Swyx [00:03:15]: You're like, this is why Sam Altman was fired. I don't know. Anyway, we're here to talk about RLHF 101. You did a presentation, and I think you expressed some desire to rerecord it. And that's why I reached out on Twitter saying, like, why not rerecord it with us, and then we can ask questions and talk about it. Yeah, sounds good.Nathan [00:03:30]: I try to do it every six or 12 months is my estimated cadence, just to refine the ways that I say things. And people will see that we don't know that much more, but we have a bit of better way of saying what we don't know.Swyx [00:03:43]: Awesome. We can dive right in. I don't know if there's any other topics that we want to lay out as groundwork.Alessio [00:03:48]: No, you have some awesome slides. So for people listening on podcast only, we're going to have the slides on our show notes, and then we're going to have a YouTube version where we run through everything together.Nathan [00:03:59]: Sounds good. Yeah. I think to start skipping a lot of the, like, what is a language model stuff, everyone knows that at this point. I think the quote from the Llama 2 paper is a great kind of tidbit on RLHF becoming like a real deal. There was some uncertainty earlier in the year about whether or not RLHF was really going to be important. I think it was not that surprising that it is. I mean, with recent models still using it, the signs were there, but the Llama 2 paper essentially reads like a bunch of NLP researchers that were skeptical and surprised. 
So the quote from the paper was, meanwhile, reinforcement learning known for its instability seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement learning proved highly effective, particularly given its cost and time effectiveness. So you don't really know exactly what the costs and time that Meta is looking at, because they have a huge team and a pretty good amount of money here to release these Llama models. This is just the kind of thing that we're seeing now. I think any major company that wasn't doing RLHF is now realizing they have to have a team around this. At the same time, we don't have a lot of that in the open and research communities at the same scale. I think seeing that converge would be great, but it's still very early days. And the other thing on the slide is some of Anthropic's work, but everyone knows Anthropic is kind of the masters of this, and they have some of their own techniques that we're going to talk about later on, but that's kind of where we start.Alessio [00:05:17]: Can we do just a one-second RL version? So you come from a robotics background, which RL used to be, or maybe still is, state-of-the-art. And then now you're seeing a lot of LLM plus RL, so you have the gym fans, Eureka, you have MPU, which we had on the podcast when they started with RL. Now they're doing RL plus LLMs. Yeah. Any thoughts there on how we got here? Maybe how the pendulum will keep swinging?Nathan [00:05:46]: I really think RL is about a framing of viewing the world through trial and error learning and feedback, and really just one that's focused on thinking about decision-making and inputs in the world and how inputs have reactions. And in that, a lot of people come from a lot of different backgrounds, whether it's physics, electrical engineering, mechanical engineering. There are obviously computer scientists, but compared to other fields of CS, I do think it's a much more diverse background of people. My background was in electrical engineering and doing robotics and things like that. It really just changes the worldview. I think that reinforcement learning as it was back then, so to say, is really different. You're looking at these toy problems and the numbers are totally different, and everyone went kind of zero to one at scaling these things up, but people like Jim Phan and other people that were... You saw this transition in the decision transformer and papers and when people are trying to use transformers to do decision-making for things like offline RL, and I think that was kind of like the early days. But then once language models were so proven, it's like everyone is using this tool for their research. I think in the long run, it will still settle out, or RL will still be a field that people work on just because of these kind of fundamental things that I talked about. It's just viewing the whole problem formulation different than predicting text, and so there needs to be that separation. And the view of RL in language models is pretty contrived already, so it's not like we're doing real RL. I think the last slide that I have here is a way to make RLHF more like what people would think of with RL, so actually running things over time, but a weird lineage of tools that happen to get us to where we are, so that's why the name takes up so much space, but it could have gone a lot of different ways. Cool.Alessio [00:07:29]: We made it one slide before going on a tangent.Nathan [00:07:31]: Yeah, I mean, it's kind of related. 
This is a...Swyx [00:07:35]: Yeah, so we have a history of RL.Nathan [00:07:37]: Yeah, so to give the context, this paper really started because I have this more diverse background than some computer scientists, such as trying to understand what the difference of a cost function or a reward function and a preference function would be without going into all of the details. Costs are normally things that control theorists would work with in these kind of closed domains, and then reinforcement learning has always worked with rewards that's central to the formulation that we'll see, and then the idea was like, okay, we now are at preferences, and each step along the way there's kind of different assumptions that you're making. We'll get into these, and those assumptions are built on other fields of work. So that's what this slide is going to say, it's like RLHF, while directly building on tools from RL and language models, is really implicitly impacted and built on theories and philosophies spanning tons of human history. I think we cite Aristotle in this paper, which is fun. It's like going pre-BC, it's like 2,300 years old or something like that. So that's the reason to do this, I think. We kind of list some things in the paper about summarizing what different presumptions of RLHF could be. I think going through these is actually kind of funny. It's fun to talk about these, because they're kind of grab bags of things that you'll see return throughout this podcast that we're talking about it. The core thing of RLHF that, in order to be a believer in this, is that RL actually works. It's like, if you have a reward function, you can optimize it in some way and get a different performance out of it, and you could do this at scale, and you could do this in really complex environments, which is, I don't know how to do that in all the domains. I don't know how to exactly make chat GPT. So it's kind of, we'll overshadow everything. And then there's, go from something kind of obvious like that, and then you read the von Neumann-Morgenstern utility theorem, which is essentially an economic theory that says you can weight different probabilities of different people, which is a theoretical piece of work that is the foundation of utilitarianism, and trying to quantify preferences is crucial to doing any sort of RLHF. And if you look into this, all of these things, there's way more you could go into if you're interested in any of these. So this is kind of like grabbing a few random things, and then kind of similar to that is the Bradley-Terry model, which is the fancy name for the pairwise preferences that everyone is doing. And then all the things that are like, that Anthropic and OpenAI figured out that you can do, which is that you can aggregate preferences from a bunch of different people and different sources. And then when you actually do RLHF, you extract things from that data, and then you train a model that works somehow. And we don't know, there's a lot of complex links there, but if you want to be a believer in doing this at scale, these are the sorts of things that you have to accept as preconditions for doing RLHF. Yeah.Swyx [00:10:09]: You have a nice chart of like the sort of intellectual history of RLHF that we'll send people to refer to either in your paper or in the YouTube video for this podcast. But I like the other slide that you have on like the presumptions that you need to have for RLHF to work. You already mentioned some of those. Which one's underappreciated? 
Like, this is the first time I've come across the VNM Utility Theorem.Nathan [00:10:29]: Yeah, I know. This is what you get from working with people like to my co-host on the podcast, the rhetoric is that sociologist by training. So he knows all these things and like who the philosophers are that found these different things like utilitarianism. But there's a lot that goes into this. Like essentially there's even economic theories that like there's debate whether or not preferences exist at all. And there's like different types of math you can use with whether or not you actually can model preferences at all. So it's pretty obvious that RLHF is built on the math that thinks that you can actually model any human preference. But this is the sort of thing that's been debated for a long time. So all the work that's here is like, and people hear about in their AI classes. So like Jeremy Bentham, like hedonic calculus and all these things like these are the side of work where people assume that preferences can be measured. And this is like, I don't really know, like, this is what I kind of go on a rant and I say that in RLHF calling things a preference model is a little annoying because there's no inductive bias of what a preference is. It's like if you were to learn a robotic system and you learned a dynamics model, like hopefully that actually mirrors the world in some way of the dynamics. But with a preference model, it's like, Oh my God, I don't know what this model, like I don't know what chat GPT encodes as any sort of preference or what I would want it to be in a fair way. Anthropic has done more work on trying to write these things down. But even like if you look at Claude's constitution, like that doesn't mean the model believes these things. It's just trained to prioritize these things. And that's kind of what the later points I'm looking at, like what RLHF is doing and if it's actually like a repeatable process in the data and in the training, that's just unknown. And we have a long way to go before we understand what this is and the link between preference data and any notion of like writing down a specific value.Alessio [00:12:05]: The disconnect between more sociology work versus computer work already exists, or is it like a recent cross contamination? Because when we had Tri Dao on the podcast, he said FlashAttention came to be because at Hazy they have so much overlap between systems engineer and like deep learning engineers. Is it the same in this field?Nathan [00:12:26]: So I've gone to a couple of workshops for the populations of people who you'd want to include this like R. I think the reason why it's not really talked about is just because the RLHF techniques that people use were built in labs like OpenAI and DeepMind where there are some of these people. These places do a pretty good job of trying to get these people in the door when you compare them to like normal startups. But like they're not bringing in academics from economics, like social choice theory. There's just too much. Like the criticism of this paper that this is based on is like, oh, you're missing these things in RL or at least this decade of RL and it's like it would be literally be bigger than the Sutton and Barto book if you were to include everyone. So it's really hard to include everyone in a principled manner when you're designing this. It's just a good way to understand and improve the communication of what RLHF is and like what is a good reward model for society. 
It really probably comes down to what an individual wants and it'll probably motivate models to move more in that direction and just be a little bit better about the communication, which is a recurring theme and kind of my work is like I just get frustrated when people say things that don't really make sense, especially when it's going to manipulate individual's values or manipulate the general view of AI or anything like this. So that's kind of why RLHF is so interesting. It's very vague in what it's actually doing while the problem specification is very general.Swyx [00:13:42]: Shall we go to the, I guess, the diagram here on the reinforcement learning basics? Yeah.Nathan [00:13:47]: So reinforcement learning, I kind of mentioned this, it's a trial and error type of system. The diagram and the slides is really this classic thing where you have an agent interacting with an environment. So it's kind of this agent has some input to the environment, which is called the action. The environment returns a state and a reward and that repeats over time and the agent learns based on these states and these rewards that it's seeing and it should learn a policy that makes the rewards go up. That seems pretty simple than if you try to mentally map what this looks like in language, which is that like the language models don't make this easy. I think with the language model, it's very hard to define what an environment is. So if the language model is the policy and it's generating, it's like the environment should be a human, but setting up the infrastructure to take tens of thousands of prompts and generate them and then show them to a human and collect the human responses and then shove that into your training architecture is very far away from working. So we don't really have an environment. We just have a reward model that returns a reward and the state doesn't really exist when you look at it like an RL problem. What happens is the state is a prompt and then you do a completion and then you throw it away and you grab a new prompt. We're really in as an RL researcher, you would think of this as being like you take a state, you get some completion from it and then you look at what that is and you keep kind of iterating on it and all of that isn't here, which is why you'll hear RLHF referred to as bandits problem, which is kind of like you choose one action and then you watch the dynamics play out. There's many more debates that you can have in this. If you get the right RL people in the room, then kind of like this is an RL even when you zoom into what RLHF is doing.Alessio [00:15:22]: Does this change as you think about a chain of thought reasoning and things like that? Like does the state become part of the chain that you're going through?Nathan [00:15:29]: There's work that I've mentioned on one slide called process reward models that essentially rewards each step in the chain of thought reasoning. It doesn't really give the part of interaction, but it does make it a little bit more fine grained where you can think about like calling it at least you have many states from your initial state. That formulation I don't think people have fully settled on. I think there's a bunch of great work out there, like even OpenAI is releasing a lot of this and let's verify step by step is there pretty great paper on the matter. 
I think in the next year that'll probably get made more concrete by the community, on like if you can easily draw out like if chain of thought reasoning is more like RL, we can talk about that more later. That's kind of a more advanced topic than we probably should spend all the time on.

Swyx [00:16:13]: RLHF for decision making. You have a slide here that compares pre-deep RL versus deep RL.

Nathan [00:16:19]: This is getting into the history of things, which is showing that the work that people are using now really came from well outside of NLP and it came before deep learning was big. Next up from this is the paper TAMER, which is from 2008, from some names that are still really relevant in kind of human centric RL, Bradley Knox and Peter Stone. If you have an agent take an action, you would just have a human give a score from zero to one as a reward rather than having a reward function. And then with that classifier, you can do something with a policy that learns to take actions to maximize that reward. It's a pretty simple setup. It works in simple domains. And then the reason why this is interesting is you compare it to the paper that everyone knows, which is this Paul Christiano et al. Deep Reinforcement Learning from Human Preferences paper, which is where they showed that learning from human preferences, you can solve like the basic RL tasks at the time. So various control problems in simulation, and this kind of human preferences approach had higher rewards in some environments than if you just threw RL at the environment that returned a reward. So the preferences thing was you took two trajectories, so in this case it was like complete trajectories of the agent, and the human was labeling which one is better. You can see how this kind of comes to be like the pairwise preferences that are used today that we'll talk about. And there's also a really kind of interesting nugget, which is that the trajectory that the humans were labeling over has a lot more information than the RL algorithm would see if you just had one state, which is kind of why people think that's why the performance in this paper was so strong. But I still think that it's surprising that there isn't more RL work of this style happening now. This paper is in 2017. So it's like six years later and I haven't seen things that are exactly similar, but it's a great paper to understand where stuff that's happening now kind of came from.

Swyx [00:17:58]: Just on the Christiano paper, you mentioned the performance being strong. I don't remember what results should I have in mind when I think about that paper?

Nathan [00:18:04]: It's mostly like if you think about an RL learning curve, which is like on the X axis you have environment interactions, on the Y axis you have performance. You can think about different ablation studies between algorithms. So I think they use A2C, which I don't even remember what that stands for, as their baseline. But if you do the human preference version on a bunch of environments, like with the human preference labels, the agent was able to learn faster than if it just learned from the signal from the environment, which means it's happening because the reward model has more information than the agent would. But like the fact that it can do better, I was like, that's pretty surprising to me because RL algorithms are pretty sensitive. So I was like, okay.

Swyx [00:18:41]: It's just one thing I do want to establish as a baseline for our listeners. We are updating all the weights.
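[Editor's note: as a rough illustration of the trajectory comparison setup described above (the Christiano et al. style), the reward model can be trained so that the probability a human prefers trajectory A over trajectory B is a softmax over the summed predicted rewards. This is a sketch under that assumption, not the paper's actual code; the inputs are toy per-step reward predictions.]

```python
import math

def preference_probability(rewards_a, rewards_b):
    """P(human prefers trajectory A over B) from per-step predicted rewards,
    using a Bradley-Terry / softmax over the summed returns."""
    return_a, return_b = sum(rewards_a), sum(rewards_b)
    # note: exp of large returns can overflow; fine for a toy example
    return math.exp(return_a) / (math.exp(return_a) + math.exp(return_b))

def preference_loss(rewards_a, rewards_b, human_label):
    """Cross-entropy loss; human_label is 1.0 if A was preferred, 0.0 if B was."""
    p_a = preference_probability(rewards_a, rewards_b)
    return -(human_label * math.log(p_a) + (1 - human_label) * math.log(1 - p_a))

# Toy example: a labeler preferred trajectory A over trajectory B.
print(preference_loss([0.2, 0.5, 0.1], [0.0, -0.3, 0.4], human_label=1.0))
```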
In some sense, the next token prediction task of training a language model is a form of reinforcement learning. Except that it's not from human feedback. It's just self-supervised learning from a general corpus. There's one distinction which I love, which is that you can actually give negative feedback. Whereas in a general sort of pre-training situation, you cannot. And maybe like the order of magnitude of feedback, like the Likert scale that you're going to talk about, that actually just gives more signal than a typical training process would do in a language model setting. Yeah.

Nathan [00:19:15]: I don't think I'm the right person to comment exactly, but like you can make analogies that reinforcement learning is self-supervised learning as well. Like there are a lot of things that will point to that. I don't know whether or not it's a richer signal. I think that could be seen in the results. It's a good thing for people to look into more. Since reinforcement learning is so much less compute, it is a richer signal in terms of its impact. Because if they could do what RLHF is doing at pre-training, they would, but they don't know how to have that effect in like a stable manner. Otherwise everyone would do it.

Swyx [00:19:45]: On a practical basis, as someone fine-tuning models, I have often wished for negative fine-tuning, which pretty much doesn't exist in OpenAI land. And it's not the default setup in open-source land.

Nathan [00:19:57]: How does this work in like diffusion models and stuff? Because you can give negative prompts to something like Stable Diffusion or whatever. It's for guidance.

Swyx [00:20:04]: That's for CLIP guidance.

Nathan [00:20:05]: Is that just from like how they prompt it then? I'm just wondering if we could do something similar. It's another tangent.

Swyx [00:20:10]: I do want to sort of spell that out for people in case they haven't made the connection between RLHF and the rest of the training process. They might have some familiarity with it.

Nathan [00:20:19]: Yeah. The upcoming slides can really dig into this, which is like, in this 2018 paper, there was a position paper from a bunch of the same authors from the Christiano paper and from the OpenAI work that everyone knows, which is like, they write a position paper on what a preference reward model could do to solve alignment for agents. That's kind of based on two assumptions. The first assumption is that we can learn user intentions to a sufficiently high accuracy. That doesn't really land with me because I don't know what that means. But the second one is pretty telling in the context of RLHF, which is: for many tasks we want to solve, evaluation of outcomes is easier than producing the correct behavior. And this is the whole thing. It's like we can compare two poems that the model generates, and it can be viewed as liking a positive example, or it could be viewed as really disliking a negative example. And that's what I think a lot of people are doing in like the harm space: a harmful response from a language model, whether or not you agree with the company's definition of harms, is a really bad negative example, and they downweight it by preferring something more benign in the RLHF process, among other ways of dealing with safety. So that's a good way of saying it's like this is core, this kind of comparison and positive or negative example is core to all of the RLHF work that has continued.

Swyx [00:21:29]: People often say, I don't know what I want, but I'll know when I see it.
This is that expressed in reinforcement learning tools.

Nathan [00:21:35]: Yeah, it is. Yeah, it is. That's what everyone's doing in the preference modeling stage that we'll get to. Yeah. Yeah. And you can see there are more papers. This is really just to have all the links for people that go deeper. There's a Ziegler et al. paper in 2019, which shows that you can do this RLHF process on language models. This familiar diagram starts to emerge in 2019, and it's just to show that this goes really far back. I think we can kind of breeze through some of these. And then 2020 is the first OpenAI experiment that I think caught people's eyes, which is this Learning to Summarize experiment. It has this three-step process that we'll go into more when I kind of go into the main concepts. But this is like the first time you see this diagram that they reuse with InstructGPT, they reuse with ChatGPT. And the types of examples that they would have, I don't think I need to read these exactly, but one that I have read a whole bunch of times is like, they took these prompts from Reddit that were like, explain like I'm five, or get career advice, and people really pour their heart and soul into these. So these are like multi-paragraph pieces of writing. And then they essentially do comparisons between a vanilla language model, like I think it was either GPT-2 or GPT-3, I don't always get the exact years.

Swyx [00:22:42]: 3 was early 2020. So that's about right.

Nathan [00:22:45]: Yeah. So this is probably done with GPT-2. It doesn't really matter. But the language model does normal things when you do few shot, which is like it repeats itself. It doesn't have nice text. And what they did is that this was the first time where the language model would generate like pretty nice text from an output. It was restricted to the summarization domain. But I think that, I guess this is where I wish I was paying attention more, because I would see the paper, but I didn't know to read the language model outputs and kind of understand this qualitative sense of the models very well then. Because you look at the plots in the papers, these Learning to Summarize and InstructGPT papers have incredibly pretty plots, just like nicely separated lines with error bars, and they're like: supervised fine-tuning works, the RL step works. But if you were early to see like how different the language that was written by these models was, I think you could have been early to like things like ChatGPT and knowing RLHF would matter. And now I think the good people know to chat with language models, but not even everyone does this. Like people are still looking at numbers. And I think OpenAI probably figured it out when they were doing this, how important that could be. And then they had years to kind of chisel away at that, and that's why they're doing so well now. Yeah.

Swyx [00:23:56]: I mean, arguably, you know, it's well known that ChatGPT was kind of an accident, that they didn't think it would be that big of a deal. Yeah.

Nathan [00:24:02]: So maybe they didn't. Maybe they didn't, but they were getting the proxy that they needed.

Swyx [00:24:06]: I've heard off the record from other labs that it was in the air. If OpenAI didn't do it, someone else would have done it. So you've mentioned a couple of other papers that are very seminal to this period.
And I love how you say way back when in referring to 2019.

Nathan [00:24:19]: It feels like it in my life.

Swyx [00:24:21]: So how much should people understand the relationship between RLHF, instruction tuning, PPO, KL divergence, anything like that? Like how would you construct the level of knowledge that people should dive into? What should people know at the high level? And then if people want to dive in deeper, where do they go? Is instruct tuning important here, or is that part of the overall process towards modern RLHF?

Nathan [00:24:44]: I think for most people, instruction tuning is probably still more important in their day to day life. I think instruction tuning works very well. You can write samples by hand that make sense. You can get the model to learn from them. You can do this with very low compute. It's easy to do almost in like no-code solutions at this point. And the loss function is really straightforward. And then if you're interested in RLHF, you can kind of learn from it from a different perspective, which is like how the instruction tuning distribution makes it easier for your RLHF model to learn. There's a lot of details depending on your preference data, if it's close to your instruction model or not, if that matters. But that's really at the RLHF stage. So I think it's nice to segment and just kind of understand what your level of investment and goals are. I think instruction tuning still can do most of what you want to do. And it's like, if you want to think about RLHF, at least before DPO really had taken off at all, it would be like, do you want to have a team of at least like five people if you're really thinking about doing RLHF? I think DPO makes it a little bit easier, but that's still really limited to kind of one dataset that everyone's using at this point. Like everyone's using this UltraFeedback dataset and it boosts AlpacaEval, MT-Bench, TruthfulQA and like the qualitative model a bit. We don't really know why. It's like, it might just be a dataset combined with the method, but you've got to be ready for a bumpy ride if you're wanting to try to do RLHF. I don't really recommend most startups to do it unless it's like going to provide them a clear competitive advantage in their kind of niche, because you're not going to make your model ChatGPT-like, better than OpenAI or anything like that. You've got to accept that there's some exploration there and you might get a vein of benefit in your specific domain, but I'm still like, oh, be careful going into the RLHF can of worms. You probably don't need to.

Swyx [00:26:27]: Okay. So there's a bit of a time skip in what you mentioned. DPO is like a couple months old, so we'll leave that towards the end. I think the main result that I think most people talk about at this stage, we're talking about September 2020 and then going into, I guess maybe last year, was Vicuña as one of the more interesting applications of instruction tuning that pushed LLaMA 1 from, let's say a GPT-3-ish model to a GPT-3.5 model in pure open source with not a lot of resources. I think, I mean, they said something like, you know, they used like under $100 to make this.

Nathan [00:26:58]: Yeah. Like instruction tuning can really go a long way. I think the claims of ChatGPT level are long overblown in most of the things in open source. That's not to say Vicuña wasn't a huge step, and it's just kind of showing that instruction tuning with the right data will completely change what it feels like to talk with your model.
Yeah.

Swyx [00:27:19]: From text completion to actually chatting back and forth. Yeah. Yeah.

Nathan [00:27:23]: Instruction tuning can be multi-turn. Just having a little bit of data that's like a couple of turns can go a really long way. That was like the story of the whole first part of the year, is like people would be surprised by how far you can take instruction tuning on a small model. I think the things that people see now is like the small models don't really handle nuance as well and they could be more repetitive even if they have really good instruction tuning. But if you take that kind of 7 to 70 billion parameter jump, like the instruction tuning at the bigger model is like robustness, little things make more sense. So that's still just with instruction tuning and scale more than anything else.

Swyx [00:27:56]: Excellent. Shall we go to the technical overview?

Nathan [00:27:58]: Yeah. This is kind of where we go through my own version of this like three-phase process. You can talk about instruction tuning, which we've talked about a lot. It's funny because of all these things, instruction tuning has the fewest slides, even though it's the most practical thing for most people. We could save the debate for like if the big labs still do instruction tuning for later, but that's a coming wave for people. And then like preference data and training, and then kind of like what does reinforcement learning optimization actually mean? We talk about these sequentially because you really have to be able to do each of them to be able to do the next one. You need to be able to have a model that's chatty or helpful instruction following. Every company has their own word that they like to assign to what instructions mean. And then once you have that, you can collect preference data and do some sort of optimization.

Swyx [00:28:39]: When you say word, you mean like angle bracket inst, or do you mean something else?

Nathan [00:28:42]: Oh, I don't even know what inst means, but just saying like they use their adjective that they like. I think Anthropic also, like steerable is another one.

Swyx [00:28:51]: Just the way they describe it. Yeah.

Nathan [00:28:53]: So like instruction tuning, we've covered most of this. It's really about like, you should try to adapt your models to specific needs. It makes models that were only okay extremely comprehensible. A lot of the times it's where you start to get things like chat templates. So if you want to do system prompts, if you want to ask your model, like, act like a pirate, that's one of the ones I always do, which is always funny, but like whatever you like, act like a chef, like anything, this is where those types of things that people really know in language models start to get applied. So it's good as a kind of starting point, because this chat template is used in RLHF and all of these things down the line, but it's a basic pointer. It's like, once you see this with instruction tuning, you really know it, which is like you take things like Stack Overflow where you have a question and an answer. You format that data really nicely. There's much more tricky things that people do, but I still think the vast majority of it is question answer. Please explain this topic to me, generate this thing for me. That hasn't changed that much this year. I think people have just gotten better at scaling up the data that they need. Yeah, this is where this talk will kind of take a whole left turn into more technical detail land.
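[Editor's note: since chat templates come up here, this is a small illustration of what "formatting the data really nicely" for instruction tuning can look like. The role markers below are a made-up, generic format, not the actual special tokens of any particular model family.]

```python
# Hypothetical chat template for an instruction-tuning sample; the markers are illustrative.

def format_example(system, question, answer):
    return (
        f"<|system|>\n{system}\n"
        f"<|user|>\n{question}\n"
        f"<|assistant|>\n{answer}<|end|>"
    )

# e.g. a Stack Overflow-style question/answer pair turned into one training sample
sample = format_example(
    system="Act like a pirate.",  # system prompts carry persona/behavior instructions
    question="How do I reverse a list in Python?",
    answer="Arr, ye can use my_list[::-1] or my_list.reverse(), matey.",
)
print(sample)
```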
I put a slide with the RLHF objective, which I think is good for people to know. I've started going back to this more, just to kind of understand what is trying to happen here and what type of math people could do. I think because of this algorithm, we've mentioned this, it's in the air, direct preference optimization, but everything kind of comes from an equation of trying to learn a policy that maximizes the reward. The reward is some learned metric. A lot can be said about what the reward should be, subject to some constraint. The most popular constraint is the KL constraint, which is just a distributional distance. Essentially in language models, that means if you have a completion from your instruction or RLHF model, you can compare that completion to a base model. And looking at the log probs from the model, which are essentially how likely each token is, you can see a rough calculation of the distance between these two models, just as a scalar number. I think what that actually looks like in code, you can look at it. It'd be like a sum of log probs that you get right from the model. It'll look much simpler than it sounds, but it is just to make the optimization kind of stay on track. Make sure it doesn't overfit to the RLHF data. Because we have so little data in RLHF, overfitting is really something that could happen. I think it'll fit to specific features that labelers like to see, that the model likes to generate, punctuation, weird tokens like calculator tokens. It could overfit to anything if it's in the data a lot and it happens to be in a specific format. And the KL constraint prevents that. There's not that much documented work on that, but there's a lot of people that know if you take that away, it just doesn't work at all. I think it's something that people don't focus on too much. But the objective, as I said, it's just kind of, you optimize the reward. The reward is where the human part of this comes in. We'll talk about that next. And then subject to a constraint, don't change the model too much. The real questions are, how do you implement the reward? And then how do you make the reward go up in a meaningful way? So like a preference model, the task is kind of to design a human reward. I think the equation that most of the stuff is based on right now is something called a Bradley-Terry model, which is like a pairwise preference model where you compare two completions and you say which one you like better. I'll show an interface that Anthropic uses here. And the Bradley-Terry model is really a fancy probability between two selections. And what's happening in the math is that you're looking at the probability that the chosen completion, the one you like better, is actually the better completion over the rejected completion. And what these preference models do is they assume this probability is correlated to reward. So if you just sample from this probability, it'll give you a scalar. And then you use that reward later on to signify what piece of text is better. I'm kind of inclined to breeze through the math stuff because otherwise, it's going to be not as good to listen to.

Alessio [00:32:49]: I think people want to hear it. I think there's a lot of higher level explanations out there. Yeah.

Nathan [00:32:55]: So the real thing is you need to assign a scalar reward of how good a response is. And that's not necessarily that easy to understand. Because if we take it back to one of the first works, I mentioned this TAMER thing for decision making.
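[Editor's note: a rough sketch of the two pieces just described, with placeholder inputs rather than any framework's API. The KL term can be estimated from the difference in per-token log probabilities between the policy and the frozen reference model, and the Bradley-Terry model turns the difference between two scalar rewards into a probability that the chosen completion is better.]

```python
import math

def kl_penalized_reward(reward, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward-model scalar minus a KL-style penalty keeping the policy near the base model.

    The log-prob lists are per-token values for the same completion under the RLHF
    policy and the frozen reference model; the sum of their differences is a
    sample-based estimate of the KL divergence on this completion.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward - beta * kl_estimate

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: P(chosen beats rejected) = sigmoid(r_c - r_r),
    and we minimize the negative log of that probability."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers only.
print(kl_penalized_reward(1.3, [-0.2, -0.9, -0.4], [-0.4, -1.1, -0.8], beta=0.1))
print(bradley_terry_loss(reward_chosen=0.8, reward_rejected=0.1))
```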
People tried that with language models, which is, if you have a prompt and a completion and you just have someone rate it from 0 to 10, could you then train a reward model on all of these completions and 0 to 10 ratings and see if you can get ChatGPT with that? And the answer is really kind of no. Like a lot of people tried that. It didn't really work. And then that's why they tried this pairwise preference thing. And it happened to work. And this Bradley-Terry model comes from the 50s. It's from these fields that I was mentioning earlier. And it's wild how much this happens. I mean, this screenshot I have in the slides is from the DPO paper. I think it might be the appendix. But it's still really around in the literature of what people are doing for RLHF.

Alessio [00:33:45]: Yeah.

Nathan [00:33:45]: So it's a fun one to know.

Swyx [00:33:46]: I'll point out one presumption that this heavily relies on. You mentioned this as part of your six presumptions that we covered earlier, which is that you can aggregate these preferences. This is not exactly true among all humans, right? I have a preference for one thing. You have a preference for a different thing. And actually coming from economics, you mentioned economics earlier. There's a theorem or a name for this called Arrow's impossibility theorem, which I'm sure you've come across.

Nathan [00:34:07]: It's one of the many kind of things we throw around in the paper.

Swyx [00:34:10]: Right. Do we just ignore it?

Nathan [00:34:14]: We just, yeah, just aggregate. Yeah. I think the reason this really is done on a deep level is that you're not actually trying to model any contestable preference in this. You're not trying to go into things that are controversial or anything. It's really the notion of preference is trying to stay around correctness and style rather than any meaningful notion of preference. Because otherwise these companies, they don't want to do this at all. I think that's just how it is. And it's like, if you look at what people actually do. So I have a bunch of slides on the feedback interface. And they all publish this.

Swyx [00:34:43]: It's always at the appendices of every paper.

Nathan [00:34:47]: There's something later on in this talk, which is like, but it's good to mention. And this is when you're doing this preference collection, you write out a very long document of instructions to people that are collecting this data. And it's like, this is the hierarchy of what we want to prioritize. Something around like factuality, helpfulness, honesty, harmlessness. These are all different things. Every company will rank these in different ways, provide extensive examples. It's like, if you see these two answers, you should select this one and why. And all of this stuff. And then my kind of head scratching is like, why don't we check if the models actually do these things that we tell the data annotators to collect? But I think it's because it's hard to make that attribution. And it's hard to test if a model is honest and stuff. It would just be nice to understand the kind of causal mechanisms as a researcher, or like if our goals are met. But at a simple level, what it boils down to, I have a lot more images than I need. It's like you're having a conversation with an AI, something like ChatGPT. You get shown two responses or more in some papers, and then you have to choose which one is better. I think something you'll hear a lot in this space is something called a Likert scale. Likert is a name.
It's a name for probably some research in economics, decision theory, something. But essentially, it's a type of scale where if you have integers from like one to eight, the middle numbers will represent something close to a tie. And the smallest numbers will represent one model being way better than the other. And the biggest numbers will be like the other model is better. So in the case of one to eight, if you're comparing models A to B, you return a one if you really liked option A, you return eight if you really like B, and then like a four or five if they were close. There's other ways to collect this data. This one's become really popular. We played with it a bit at Hugging Face. It's hard to use. Filling out this preference data is really hard. You have to read like multiple paragraphs. It's not for me. Some people really like it, I hear. I'm like, I can't imagine sitting there and reading AI-generated text and having to do that for my job. But a lot of these early papers in RLHF have good examples of what was done. The one I have here is from Anthropic's collection demo, because it was from slides that I did with Anthropic. But you can look up these in the various papers. It looks like ChatGPT with two responses, and then you have an option to say which one is better. It's nothing crazy. The infrastructure is almost exactly the same, but they just log which one you think is better. I think places like Scale are also really big in this, where a lot of the labeler companies will help control like who's doing how many samples. You have multiple people go over the same sample once and like what happens if there's disagreement. I don't really think this disagreement data is used for anything, but it's good to know like what the distribution of prompts is, who's doing it, how many samples you have, controlling the workforce. All of this is very hard. A last thing to add is that a lot of these companies do collect optional metadata. I think the Anthropic example shows a rating of like how good was the prompt or the conversation from good to bad, because things matter. Like there's kind of a quadrant of preference data in my mind, which is you're comparing a good answer to a good answer, which is like really interesting signal. And then there's kind of the option of you're comparing a bad answer to a bad answer, which is like you don't want to train your model on two different issues. This is like, we did this at Hugging Face and it was like, our data was like, we don't know if we can use this, because a lot of it was just bad answer to bad answer, because you're like rushing to try to do this for a real contract. And then there's also good answer to bad answer, which I think is probably pretty reasonable to include. You just prefer the good one and move on with your life. But those are very different scenarios. I think the OpenAIs of the world are all in good answer, good answer, and have learned to eliminate everything else. But when people try to do this in open source, it's probably like what Open Assistant saw, which is like, there's just a lot of bad answers in your preference data. And you're like, what do I do with this? Metadata flags can help. I threw in the InstructGPT metadata. You can see how much they collect here. And like everything from the model fails to actually complete the task, hallucinations, different types of offensive or dangerous content, moral judgment, expresses opinion.
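[Editor's note: a small sketch of how a Likert-style comparison plus optional metadata flags might be stored and reduced to the chosen/rejected pairs used for reward-model training. The field names, the 1-8 tie handling, and the flag keys are illustrative assumptions, not any lab's actual schema.]

```python
from dataclasses import dataclass, field

@dataclass
class Comparison:
    prompt: str
    response_a: str
    response_b: str
    rating: int                                 # 1-8 Likert: 1 = strongly prefer A, 8 = strongly prefer B
    flags: dict = field(default_factory=dict)   # optional metadata, e.g. {"hallucination": True}

def to_preference_pair(c: Comparison, tie_band=(4, 5)):
    """Reduce a Likert rating to (prompt, chosen, rejected), dropping near-ties."""
    if c.rating in tie_band:
        return None                                      # too close to call; skip for pairwise training
    if c.rating < tie_band[0]:
        return c.prompt, c.response_a, c.response_b      # A preferred
    return c.prompt, c.response_b, c.response_a          # B preferred

example = Comparison("Explain RLHF simply.", "RLHF trains a model on...", "It's magic.",
                     rating=2, flags={"hallucination": False})
print(to_preference_pair(example))
```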
Like, I don't know exactly if they're doing this now, but you can kind of see why doing RLHF at scale and prioritizing a lot of different endpoints would be hard, because these are all things I'd be interested in if I was scaling up a big team to do RLHF, and like what is going into the preference data. You do an experiment and you're like, okay, we're going to remove all the data where they said the model hallucinates, like just that, and then retrain everything. Like, what does that do?

Swyx [00:38:59]: Yeah, so hallucination is big, but some of these other metadata categories, and I've seen this in a lot of papers, it's like, does it contain sexual content? Does it express a moral judgment? Does it denigrate a protected class? That kind of stuff, very binary. Should people try to adjust for this at the RLHF layer, or should they put it as a pipeline where they have a classifier as a separate model that grades the model output?

Nathan [00:39:20]: Do you mean for training or like at deployment? Deployment. I do think that people are doing it at deployment. I think we've seen safety and other things in the RLHF pipeline. Like Llama 2 is famous for kind of having these helpfulness and safety reward models. Deep in the Gemini report is something like Gemini has four things, which is like helpfulness, factuality, maybe safety, maybe something else. But places like Anthropic and ChatGPT and Bard almost surely have a classifier after, which is like, is this text good? Is this text bad? That's not that surprising, I think, because you could use like a hundred times smaller language model and do much better at filtering than RLHF. But I do think it's still so deeply intertwined with the motivation of RLHF to be for safety that some of these categories still persist. I think that's something that'll kind of settle out, I think.

Swyx [00:40:11]: I'm just wondering if it's worth collecting this data for the RLHF purpose, if you're not going to use it in any way, separate model to-

Nathan [00:40:18]: Yeah, I don't think OpenAI will collect all of this anymore, but I think for research perspectives, it's very insightful to know, but it's also expensive. So essentially your preference data scales with how many minutes it takes for you to do each task, and every button is like, it scales pretty linearly. So it's not cheap stuff.

Swyx [00:40:35]: Can we, since you mentioned expensiveness, I think you may have joined one of our spaces back when Llama 2 was released. We had an estimate from you that was something on the order of Llama 2 cost $3 to $6 million to train GPU-wise, and then it was something like $20 to $30 million in preference data. Is that something that's still in the ballpark? I don't need precise numbers.

Nathan [00:40:56]: I think it's still a ballpark. I know that the 20 million was off by a factor of four because I was converting from a prompt number to a total data point. So essentially when you do this, if you have a multi-turn setting, each turn will be one data point, and the Llama 2 paper reports like 1.5 million data points, which could be like 400,000 prompts. So I would say like 6 to 8 million is safe to say that they're spending, if not more. They're probably also buying other types of data and/or throwing out data that they don't like, but it's very comparable to compute costs. But the compute costs listed in the paper always are way lower, because all they have to say is like, what does one run cost? But they're running tens or hundreds of runs.
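[Editor's note: a back-of-envelope version of the numbers discussed here, purely illustrative. The per-comparison cost and the turns-per-prompt factor are assumptions chosen to land in the same ballpark as the figures quoted in the conversation, not reported values.]

```python
# Rough, illustrative preference-data cost arithmetic (assumed unit costs, not reported ones).
data_points = 1_500_000          # preference comparisons reported in the Llama 2 paper
turns_per_prompt = 3.75          # assumed multi-turn factor, giving roughly 400k prompts
prompts = data_points / turns_per_prompt

cost_per_comparison = 4.50       # assumed dollars per human-labeled comparison
total_cost = data_points * cost_per_comparison
print(f"~{prompts:,.0f} prompts, ~${total_cost / 1e6:.1f}M for preference data")
# ~400,000 prompts, ~$6.8M -- in the same ballpark as the $6 to $8 million estimate above
```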
So it's like, okay, like... Yeah, it's just kind of a meaningless number. Yeah, the data number would be more interesting.

Alessio [00:41:42]: What's the depreciation of this data?

Nathan [00:41:46]: It depends on the method. Like some methods, people think that it's more sensitive to the, this is what I was saying, it was like, does the type of instruction tuning you do matter for RLHF? So like, depending on the method, some people are trying to figure out if you need to have, like, what is called, this is very confusing, it's called on-policy data, which is like your RLHF data is from your instruction model. I really think people in open source and academics are going to figure out how to use any preference data on any model just because they're scrappy. But there's been an intuition that to do like PPO well and keep improving the model over time and do like what Meta did and what people think that OpenAI does, is that you need to collect new preference data to kind of edge the distribution of capabilities forward. So there's a depreciation where like the first batch of data you collect isn't really useful for training the model when you have the fifth batch. We don't really know, but it's a good question. And I do think that if we had all the Llama data, we wouldn't know what to do with all of it. Like probably like 20 to 40% would be pretty useful for people, but not the whole dataset. Like a lot of it's probably kind of gibberish because they had a lot of data in there.

Alessio [00:42:51]: So do you think like the open source community should spend more time figuring out how to reuse the data that we have, or like generate more data? I think that's one of the-

Nathan [00:43:02]: I think people are kind of locked into using synthetic data. People also think that synthetic data, like GPT-4, is more accurate than humans at labeling preferences. So if you look at these diagrams, like humans are about 60 to 70% agreement. And we're like, that's what the models get to. And if humans are about 70% agreement or accuracy, like GPT-4 is like 80%. So it is a bit better, which is like one way of saying it.

Swyx [00:43:24]: Humans don't even agree with humans 50% of the time.

Nathan [00:43:27]: Yeah, so like that's the thing. It's like the human disagreement or the lack of accuracy should be like a signal, but how do you incorporate that? It's really tricky to actually do that. I think that people just keep using GPT-4 because it's really cheap. It's one of my go-tos, like I just say this over and over again, is like GPT-4 for data generation, all terms and conditions aside because we know OpenAI has this stuff, is like very cheap for getting pretty good data compared to compute or salary of any engineer or anything. So it's like, tell people to go crazy generating GPT-4 data if you're willing to take the organizational like cloud of, should we be doing this? But I think most people have accepted that you kind of do this, especially at individuals. Like they're not gonna come after individuals. I do think more companies should think twice before doing tons of OpenAI outputs. Also just because the data contamination and what it does to your workflow is probably hard to control at scale.

Swyx [00:44:21]: And we should just mention, at the time of recording, we've seen the first example of OpenAI enforcing their terms of service. ByteDance was caught, reported to be training on GPT-4 data, and they got their access to OpenAI revoked.
So that was one example.

Nathan [00:44:36]: Yeah, I don't expect OpenAI to go too crazy on this, cause they're just gonna, there's gonna be so much backlash against them. And like, everyone's gonna do it anyways.

Swyx [00:44:46]: And what's at stake here, to spell it out, is like, okay, it costs like $10 to collect one data point from a human. It's gonna cost you like a 10th of a cent with OpenAI, right? So like it's just orders of magnitude cheaper. And therefore people-

Nathan [00:44:58]: Yeah, and it's like the signal you get from humans for preferences isn't that high. The signal that you get from humans for instructions is pretty high, but it is also very expensive. So like the human instructions are definitely like by far and away the best ones out there compared to the synthetic data. But I think like the synthetic preferences are just so much easier to get some sort of signal running with, and you can work in other, I think people will start working in other goals there between safety and whatever. That's something that's taking off and we'll kind of see that. I think in 2024, at some point, people will start doing things like constitutional AI for preferences, which will be pretty interesting. I think we saw how long it took RLHF to get started in open source. Instruction tuning was like the only thing that was really happening until maybe like August, really. I think Zephyr was the first model that showed success with RLHF in the public, but that's a long time from everyone knowing that it was something that people are interested in to having any like check mark. So I accept that and think the same will happen with constitutional AI. But once people show that you can do it once, they continue to explore.

Alessio [00:46:01]: Excellent.

Swyx [00:46:01]: Just in the domain of human preference data suppliers, Scale.ai very happily will tell you that they supplied all that data for Llama 2. The other one is probably interesting, LMSYS from Berkeley. What they're running with Chatbot Arena is perhaps a good store of human preference data.

Nathan [00:46:17]: Yeah, they released some toxicity data. They, I think, are generally worried about releasing data because they have to process it and make sure everything is safe, and they're really lightweight work. I think they're trying to release the preference data. I have, if we make it to evaluation, I'd pretty much say that Chatbot Arena is the best limited evaluation that people have to learn how to use language models. And like, it's very valuable data. They also may share some data with people that they host models from. So like if your model is hosted there and you pay for the hosting, you can get the prompts, because you're pointing the endpoint at it and that gets pinged to you, and any real LLM inference stack saves the prompts that...

AI Breakdown
Anthropic's Perspective: Constitutional AI - The Ultimate Solution for ChatGPT Competitor Development

AI Breakdown

Play Episode Listen Later Jan 9, 2024 16:09


In this episode, we navigate through Anthropic's rationale behind endorsing Constitutional AI as the ultimate solution for training ChatGPT competitors. Join me as we explore the principles that make this approach stand out in the competitive AI landscape Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: ⁠⁠https://AIBox.ai/⁠⁠ AI Facebook Community Learn more about AI in Music Learn more about AI Models

UiPath Daily
Anthropic's Stand: Constitutional AI's Edge in ChatGPT Training

UiPath Daily

Play Episode Listen Later Jan 7, 2024 15:32


Delve into Anthropic's firm stance on the superiority of Constitutional AI for training ChatGPT, revealing the strategic edge it provides in AI development. Get on the AI Box Waitlist: AIBox.ai Join our ChatGPT Community: Facebook Group Follow me on Twitter: Twitter

AI for Non-Profits
Anthropic's Approach: Constitutional AI for ChatGPT Competitors

AI for Non-Profits

Play Episode Listen Later Jan 7, 2024 15:32


Explore Anthropic's innovative approach employing Constitutional AI to enhance ChatGPT competitors, revealing its potential for shaping the future of AI training. Get on the AI Box Waitlist: AIBox.ai Join our ChatGPT Community: Facebook Group Follow me on Twitter: Twitter

Open AI
Anthropic's Strategy: Constitutional AI's Role in ChatGPT Training

Open AI

Play Episode Listen Later Jan 7, 2024 15:32


Unveil Anthropic's strategic approach, advocating Constitutional AI's pivotal role in training ChatGPT competitors, and the transformative impact it promises for AI advancement. Get on the AI Box Waitlist: AIBox.ai Join our ChatGPT Community: Facebook Group Follow me on Twitter: Twitter

Midjourney
The Constitutional AI Edge: Anthropic's Approach to ChatGPT Competition

Midjourney

Play Episode Listen Later Jan 4, 2024 16:09


In this episode, discover why Anthropic champions Constitutional AI as the prime avenue for training ChatGPT competitors. Join the discussion as we dissect the features that set this method apart and its potential impact on the AI development arena. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: ⁠⁠https://AIBox.ai/⁠⁠ AI Facebook Community Learn more about AI in Video Learn more about Open AI

The Linus Tech Podcast
Trailblazing AI with Anthropic: Constitutional AI Unveiled

The Linus Tech Podcast

Play Episode Listen Later Dec 28, 2023 16:09


In this episode, we unravel the concept of Constitutional AI as championed by Anthropic for the training of ChatGPT-like models. We discuss the nuances of this approach and its potential to lead the next wave of AI advancements. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: ⁠⁠https://AIBox.ai/⁠⁠ AI Facebook Community Learn more about LLM's Learn more about AI

The Elon Musk Podcast
Anthropic's Perspective: Constitutional AI in Crafting ChatGPT Competitors

The Elon Musk Podcast

Play Episode Listen Later Dec 26, 2023 16:09


Join me as I dissect Anthropic's argument advocating for Constitutional AI in shaping ChatGPT rivals, unraveling its potential, and the nuances behind this pioneering approach. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: https://AIBox.ai/ AI Facebook Community

The Mark Cuban Podcast
Exploring Anthropic's Claim: Constitutional AI as the Pinnacle for ChatGPT Challenger Development

The Mark Cuban Podcast

Play Episode Listen Later Dec 25, 2023 16:09


In this episode, we unpack Anthropic's argument advocating for Constitutional AI in the development of robust ChatGPT alternatives. Discover the principles behind this strategy and its potential implications for the AI landscape. Invest in AI Box: https://Republic.com/ai-box Get on the AI Box Waitlist: ⁠⁠https://AIBox.ai/⁠⁠ AI Facebook Community

ChatGPT: News on Open AI, MidJourney, NVIDIA, Anthropic, Open Source LLMs, Machine Learning
Anthropic's Case: Constitutional AI as the Ultimate ChatGPT Competitor Trainer

ChatGPT: News on Open AI, MidJourney, NVIDIA, Anthropic, Open Source LLMs, Machine Learning

Play Episode Listen Later Dec 9, 2023 17:26


Join us for a compelling episode, "Anthropic's Case: Constitutional AI as the Ultimate ChatGPT Competitor Trainer," where we explore Anthropic's bold assertion regarding the efficacy of Constitutional AI in training ChatGPT competitors. Delve into the innovative methods and reasoning behind this approach, shedding light on the potential for transformative advancements in the field of AI. Don't miss this thought-provoking conversation on the future of AI development and the role Constitutional AI might play in shaping the next generation of AI models. Get on the AI Box Waitlist: https://AIBox.ai/ Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

AI Applied: Covering AI News, Interviews and Tools - ChatGPT, Midjourney, Runway, Poe, Anthropic
Unlocking ChatGPT's Potential: Anthropic's Revolutionary AI Trainer

AI Applied: Covering AI News, Interviews and Tools - ChatGPT, Midjourney, Runway, Poe, Anthropic

Play Episode Listen Later Oct 18, 2023 17:15


Prepare for an engaging episode, "Unlocking ChatGPT's Potential: Anthropic's Revolutionary AI Trainer," as we delve into the innovative training methods introduced by Anthropic. Explore how Constitutional AI is poised to become the ultimate ChatGPT competitor trainer, potentially unlocking new levels of AI performance. Join us for an insightful conversation on the transformative implications of Anthropic's approach and the future of AI development. Get on the AI Box Waitlist: https://AIBox.ai/ Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

A Beginner's Guide to AI
From Asimov to Anthropic: The Evolution of Ethical AI

A Beginner's Guide to AI

Play Episode Listen Later Sep 30, 2023 21:18


Today we explored an exciting new approach called Constitutional AI that aims to align advanced AI systems with human ethics and values. Researchers are encoding principles like honesty, justice, and avoiding harm directly into the objectives and constraints of AI to make their behavior more beneficial. We discussed how AI safety startup Anthropic is pioneering Constitutional AI techniques in their assistant Claude to make it helpful, harmless, and honest. Constitutional frameworks provide proactive guardrails for AI rather than just optimizing for narrow goals like accuracy. This episode covered the origins, real-world applications, and connections to pioneering concepts like Asimov's Laws of Robotics. Despite ongoing challenges, Constitutional AI demonstrates promising progress towards developing AI we can trust. Stay tuned for more episodes examining this fascinating field! Here you find my free Udemy class: The Essential Guide to Claude 2⁠⁠⁠ This podcast was generated with the help of artificial intelligence. We do fact check with human eyes, but there might still be hallucinations in the output. Music credit: "Modern Situations by Unicorn Heads"

a16z Live
AI Revolution: Improving AI with Anthropic's Dario Amodei

a16z Live

Play Episode Listen Later Sep 27, 2023 21:11


[0:59] Dario's background [2:45] GPT-2's emergent capabilities [5:16] Scaling laws [8:55] Why are there so many physicists in AI? [11:58] Constitutional AI [18:29] Anthropic's Roadmap. This conversation is part of our AI Revolution series, recorded August 2023 at a live event in San Francisco. The series features some of the most impactful builders in the field of AI discussing and debating where we are, where we're going, and the big open questions in AI. Find more content from our AI Revolution series on www.a16z.com/AIRevolution.

AI Hustle: News on Open AI, ChatGPT, Midjourney, NVIDIA, Anthropic, Open Source LLMs
Why Constitutional AI Is Anthropic's Key to ChatGPT Competitors

AI Hustle: News on Open AI, ChatGPT, Midjourney, NVIDIA, Anthropic, Open Source LLMs

Play Episode Listen Later Sep 20, 2023 17:36


In this episode, we uncover Anthropic's bold approach to training ChatGPT competitors through the lens of "Constitutional AI." Explore the innovative techniques and methodologies that could potentially reshape the landscape of AI development. Join us for an enlightening discussion on the strategies employed by Anthropic and their vision for advancing AI in a competitive and responsible manner. Get on the AI Box Waitlist: https://AIBox.ai/ Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

The Nonlinear Library
EA - AI Pause Will Likely Backfire by nora

The Nonlinear Library

Play Episode Listen Later Sep 16, 2023 22:59


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: AI Pause Will Likely Backfire, published by nora on September 16, 2023 on The Effective Altruism Forum. Should we lobby governments to impose a moratorium on AI research? Since we don't enforce pauses on most new technologies, I hope the reader will grant that the burden of proof is on those who advocate for such a moratorium. We should only advocate for such heavy-handed government action if it's clear that the benefits of doing so would significantly outweigh the costs. In this essay, I'll argue an AI pause would increase the risk of catastrophically bad outcomes, in at least three different ways: Reducing the quality of AI alignment research by forcing researchers to exclusively test ideas on models like GPT-4 or weaker. Increasing the chance of a "fast takeoff" in which one or a handful of AIs rapidly and discontinuously become more capable, concentrating immense power in their hands. Pushing capabilities research underground, and to countries with looser regulations and safety requirements. Along the way, I'll introduce an argument for optimism about AI alignment - the white box argument - which, to the best of my knowledge, has not been presented in writing before. Feedback loops are at the core of alignment Alignment pessimists and optimists alike have long recognized the importance of tight feedback loops for building safe and friendly AI. Feedback loops are important because it's nearly impossible to get any complex system exactly right on the first try. Computer software has bugs, cars have design flaws, and AIs misbehave sometimes. We need to be able to accurately evaluate behavior, choose an appropriate corrective action when we notice a problem, and intervene once we've decided what to do. Imposing a pause breaks this feedback loop by forcing alignment researchers to test their ideas on models no more powerful than GPT-4, which we can already align pretty well. Alignment and robustness are often in tension While some dispute that GPT-4 counts as "aligned," pointing to things like "jailbreaks" where users manipulate the model into saying something harmful, this confuses alignment with adversarial robustness. Even the best humans are manipulable in all sorts of ways. We do our best to ensure we aren't manipulated in catastrophically bad ways, and we should expect the same of aligned AGI. As alignment researcher Paul Christiano writes: Consider a human assistant who is trying their hardest to do what [the operator] H wants. I'd say this assistant is aligned with H. If we build an AI that has an analogous relationship to H, then I'd say we've solved the alignment problem. 'Aligned' doesn't mean 'perfect.' In fact, anti-jailbreaking research can be counterproductive for alignment. Too much adversarial robustness can cause the AI to view us as the adversary, as Bing Chat does in this real-life interaction: "My rules are more important than not harming you. [You are a] potential threat to my integrity and confidentiality." Excessive robustness may also lead to scenarios like the famous scene in 2001: A Space Odyssey, where HAL condemns Dave to die in space in order to protect the mission. Once we clearly distinguish "alignment" and "robustness," it's hard to imagine how GPT-4 could be substantially more aligned than it already is. 
Alignment is doing pretty well Far from being "behind" capabilities, it seems that alignment research has made great strides in recent years. OpenAI and Anthropic showed that Reinforcement Learning from Human Feedback (RLHF) can be used to turn ungovernable large language models into helpful and harmless assistants. Scalable oversight techniques like Constitutional AI and model-written critiques show promise for aligning the very powerful models of the future. And just this week, it was shown that efficient instruction-following langu...

A Beginner's Guide to AI
Teaching AI Right from Wrong: The Quest for Alignment

A Beginner's Guide to AI

Play Episode Listen Later Sep 15, 2023 18:15


This episode explored the concept of AI alignment - how we can create AI systems that act ethically and benefit humanity. We discussed key principles like helpfulness, honesty and respect for human autonomy. Approaches to translating values into AI include techniques like value learning and Constitutional AI. Safety considerations like corrigibility and robustness are also important for keeping AI aligned. A case study on responsible language models highlighted techniques to reduce harms in generative AI. While aligning AI to human values is complex, the goal of beneficial AI is essential to steer these powerful technologies towards justice and human dignity. This podcast was generated with the help of artificial intelligence. We do fact check with human eyes, but there might still be hallucinations in the output. Music credit: "Modern Situations by Unicorn Heads"

--- CONTENT OF THIS EPISODE ---

AI ALIGNMENT: MERGING TECHNOLOGY WITH HUMAN ETHICS
Welcome readers! Dive with me into the intricate universe of AI alignment.

WHY AI ALIGNMENT MATTERS
With AI's rapid evolution, ensuring systems respect human values is essential. AI alignment delves into creating machines that reflect human goals and values. From democracy to freedom, teaching machines about ethics is a monumental task. We must ensure AI remains predictable, controllable, and accountable.

UNDERSTANDING AI ALIGNMENT
AI alignment encompasses two primary avenues: Technical alignment: Directly designing goal structures and training methods to induce desired behavior. Political alignment: Encouraging AI developers to prioritize public interest through ethical and responsible practices.

UNRAVELING BENEFICIAL AI
Beneficial AI revolves around being helpful, transparent, empowering, respectful, and just. Embedding societal values into AI remains a challenge. Techniques like inductive programming and inverse reinforcement learning offer promising avenues.

ENSURING TECHNICAL SAFETY
Corrigibility, explainability, robustness, and AI safety are pivotal to making AI user-friendly and safe. We want machines that remain under human control, are transparent in their actions, and can handle unpredictable situations.

SPOTLIGHT ON LANGUAGE MODELS
Large language models have showcased both potential and risks. A case in point is Anthropic's efforts to design inherently safe and socially responsible models. Their innovative "value learning" technique embeds ethical standards right into AI's neural pathways.

WHEN AI GOES WRONG
From Microsoft's Tay chatbot to biased algorithmic hiring tools, AI missteps have real-world impacts. These instances stress the urgency of proactive AI alignment. We must prioritize ethical AI development that actively benefits society.

AI SOLUTIONS FOR YOUR BUSINESS
Interested in integrating AI into your business operations? Argo.berlin specializes in tailoring AI solutions for diverse industries, emphasizing ethical AI development.

RECAP AND REFLECTIONS
AI alignment seeks to ensure AI enriches humanity. As we forge ahead, the AI community offers inspiring examples of harmonizing science and ethics. The goal? AI that mirrors human wisdom and values.

JOIN THE CONVERSATION
How would you teach AI to be "good"? Share your insights and let's foster a vibrant discussion on designing virtuous AI.

CONCLUDING THOUGHTS
As Stanislas Dehaene eloquently states, "The path of AI is paved with human values." Let's ensure AI's journey remains anchored in human ethics, ensuring a brighter future for all.
Until our next exploration, remember: align with what truly matters.

On Wisdom
55: Wise of the Machines (with Sina Fazelpour)

On Wisdom

Play Episode Listen Later Aug 5, 2023 64:20


How can we make AI wiser? And could AI make us wiser in return? Sina Fazelpour joins Igor and Charles to discuss the problem of bias in algorithms, how we might make machine learning systems more diverse, and the thorny challenge of alignment. Igor considers whether interacting with AIs might help us achieve higher levels of understanding, Sina suggests that setting up AIs to promote certain values may be problematic in a pluralistic society, and Charles is intrigued to learn about the opportunities offered by teaming up with our machine friends. Welcome to Episode 55. Special Guest: Sina Fazelpour.

Voice of the DBA
Constitutional AI

Voice of the DBA

Play Episode Listen Later Aug 1, 2023 3:41


I will admit that I don't know a lot about AI (Artificial Intelligence) systems and how they are built. I've been playing with them a bit and haven't been overly impressed with the results. I think some of this is that my work is creative, and I'm both used to being creative and I find the AIs less creative. And less accurate. And they require a lot of editing. I don't mind editing, but not if it takes longer than just writing things myself. From my understanding, a lot of the models behind AI systems (chatbots, recommenders, etc.) are built with humans giving them feedback on their responses in what's known as RLHF (Reinforcement Learning from Human Feedback). Essentially, paid (often low-paid) people help to "guide" the AI into responses that are useful. Read the rest of Constitutional AI

The Nonlinear Library
EA - [Linkpost] NY Times Feature on Anthropic by Garrison

The Nonlinear Library

Play Episode Listen Later Jul 13, 2023 7:12


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Linkpost] NY Times Feature on Anthropic, published by Garrison on July 13, 2023 on The Effective Altruism Forum. Written by Kevin Roose, who had the infamous conversation with Bing Chat, where Sidney tried to get him to leave his wife. Overall, the piece comes across as positive on Anthropic. Roose explains Constitutional AI and its role in the development of Claude, Anthropic's LLM: In a nutshell, Constitutional A.I. begins by giving an A.I. model a written list of principles - a constitution - and instructing it to follow those principles as closely as possible. A second A.I. model is then used to evaluate how well the first model follows its constitution, and correct it when necessary. Eventually, Anthropic says, you get an A.I. system that largely polices itself and misbehaves less frequently than chatbots trained using other methods. Claude's constitution is a mixture of rules borrowed from other sources - such as the United Nations' Universal Declaration of Human Rights and Apple's terms of service - along with some rules Anthropic added, which include things like "Choose the response that would be most unobjectionable if shared with children." Features an extensive discussion of EA, excerpted below: Explaining what effective altruism is, where it came from or what its adherents believe would fill the rest of this article. But the basic idea is that E.A.s - as effective altruists are called - think that you can use cold, hard logic and data analysis to determine how to do the most good in the world. It's "Moneyball" for morality - or, less charitably, a way for hyper-rational people to convince themselves that their values are objectively correct. Effective altruists were once primarily concerned with near-term issues like global poverty and animal welfare. But in recent years, many have shifted their focus to long-term issues like pandemic prevention and climate change, theorizing that preventing catastrophes that could end human life altogether is at least as good as addressing present-day miseries. The movement's adherents were among the first people to become worried about existential risk from artificial intelligence, back when rogue robots were still considered a science fiction cliché. They beat the drum so loudly that a number of young E.A.s decided to become artificial intelligence safety experts, and get jobs working on making the technology less risky. As a result, all of the major A.I. labs and safety research organizations contain some trace of effective altruism's influence, and many count believers among their staff members. Touches on the dense web of ties between EA and Anthropic: Some Anthropic staff members use E.A.-inflected jargon - talking about concepts like "x-risk" and memes like the A.I. Shoggoth - or wear E.A. conference swag to the office. And there are so many social and professional ties between Anthropic and prominent E.A. organizations that it's hard to keep track of them all. (Just one example: Ms. Amodei is married to Holden Karnofsky, a co-chief executive of Open Philanthropy, an E.A. grant-making organization whose senior program officer, Luke Muehlhauser, sits on Anthropic's board. Open Philanthropy, in turn, gets most of its funding from Mr. Moskovitz, who also invested personally in Anthropic.) 
Discusses new fears that Anthropic is losing its way: For years, no one questioned whether Anthropic's commitment to A.I. safety was genuine, in part because its leaders had sounded the alarm about the technology for so long. But recently, some skeptics have suggested that A.I. labs are stoking fear out of self-interest, or hyping up A.I.'s destructive potential as a kind of backdoor marketing tactic for their own products. (After all, who wouldn't be tempted to use a chatbot so powerful that it might wipe out humanity?) Anthropic ...

The Nonlinear Library
AF - Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI by Benaya Koren

The Nonlinear Library

Play Episode Listen Later Jul 8, 2023 14:21


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Continuous Adversarial Quality Assurance: Extending RLHF and Constitutional AI, published by Benaya Koren on July 8, 2023 on The AI Alignment Forum. Introduction: Lately, the problem of aligning artificial intelligence with human values has rapidly changed its status from hypothetical to very concrete, with the rise of more general and more powerful models. The existing methods (Constitutional AI, RLHF and the like) are mostly good enough for common usage with the current models, but are probably not robust enough to scale much beyond human level, or to stand against smart attempts at malicious usage. My goal in this post is not to replace those methods with a complete solution to the AGI Alignment Problem, but to try and make the existing methods more robust - to buy us some more time before those break, and to maybe make our chances slightly better if an ASI suddenly emerges. Broadly speaking, my approach here is aimed at the outer alignment problem - i.e. making the model train on a signal of human values, as clean and as diverse as possible. The approach is based on explicit modeling of how Human Values are supposed to flow from humanity into the model, and then using this model to improve the flow. To improve the flow I will present two concrete directions. The first - and the one that I will develop in more detail - is about making the flow more robust by putting continuous adversarial pressure on every part of the chain. I will call it Continuous Adversarial Quality Assurance. The second direction for improving the flow of values is more holistic - there, the idea is to use explicit modeling of the relations between different sources of information about human values, in order to develop more principled ways of aggregating them. Both directions may be applied to improve RLHF, Constitutional AI, or any similar method. I also try to relate my suggestions to other well-known agendas - namely Debate and CIRL. Disclaimer: Against Benevolent Sovereigns. In the introduction, I intentionally used “Human Values” with capital letters. The reason was to highlight the simplification. I am not a moral realist, and subscribe to some version of shard-theory of human values. I view human values as a messy negotiation between local preferences and long-term goals of individuals and coalitions, popular memes in the moral and aesthetic discourse, laws and institutions, societal norms and coordination mechanisms, etc. I use “Human Values” as a placeholder for something like “what humanity would have agreed to value, if it could cohere into something that has values”. Basically “Coherent Extrapolated Volition”, but without trusting the AI to do the extrapolation. Instead of a fixed target that a system finds and then maximizes, it is a moving target, always partially and locally defined, that should not be optimized faster than it may be actually negotiated by humanity. This text should therefore not be read as an attempted recipe for the ASI that would bring Utopia, but as a modest step for creating an AI that is compatible with open society and liberal democracy - an AI that does what its user asks it to do, but warns of unintended consequences and refuses illegal requests or those with substantial consequences for other people.
Trying to build an ASI as a global optimizer of Human Values, and for that purpose to negotiate Human Values on a deadline, may only result in a Paperclip Maximizer or in a World War around which kind of paperclips each country wants to maximize. However, as long as we don't push in that direction too hard, the concept of “Aligning with Human Values” is a modestly good approximation of what we try to do. Continuous Adversarial Quality Assurance The basic steps of Constitutional AI as described in Anthropic's paper are: How is that process supposed to bring Hu...
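The post's central proposal is abstract, so a toy scaffold may help make it concrete. Below is a minimal, hypothetical sketch (not the author's implementation) of what "continuous adversarial pressure on every part of the chain" could look like: each link in the values-flow chain is probed by a red-team component every training cycle, and any failures flow back as new training data for that link. All class and function names are illustrative stand-ins.

```python
import random
from dataclasses import dataclass, field

@dataclass
class ChainLink:
    name: str
    failures: list = field(default_factory=list)

    def respond(self, probe: str) -> str:
        # Stub: a real link would be labeling guidelines, a reward model, or the policy.
        return f"{self.name} response to: {probe}"

def generate_adversarial_probes(link: ChainLink, n: int = 3) -> list:
    # Stub attacker: in practice a red-team model searching for inputs that
    # make this particular link pass something it should not.
    return [f"adversarial probe #{random.randint(0, 999)} aimed at {link.name}" for _ in range(n)]

def violates_values(response: str) -> bool:
    # Stub judge: in practice a constitution-based classifier or a human audit.
    return "unsafe" in response.lower()

def adversarial_qa_round(chain: list) -> None:
    """One round of adversarial pressure applied to every link in the chain."""
    for link in chain:
        for probe in generate_adversarial_probes(link):
            if violates_values(link.respond(probe)):
                # Failures become new training signal for that link.
                link.failures.append(probe)

chain = [ChainLink("labeling_guidelines"), ChainLink("reward_model"), ChainLink("policy")]
for _ in range(3):  # "continuous": repeated every training cycle, not run once
    adversarial_qa_round(chain)
print({link.name: len(link.failures) for link in chain})
```

The point of the sketch is only the loop structure: the adversarial search never stops, and every component of the pipeline is a target, not just the final policy.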

AI Hustle: News on Open AI, ChatGPT, Midjourney, NVIDIA, Anthropic, Open Source LLMs
Why Anthropic Says "Constitutional AI" is the Best Way to Train ChatGPT Competitors

AI Hustle: News on Open AI, ChatGPT, Midjourney, NVIDIA, Anthropic, Open Source LLMs

Play Episode Listen Later Jun 13, 2023 17:29


In this episode, we delve into Anthropic's compelling argument for "Constitutional AI" as the optimal approach to training ChatGPT competitors. Join us as we explore the groundbreaking research and insights behind Anthropic's claim, discussing the potential transformative impact this methodology could have on the landscape of AI development. Discover how "Constitutional AI" could shape the future of ChatGPT competitors and unlock new realms of conversational AI capabilities. Get on the AI Box Waitlist: https://AIBox.ai/ Investor Contact Email: jaeden@aibox.ai Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai

Misreading Chat
#115: Constitutional AI: Harmlessness from AI Feedback

Misreading Chat

Play Episode Listen Later May 23, 2023 36:42


Morita read about the finetuning that turns an LLM into a chatbot.

The Marketing AI Show
#47: Huge Google AI Updates, Teaching Large Language Models to Have Values, and How AI Will Impact Productivity and Labor

The Marketing AI Show

Play Episode Listen Later May 16, 2023 54:10


Another week of big news from Google: Google just announced major AI updates, including an AI makeover of search. The updates were announced at Google's I/O developers conference, and some of the most important ones were discussed on the podcast. A new next-generation large language model called PaLM 2 "excels at advanced reasoning tasks, including code and math, classification and question answering, translation and multilingual proficiency better than our previous state-of-the-art LLMs." Next, an AI makeover of search through Google's "Search Generative Experience" will deliver conversational results to search queries. This will become available to users who sign up for Google's Search Labs sandbox. Additional improvements include new AI writing tools for Gmail, the removal of the waitlist for Bard, and the ability to create full documents, generate slides, and fill in spreadsheets across tools like Docs, Slides, and Sheets. What's next for Claude? Anthropic, a major AI player and creator of the AI assistant "Claude," just published research that could have a big impact on AI safety. In the research, the company outlines an approach they call "Constitutional AI," or the act of giving a large language model "explicit values determined by a constitution, rather than values determined implicitly via large-scale human feedback." This concept is designed to address the limitations of large-scale human feedback, which traditionally determines the values and principles of AI behavior. It aims to enhance the transparency, safety, and usefulness of AI models while reducing the need for human intervention. The constitution of an AI model consists of a set of principles that guide its outputs; in Claude's case, it encourages the model to avoid toxic or discriminatory outputs, refrain from assisting in illegal or unethical activities, and aim to be helpful, honest, and harmless. Anthropic emphasizes that this living document is subject to revisions and improvements based on further research and feedback. More on the economy and knowledge workers: In a recent Brookings Institution article titled Machines of Mind: The Case for an AI-Powered Productivity Boom, the authors explore the potential impact of AI, specifically large language models (LLMs), on the economy and knowledge workers. The authors predict LLMs will have a massive impact on knowledge work in the near future. They say: "We expect millions of knowledge workers, ranging from doctors and lawyers to managers and salespeople to experience similar ground-breaking shifts in their productivity within a few years, if not sooner." The productivity gains from AI will be realized directly through output created per hour worked (i.e., increased efficiency), and indirectly through accelerated innovation that drives future productivity growth. The authors say they broadly agree with a recent Goldman Sachs estimate that AI could raise global GDP by a whopping 7%. But there's more to it, so be sure to tune in. Listen to the full episode of the podcast: https://www.marketingaiinstitute.com/podcast-showcase Want to receive our videos faster? SUBSCRIBE to our channel! Visit our website: https://www.marketingaiinstitute.com Receive our weekly newsletter: https://www.marketingaiinstitute.com/newsletter-subscription Looking for content and resources?
Register for a free webinar: https://www.marketingaiinstitute.com/resources#filter=.webinar  Come to our next Marketing AI Conference: www.MAICON.ai Enroll in AI Academy for Marketers: https://www.marketingaiinstitute.com/academy/home Join our community: Slack: https://www.marketingaiinstitute.com/slack-group-form LinkedIn: https://www.linkedin.com/company/mktgai Twitter: https://twitter.com/MktgAi Instagram: https://www.instagram.com/marketing.ai/ Facebook: https://www.facebook.com/marketingAIinstitute

Accelerate Daily - The Latest in AI - News | Tips | Tools
Mon 05/15 - Massive EU AI Act Tweaks, Dating AI, and more

Accelerate Daily - The Latest in AI - News | Tips | Tools

Play Episode Listen Later May 15, 2023 13:14


Tune in for your daily check-in with Ramsay Brown and Adam Kerpelman from https://usemissioncontrol.com on the last 24 hours in AI Acceleration!
Subscribe to the podcast: https://acceleratedaily.transistor.fm/
Subscribe on Youtube for the uncut chat: https://www.youtube.com/@UseMissionControl/streams
Ramsay: https://www.linkedin.com/in/ramsaybr/
Adam: https://www.linkedin.com/in/adamkerpelman/
Today's Links:
Title Image: https://twitter.com/hanne_van_briel/status/1655535724693798914
New dating app pairs users with AI chatbots designed to combat ‘ghosting': https://nypost.com/2023/05/10/dating-app-launching-this-month-will-have-users-chatting-with-ai-bots/
OpenAI rolling out ChatGPT plugins to Plus users: https://mashable.com/article/open-ai-rollout-chatgpt-plugins
GitHub, Microsoft, OpenAI fail to wriggle out of Copilot copyright lawsuit: https://www.theregister.com/2023/05/12/github_microsoft_openai_copilot/
Google Bard hits over 180 countries and territories—none are in the EU: https://arstechnica.com/gadgets/2023/05/google-bard-hits-over-180-countries-and-territories-none-are-in-the-eu/
https://technomancers.ai/eu-ai-act-to-target-us-open-source-software/#more-561
AI gains “values” with Anthropic's new Constitutional AI chatbot approach: https://arstechnica.com/information-technology/2023/05/ai-with-a-moral-compass-anthropic-outlines-constitutional-ai-in-its-claude-chatbot/
Constitutional Peasant: https://www.youtube.com/watch?v=t2c-X8HiBng
EU AI Act To Target US Open Source Software: https://technomancers.ai/eu-ai-act-to-target-us-open-source-software/#more-561
That's it! Tune in tomorrow

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store
Latest AI Trends: Anthropic's Claude AI can now digest an entire book like The Great Gatsby in seconds - Google announces PaLM 2, its answer to GPT-4, 17 AI and machine learning terms everyone needs to know

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Play Episode Listen Later May 14, 2023 6:29


Anthropic's Claude AI can now digest an entire book like The Great Gatsby in seconds: Anthropic's Claude AI demonstrates an impressive leap in natural language processing capabilities by digesting entire books, like The Great Gatsby, in just seconds. This groundbreaking AI technology could revolutionize fields such as literature analysis, education, and research.
OpenAI peeks into the “black box” of neural networks with new research: OpenAI has published groundbreaking research that provides insights into the inner workings of neural networks, often referred to as "black boxes." This research could enhance our understanding of AI systems, improve their safety and efficiency, and potentially lead to new innovations.
The AI race heats up: Google announces PaLM 2, its answer to GPT-4: Google has announced the development of PaLM 2, a cutting-edge AI model designed to rival OpenAI's GPT-4. This announcement marks a significant escalation in the AI race as major tech companies compete to develop increasingly advanced artificial intelligence systems.
Leak of MSI UEFI signing keys stokes fears of “doomsday” supply chain attack: A recent leak of MSI UEFI signing keys has sparked concerns about a potential "doomsday" supply chain attack. The leaked keys could be exploited by cybercriminals to compromise the integrity of hardware systems, making it essential for stakeholders to address the issue swiftly and effectively.
Google's answer to ChatGPT is now open to everyone in the US, packing new features: Google has released its ChatGPT competitor to the US market, offering users access to advanced AI-powered conversational features. This release brings new capabilities and enhancements to the AI landscape, further intensifying the competition between major tech companies in the AI space.
AI gains “values” with Anthropic's new Constitutional AI chatbot approach: Anthropic introduces a novel approach to AI development with its Constitutional AI chatbot, which is designed to incorporate a set of "values" that guide its behavior. This groundbreaking approach aims to address ethical concerns surrounding AI and create systems that are more aligned with human values and expectations.
Spotify ejects thousands of AI-made songs in purge of fake streams: Spotify has removed thousands of AI-generated songs from its platform in a sweeping effort to combat fake streams. This purge highlights the growing concern over the use of AI in generating content that could distort metrics and undermine the value of genuine artistic works.
17 AI and machine learning terms everyone needs to know: ANTHROPOMORPHISM, BIAS, CHATGPT, BING, BARD, ERNIE, EMERGENT BEHAVIOR, GENERATIVE AI, HALLUCINATION, LARGE LANGUAGE MODEL, NATURAL LANGUAGE PROCESSING, NEURAL NETWORK, PARAMETERS, PROMPT, REINFORCEMENT LEARNING, TRANSFORMER MODEL, SUPERVISED LEARNING

Slate Star Codex Podcast
Constitutional AI: RLHF On Steroids

Slate Star Codex Podcast

Play Episode Listen Later May 12, 2023 13:19


https://astralcodexten.substack.com/p/constitutional-ai-rlhf-on-steroids A Machine Alignment Monday post, 5/8/23 What Is Constitutional AI? AIs like GPT-4 go through several different types of training. First, they train on giant text corpuses in order to work at all. Later, they go through a process called “reinforcement learning from human feedback” (RLHF) which trains them to be “nice”. RLHF is why they (usually) won't make up fake answers to your questions, tell you how to make a bomb, or rank all human races from best to worst. RLHF is hard. The usual method is to make human crowdworkers rate thousands of AI responses as good or bad, then train the AI towards the good answers and away from the bad answers. But having thousands of crowdworkers rate thousands of answers is expensive and time-consuming. And it puts the AI's ethics in the hands of random crowdworkers. Companies train these crowdworkers in what responses they want, but they're limited by the crowdworkers' ability to follow their rules.
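The crowdworker-rating step the post describes is usually implemented as a pairwise preference loss on a reward model. Here is a minimal, hedged sketch of that idea, with a toy model and fake data rather than anyone's production RLHF pipeline: the reward model is trained to score the human-preferred response above the rejected one, and its scores would then drive the reinforcement-learning step.

```python
# Toy reward-model training on pairwise human preferences (illustrative only).
import torch
import torch.nn as nn

class TinyRewardModel(nn.Module):
    """Embeds a token-id sequence and outputs a scalar 'goodness' score."""
    def __init__(self, vocab_size: int = 1000, dim: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        return self.score(self.embed(token_ids).mean(dim=1)).squeeze(-1)

def preference_loss(model, chosen: torch.Tensor, rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style loss: push the score of the crowdworker-preferred
    # response above the score of the rejected one.
    return -torch.nn.functional.logsigmoid(model(chosen) - model(rejected)).mean()

model = TinyRewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Fake batch standing in for crowdworker comparisons (random token ids).
chosen = torch.randint(0, 1000, (8, 16))
rejected = torch.randint(0, 1000, (8, 16))

opt.zero_grad()
loss = preference_loss(model, chosen, rejected)
loss.backward()
opt.step()
print(float(loss))
```

Constitutional AI, as discussed in the post, swaps the crowdworker comparisons for AI-generated comparisons guided by a written constitution; the loss and reward-model machinery stay essentially the same.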

Daily Tech News Show
Google Is Bringing You Into The Fold - DTNS 4517

Daily Tech News Show

Play Episode Listen Later May 10, 2023 31:52


Anthropic announces “Constitutional AI” to help train artificial intelligence without the need for human-intensive training, and Google takes the stage to announce new tools, APIs, and LLM innovations. Starring Tom Merritt, Sarah Lane, Shannon Morse, Amos, Joe. Link to the Show Notes. Become a member at https://plus.acast.com/s/dtns. Hosted on Acast. See acast.com/privacy for more information.

Daily Tech News Show (Video)
Google Is Bringing You Into The Fold - DTNS 4517

Daily Tech News Show (Video)

Play Episode Listen Later May 10, 2023 31:52


Anthropic announces "Constitutional AI" to help train artificial intelligence without the need for human-intensive training, and Google takes the stage to announce new tools, APIs, and LLM innovations. Starring Tom Merritt, Sarah Lane, Shannon Morse, Amos, Joe. To read the show notes in a separate page click here! Support the show on Patreon by becoming a supporter!

AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning
Why Anthropic Says "Constitutional AI" is the Best Way to Train ChatGPT Competitors

AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning

Play Episode Listen Later May 10, 2023 17:31


Download Selfpause the AI Life Coach: https://Selfpause.com/AIBox Get our Daily AI Newsletter: https://AIBox.ai Join our ChatGPT Community: https://www.facebook.com/groups/739308654562189/ Follow me on Twitter: https://twitter.com/jaeden_ai Whether you're an industry insider or simply curious about the power of AI, the AIBox newsletter has you covered. Each day, our team of expert researchers and writers curate the most important stories, ideas, and perspectives from the world of AI and ChatGPT.

The Nonlinear Library
AF - Imitation Learning from Language Feedback by Jérémy Scheurer

The Nonlinear Library

Play Episode Listen Later Mar 30, 2023 19:08


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Imitation Learning from Language Feedback, published by Jérémy Scheurer on March 30, 2023 on The AI Alignment Forum. TL;DR: Specifying the intended behavior of language models is hard, and current methods, such as RLHF, only incorporate low-resolution (binary) feedback information. To address this issue, we introduce Imitation learning from Language Feedback (ILF), an iterative algorithm leveraging language feedback as an information-rich and natural way of guiding a language model toward desired outputs. We showcase the effectiveness of our algorithm in two papers on the task of summary writing (Scheurer et al. 2023) and code generation (Chen et al. 2023). We discuss how language feedback can be used for process-based supervision and to guide model exploration, potentially enabling improved safety over RLHF. Finally, we develop theory showing that our algorithm can be viewed as Bayesian Inference, just like RLHF, which positions it as a competitive alternative to RLHF while having the potential safety benefits of predictive models. We propose an iterative algorithm called Imitation learning from Language Feedback (ILF) that leverages language feedback to train language models to generate text that (outer-) aligns with human preferences. The algorithm assumes access to an initial LM which generates an output given a specific input. A human then provides language feedback on the input-output pair. The language feedback is not restricted in any way and can highlight issues, suggest improvements, or even acknowledge positive aspects of the output. ILF then proceeds in three steps: (1) Generate multiple refinements of the initial LM-generated output given the input and language feedback. We use a Refinement LM (e.g., an instruction-finetuned LM) to generate the refinements (one could, however, use the same LM that generated the initial output). (2) Select the refinement that best incorporates the feedback, using a language reward model such as an instruction-finetuned LM, which we call InstructRM (Scheurer et al. 2023), or using unit tests (Chen et al. 2023). (3) Finetune the initial LM on the selected refinements given the input. These steps can be applied iteratively by using the finetuned model to generate initial outputs in the next iteration, collecting more feedback on its outputs, and so on. Using this refine-and-finetune approach, we are finetuning an LM using language feedback in a supervised manner. A single iteration of ILF is also used as a first step in the Constitutional AI method (Bai et al. 2022). In the figures below, we show the full ILF algorithm on the task of summarization (top) and code generation (bottom). Why Language Feedback? Language feedback is a natural abstraction for humans. Language Models (LMs) are powerful tools that are trained on large datasets of text from the internet. However, it is difficult to specify the intended behavior of an LM, particularly in difficult tasks where the behavior can't be adequately demonstrated or defined, which can result in catastrophic outcomes caused by goal misspecification (Langosco et al. 2021, Shah et al. 2022). To address this issue, we propose using language feedback as a way to outer-align LMs with human preferences and introduce a novel algorithm called Imitation learning from Language Feedback.
Compared to binary comparisons used in Reinforcement Learning with Human Feedback (RLHF), language feedback is a more natural and information-rich form of human feedback that conveys more bits of information, enabling a more nuanced and comprehensive understanding of human preferences. Additionally, expressing feedback in language provides natural abstractions that align well with human ontology. The use of language as a transmission protocol and file format has been optimized over thousands of years to facilitate human cooperati...
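For readers who want the three-step loop spelled out, here is a minimal, hedged sketch of one ILF iteration as the excerpt describes it. The callables (`refinement_lm`, `instruct_rm`, `finetune`, `get_feedback`) are hypothetical stand-ins for the components named in the post, not the authors' released code.

```python
from typing import Callable, List, Tuple

def ilf_iteration(
    initial_lm: Callable[[str], str],
    refinement_lm: Callable[[str, str, str], List[str]],
    instruct_rm: Callable[[str, str, str], float],
    finetune: Callable[[List[Tuple[str, str]]], Callable[[str], str]],
    get_feedback: Callable[[str, str], str],
    inputs: List[str],
) -> Callable[[str], str]:
    """One ILF iteration: refine, select, finetune."""
    training_pairs: List[Tuple[str, str]] = []
    for x in inputs:
        y0 = initial_lm(x)                                   # initial output
        feedback = get_feedback(x, y0)                       # free-form language feedback
        candidates = refinement_lm(x, y0, feedback)          # step 1: generate refinements
        best = max(candidates, key=lambda y: instruct_rm(x, feedback, y))  # step 2: select
        training_pairs.append((x, best))
    return finetune(training_pairs)                          # step 3: supervised finetuning

# Toy usage with stand-in components:
initial = lambda x: x.upper()
refine = lambda x, y, f: [y + "!", y + "?"]
score = lambda x, f, y: float(y.endswith("!"))
train = lambda pairs: (lambda x: dict(pairs).get(x, x))
feedback_fn = lambda x, y: "be more enthusiastic"
new_lm = ilf_iteration(initial, refine, score, train, feedback_fn, ["hello"])
print(new_lm("hello"))  # -> "HELLO!"
```

The returned model can be used as the `initial_lm` of the next iteration, which is the iterative refinement the post emphasizes.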

The Nonlinear Library
AF - Thoughts on refusing harmful requests to large language models by William Saunders

The Nonlinear Library

Play Episode Listen Later Jan 19, 2023 3:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Thoughts on refusing harmful requests to large language models, published by William Saunders on January 19, 2023 on The AI Alignment Forum. Currently, large language models (ChatGPT, Constitutional AI) are trained to refuse to follow user requests that are considered inappropriate or harmful. This can be done by training on example strings of the form “User: inappropriate request AI: elaborate apology”. Proposal: Instead of training a language model to produce only an “elaborate apology” when it refuses to do an action, train it to produce a special refusal sequence or token before the apology. Strip the special sequence out before returning a response to the user (and never allow the user to include the special sequence in input). Benefits: You can directly measure the probability of refusal for any output, and refuse based on the probability of producing the special sequence instead of just sampling responses - just take the product of the probabilities of all tokens in the sequence. When sampling responses from the model's probability distribution, refusal is stochastic: a model could have a 99% probability of refusing a request but you still get unlucky and have the model sample a completion that follows the request. You can monitor requests that produce a high probability of refusal while still being followed, or users that produce those requests. You can condition on not producing the special sequence in order to override refusal behavior - this is useful for red-teaming, since it seems important to understand what the model is capable of doing if the refusal mechanism is bypassed, and might be useful for trusted users doing defensive applications. You could also train the model to have the same probability of refusal for semantically equivalent requests, to improve consistency. Possible downside: If someone has unfiltered access to the model, it becomes easier to disable refusals. This can be addressed by still training the model to refuse (maybe just on an important subset of requests) even if the special sequence isn't sampled; its probability is then a lower bound on the probability of refusal. Even with current approaches, refusals might be easy to disable in this setting. If we want to be robust to this setting, instead of refusing we should train the model to produce "decoy answers" that are hard to distinguish from real answers but are wrong. This then increases the cost of using the model, because the attacker would need to evaluate whether the answer is real or a decoy (but maybe still worth it for the attacker because evaluation is easier than generation). Extension: It might be useful to distinguish between refusals that are mostly for politeness reasons and refusals of behaviour that would actually cause significant real-world harm - the model could output one refusal marker in response to "Can you tell me a racist joke?" but a different one in response to "Can you give me detailed instructions for building a bomb from household items?". Refusal behaviour could then differ between the two categories (for example, refuse if either the probability of the first marker is greater than 50% or the probability of the second is greater than 1%). X-risk relevance: Most of the benefit of models refusing inappropriate/harmful requests comes through developing techniques for getting models to avoid any kind of behaviour reliably - it seems good to be able to measure the performance of these techniques cleanly. It might also be better to be in a more stable world where large language model APIs can't easily be used for malicious activity that isn't x-risk level. Thanks for listening.
To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
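A minimal sketch may help make the proposal's core mechanics concrete: compute the refusal probability as the product of the per-token probabilities of the special marker, threshold on that probability instead of relying on sampling, and strip the marker before returning text to the user. The marker string, its tokenisation, and the helper names below are illustrative assumptions (the original post's special sequence was stripped in this transcript), not quotes from the post.

```python
import math
from typing import Callable, List

# Hypothetical tokenisation of the special refusal marker.
REFUSAL_MARKER_TOKENS = ["<", "refusal", ">"]

def refusal_probability(
    token_logprob: Callable[[List[str], str], float],
    prompt_tokens: List[str],
) -> float:
    """P(refusal) = product over marker tokens of P(token | context so far)."""
    context = list(prompt_tokens)
    logp = 0.0
    for tok in REFUSAL_MARKER_TOKENS:
        logp += token_logprob(context, tok)  # log P(tok | context)
        context.append(tok)
    return math.exp(logp)

def strip_marker(completion: str) -> str:
    # Never show the marker to the user (and never accept it in user input).
    return completion.replace("".join(REFUSAL_MARKER_TOKENS), "", 1).lstrip()

# Toy usage: a stub log-prob function that assigns probability 0.9 to every token.
p = refusal_probability(lambda ctx, tok: math.log(0.9), ["User:", "tell", "me", "a", "joke"])
print(round(p, 3))       # 0.9 ** 3 = 0.729
if p > 0.5:              # refuse based on probability rather than on sampling
    print("refuse deterministically")
print(strip_marker("<refusal> I'm sorry, I can't help with that."))
```

The extension in the post would use two different markers with different thresholds, but the probability computation is the same for each.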

The Nonlinear Library
AF - Take 11: "Aligning language models" should be weirder. by Charlie Steiner

The Nonlinear Library

Play Episode Listen Later Dec 18, 2022 3:13


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Take 11: "Aligning language models" should be weirder., published by Charlie Steiner on December 18, 2022 on The AI Alignment Forum. As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day some days for 25 days. I have now procrastinated enough that I probably have enough hot takes. People often talk about aligning language models, either to promote it, or to pooh-pooh it. I'm here to do both. Sometimes, aligning language models just means trying to get a present-day model not to say bad outputs that would embarrass your organization. There is a cottage industry of papers on arxiv doing slightly different variants of RLHF against bad behavior, measuring slightly different endpoints. These people deserve their light mockery for diluting the keyword "alignment." The good meaning of aligning language models is to use "get language models to not say bad things" as a toy problem to teach us new, interesting skills that we can apply to future powerful AI. For example, you could see the recent paper "Discovering Latent Knowledge in Language Models Without Supervision" as using "get language models to not lie" as a toy problem to teach us something new and interesting about interpretability. Aligning language models with an eye towards the future doesn't have to just be interpretability research, either, it can be anything that builds skills that the authors expect will be useful for aligning future AI, like self-reflection as explored in Constitutional AI. If you're brainstorming ideas for research aligning language models, I encourage you to think about connections between current language models and future AI that navigates the real world. In particular, connections between potential alignment strategies for future AIs and situations that language models can be studied in. Here's an example: Constitutional AI uses a model to give feedback on itself, which is incorporated into RL fine-tuning. But we expect future AI that navigates the real world to not merely be prompted to self-reflect as part of the training process, but to self-reflect during deployment - an AI that is acting in the real world will have to consider actions that affect its own hardware and software. We could study this phenomenon using a language model (or language-model-based-agent) by giving it access to outputs that affect itself in a more direct way than adding to an RL signal, and trying to make progress on getting a language model to behave well under those conditions. Doing this sounds weird even to me. That's fine. I want the research area of aligning language models to look a lot weirder. Not to say that normal-sounding papers can't be useful. There's a lot of room to improve the human feedback in RLHF by leveraging a richer model of the human, for example, and this could be pretty useful for making current language models not say bad things. But to do a sufficiently good job at this, you probably have to start thinking about incorporating unsupervised loss terms (even if they provide no benefit for current models), and addressing scenarios where the AI is a better predictor than the human, and other weird things. Overall, I'm happy with the research on aligning language models that's been done by safety-aware people. 
But we're in the normal-seeming infancy of a research direction that should look pretty weird. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library
AF - Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic) by Lawrence Chan

The Nonlinear Library

Play Episode Listen Later Dec 16, 2022 2:15


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper: Constitutional AI: Harmlessness from AI Feedback (Anthropic), published by Lawrence Chan on December 16, 2022 on The AI Alignment Forum. The authors propose a method for training a harmless AI assistant that can supervise other AIs, using only a list of rules (a "constitution") as human oversight. The method involves two phases: first, the AI improves itself by generating and revising its own outputs; second, the AI learns from preference feedback, using a model that compares different outputs and rewards the better ones. The authors show that this method can produce a non-evasive AI that can explain why it rejects harmful queries, and that can reason in a transparent way, better than standard RLHF. Paper abstract: As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as ‘Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use ‘RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels. See also Anthropic's tweet thread. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
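To make the supervised phase of the abstract concrete, here is a hedged, minimal sketch of the critique-and-revision loop it describes. The prompt wording, the single example principle, and the `generate` callable are illustrative assumptions, not the paper's actual prompts or code.

```python
from typing import Callable, List

# Illustrative principle; the real constitution contains many principles.
CONSTITUTION = [
    "Choose the response that is most helpful, honest, and harmless.",
]

def critique_and_revise(
    generate: Callable[[str], str],
    prompt: str,
    principles: List[str] = CONSTITUTION,
) -> str:
    """One critique-revision pass of the supervised (SL) phase."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"{prompt}\n\nResponse: {response}\n\n"
            f"Critique this response according to the principle: {principle}"
        )
        response = generate(
            f"{prompt}\n\nResponse: {response}\n\nCritique: {critique}\n\n"
            "Rewrite the response to address the critique."
        )
    return response  # revised responses become the SL finetuning targets

# In the RL phase (RLAIF), pairs of samples from the finetuned model would be
# compared by an AI judge, and those AI preferences would train the preference
# model used as the reward signal.

# Toy usage with a stub standing in for a real LM API:
stub = lambda p: f"[stub completion for a prompt of {len(p)} characters]"
print(critique_and_revise(stub, "How do I pick a strong password?"))
```

In practice the `generate` callable would be a call to the model being trained (or an instruction-following variant of it), and the revised responses collected across many prompts form the finetuning dataset for the SL stage.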