Podcasts about Interpretability

  • 83 PODCASTS
  • 357 EPISODES
  • 37m AVG DURATION
  • 1 EPISODE EVERY OTHER WEEK
  • May 30, 2025 LATEST

POPULARITY

[Popularity chart: 2017-2024]


Best podcasts about Interpretability

Latest podcast episodes about Interpretability

Machine Learning Guide
MLG 036 Autoencoders

Machine Learning Guide

Play Episode Listen Later May 30, 2025 65:55


Autoencoders are neural networks that compress data into a smaller "code," enabling dimensionality reduction, data cleaning, and lossy compression by reconstructing original inputs from this code. Advanced autoencoder types, such as denoising, sparse, and variational autoencoders, extend these concepts for applications in generative modeling, interpretability, and synthetic data generation.

Links: Notes and resources at ocdevel.com/mlg/36. Try a walking desk - stay healthy & sharp while you learn & code. Build the future of multi-agent software with AGNTCY. Thanks to T.J. Wilder from intrep.io for recording this episode!

Fundamentals of Autoencoders
Autoencoders are neural networks designed to reconstruct their input data by passing it through a compressed intermediate representation called a "code." The architecture typically follows an hourglass shape: a wide input and output separated by a narrower bottleneck layer that enforces information compression. The encoder compresses input data into the code, while the decoder reconstructs the original input from this code.

Comparison with Supervised Learning
Unlike traditional supervised learning, where the output differs from the input (e.g., image classification), autoencoders use the same vector for both input and output.

Use Cases: Dimensionality Reduction and Representation
Autoencoders perform dimensionality reduction by learning compressed forms of high-dimensional data, making it easier to visualize and process data with many features. The compressed code can be used for clustering, visualization in 2D or 3D graphs, and input into subsequent machine learning models, saving computational resources and improving scalability.

Feature Learning and Embeddings
Autoencoders enable feature learning by extracting abstract representations from the input data, similar in concept to learned embeddings in large language models (LLMs). While effective for many data types, autoencoder-based encodings are less suited to variable-length text than LLM embeddings.

Data Search, Clustering, and Compression
By reducing dimensionality, autoencoders facilitate vector search, efficient clustering, and similarity retrieval. The compressed codes enable lossy compression analogous to audio codecs like MP3, with the difference that autoencoders lack domain-specific optimizations for preserving perceptually important data.

Reconstruction Fidelity and Loss Types
Loss functions in autoencoders compare reconstructed outputs to the original inputs, with the loss type chosen to match the input variable types (e.g., Boolean vs. continuous). Compression via autoencoders is typically lossy: some information is lost during reconstruction, and which information is lost cannot be easily controlled.

Outlier Detection and Noise Reduction
Since reconstruction errors tend to move data toward the mean, autoencoders can be used to reduce noise and to identify outliers. Large reconstruction errors can signal atypical or outlier samples in the dataset.

Denoising Autoencoders
Denoising autoencoders are trained to reconstruct clean data from noisy inputs, making them valuable for image and audio denoising as well as signal smoothing. Iterative denoising forms the basis of diffusion models, where repeated application of a denoising autoencoder gradually turns random noise into structured output.

Data Imputation
Autoencoders can aid data imputation by filling in missing values: training on complete records and reconstructing missing entries for incomplete records using the learned code representations. This approach leverages the model's propensity to output "plausible" values learned from the overall data structure.

Cryptographic Analogy
The separation of encoding and decoding draws parallels to encryption and decryption, though autoencoders are neither intended nor suitable for secure communication due to their inherent lossiness.

Advanced Architectures: Sparse and Overcomplete Autoencoders
Sparse autoencoders use constraints to encourage code representations with only a few active values, increasing interpretability and explainability. Overcomplete autoencoders have a code size larger than the input, often used in applications that require extracting distinct, interpretable features from complex model states.

Interpretability and Research Example
Research such as Anthropic's "Towards Monosemanticity" applies sparse autoencoders to the internal activations of language models to identify interpretable features correlated with concrete linguistic or semantic concepts. These models can be used to monitor and potentially control model behavior (e.g., detecting specific language usage or enforcing safety constraints) by manipulating feature activations.

Variational Autoencoders (VAEs)
VAEs extend the autoencoder architecture by encoding inputs as distributions (means and standard deviations) instead of point values, enforcing a continuous, normalized code space. Decoding from sampled points within this space enables synthetic data generation, since any point near the center of the code space corresponds to plausible data according to the model.

VAEs for Synthetic Data and Rare Event Amplification
VAEs are powerful in domains with sparse data or rare events (e.g., healthcare), allowing generation of synthetic samples that represent underrepresented cases. They can improve model performance by augmenting datasets without requiring changes to existing model pipelines.

Conditional Generative Techniques
Conditional autoencoders extend VAEs by allowing controlled generation based on specified conditions (e.g., generating a house with a pool), through additional decoder inputs and conditional loss terms.

Practical Considerations and Limitations
Training autoencoders and their variants requires computational resources, and their stochastic training can produce different code representations across runs. Lossy reconstruction, the lack of domain-specific optimizations, and limited code interpretability restrict some use cases, particularly where exact data preservation or meaningful decompositions are required.
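To make the encoder/bottleneck/decoder structure and the reconstruction loss described in these notes concrete, here is a minimal PyTorch sketch (not from the episode); the layer sizes, the optional L1 penalty that nudges the code toward sparsity, and the random stand-in batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Hourglass autoencoder: wide input and output, narrow bottleneck 'code'."""
    def __init__(self, input_dim=784, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, code_dim),              # compressed "code"
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),             # reconstruction of the input
        )

    def forward(self, x):
        code = self.encoder(x)
        return self.decoder(code), code

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.rand(64, 784)                            # stand-in batch; input is also the target

recon, code = model(x)
loss = nn.functional.mse_loss(recon, x)            # reconstruction loss for continuous inputs
loss = loss + 1e-3 * code.abs().mean()             # optional L1 term -> sparser code
loss.backward()
opt.step()
```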

LessWrong Curated Podcast
“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda

LessWrong Curated Podcast

Play Episode Listen Later May 5, 2025 13:15


(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.) TL;DR: I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise. Interpretability seems a valuable tool here and remains worth investing in, as it will hopefully increase the reliability we can achieve. However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy. It is not the one thing that will save us, and it still won't be enough for high reliability. Introduction: There's a common, often implicit, argument made in AI safety discussions: interpretability is presented as the only reliable path forward for detecting deception in advanced AI - among many other sources it was argued for in [...] --- Outline: (00:55) Introduction (02:57) High Reliability Seems Unattainable (05:12) Why Won't Interpretability be Reliable? (07:47) The Potential of Black-Box Methods (08:48) The Role of Interpretability (12:02) Conclusion. The original text contained 5 footnotes which were omitted from this narration. --- First published: May 4th, 2025 Source: https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai --- Narrated by TYPE III AUDIO.

AI Unraveled: Latest AI News & Trends, Master GPT, Gemini, Generative AI, LLMs, Prompting, GPT Store

Key themes include technological competition and national self-reliance with Huawei and China challenging Nvidia and US dominance in AI chips, and major product updates and releases from companies like Baidu, OpenAI, and Grok introducing new AI models and features. The text also highlights innovative applications of AI, from Neuralink's brain implants restoring communication and Waymo considering selling robotaxis directly to consumers, to creative uses like generating action figures and integrating AI into religious practices. Finally, the sources touch on important considerations surrounding AI, such as the need for interpretability to ensure safety, the increasing sophistication of AI-powered scams, and discussions on the military implications and future potential of AGI.

Mixture of Experts
OpenAI goes open, Anthropic on interpretability, Apple Intelligence updates and Amazon AI agents

Mixture of Experts

Play Episode Listen Later Apr 4, 2025 43:25


Will OpenAI be fully open source by 2027? In episode 49 of Mixture of Experts, host Tim Hwang is joined by Aaron Baughman, Ash Minhas and Chris Hay to analyze Sam Altman's latest move towards open source. Next, we explore Anthropic's mechanistic interpretability results and the progress the AI research community is making. Then, can Apple catch up? We analyze the latest critiques on Apple Intelligence. Finally, Amazon enters the chat with AI agents. How does this elevate the competition? All that and more on today's Mixture of Experts. 00:01 -- Introduction 00:48 -- OpenAI goes open 11:36 -- Anthropic interpretability results 24:55 -- Daring Fireball on Apple Intelligence 34:22 -- Amazon's AI agents. The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity. Subscribe for AI updates: https://www.ibm.com/account/reg/us-en/signup?formid=news-urx-52120 Learn more about artificial intelligence → https://www.ibm.com/think/artificial-intelligence Visit Mixture of Experts podcast page to learn more AI content → https://www.ibm.com/think/podcasts/mixture-of-experts

AXRP - the AI X-risk Research Podcast
40 - Jason Gross on Compact Proofs and Interpretability

AXRP - the AI X-risk Research Podcast

Play Episode Listen Later Mar 28, 2025 156:05


How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html   Topics we discuss, and timestamps: 0:00:40 - Why compact proofs 0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability 0:14:19 - What compact proofs look like 0:32:43 - Structureless noise, and why proofs 0:48:23 - What we've learned about compact proofs in general 0:59:02 - Generalizing 'symmetry' 1:11:24 - Grading mechanistic interpretability 1:43:34 - What helps compact proofs 1:51:08 - The limits of compact proofs 2:07:33 - Guaranteed safe AI, and AI for guaranteed safety 2:27:44 - Jason and Rajashree's start-up 2:34:19 - Following Jason's work   Links to Jason: Github: https://github.com/jasongross Website: https://jasongross.github.io Alignment Forum: https://www.alignmentforum.org/users/jason-gross   Links to work we discuss: Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779 Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476 Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773 Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926 Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf     Episode art by Hamish Doodles: hamishdoodles.com

LessWrong Curated Podcast
“Auditing language models for hidden objectives” by Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Akbir Khan, Euan Ong, Christopher Olah, Fabien Roger, Meg, Drake Thomas, Adam Jermyn, Monte M, evhub

LessWrong Curated Podcast

Play Episode Listen Later Mar 16, 2025 24:14


We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it. This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams. Abstract: We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...] --- Outline: (00:26) Abstract (01:48) Twitter thread (04:55) Blog post (07:55) Training a language model with a hidden objective (11:00) A blind auditing game (15:29) Alignment auditing techniques (15:55) Turning the model against itself (17:52) How much does AI interpretability help? (22:49) Conclusion (23:37) Join our team. The original text contained 5 images which were described by AI. --- First published: March 13th, 2025 Source: https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives --- Narrated by TYPE III AUDIO.

Machine Learning Street Talk
Clement Bonnet - Can Latent Program Networks Solve Abstract Reasoning?

Machine Learning Street Talk

Play Episode Listen Later Feb 19, 2025 51:26


Clement Bonnet discusses his novel approach to the ARC (Abstraction and Reasoning Corpus) challenge. Unlike approaches that rely on fine-tuning LLMs or generating samples at inference time, Clement's method encodes input-output pairs into a latent space, optimizes this representation with a search algorithm, and decodes outputs for new inputs. This end-to-end architecture uses a VAE loss, including reconstruction and prior losses. SPONSOR MESSAGES: *** CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting! https://centml.ai/pricing/ Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich. Go to https://tufalabs.ai/ *** TRANSCRIPT + RESEARCH OVERVIEW: https://www.dropbox.com/scl/fi/j7m0gaz1126y594gswtma/CLEMMLST.pdf?rlkey=y5qvwq2er5nchbcibm07rcfpq&dl=0 Clem and Matthew: https://www.linkedin.com/in/clement-bonnet16/ https://github.com/clement-bonnet https://mvmacfarlane.github.io/ TOC: 1. LPN Fundamentals [00:00:00] 1.1 Introduction to ARC Benchmark and LPN Overview [00:05:05] 1.2 Neural Networks' Challenges with ARC and Program Synthesis [00:06:55] 1.3 Induction vs Transduction in Machine Learning 2. LPN Architecture and Latent Space [00:11:50] 2.1 LPN Architecture and Latent Space Implementation [00:16:25] 2.2 LPN Latent Space Encoding and VAE Architecture [00:20:25] 2.3 Gradient-Based Search Training Strategy [00:23:39] 2.4 LPN Model Architecture and Implementation Details 3. Implementation and Scaling [00:27:34] 3.1 Training Data Generation and re-ARC Framework [00:31:28] 3.2 Limitations of Latent Space and Multi-Thread Search [00:34:43] 3.3 Program Composition and Computational Graph Architecture 4. Advanced Concepts and Future Directions [00:45:09] 4.1 AI Creativity and Program Synthesis Approaches [00:49:47] 4.2 Scaling and Interpretability in Latent Space Models REFS: [00:00:05] ARC benchmark, Chollet: https://arxiv.org/abs/2412.04604 [00:02:10] Latent Program Spaces, Bonnet, Macfarlane: https://arxiv.org/abs/2411.08706 [00:07:45] Kevin Ellis's work on program generation: https://www.cs.cornell.edu/~ellisk/ [00:08:45] Induction vs transduction in abstract reasoning, Li et al.: https://arxiv.org/abs/2411.02272 [00:17:40] VAEs, Kingma, Welling: https://arxiv.org/abs/1312.6114 [00:27:50] re-ARC, Hodel: https://github.com/michaelhodel/re-arc [00:29:40] Grid size in ARC tasks, Chollet: https://github.com/fchollet/ARC-AGI [00:33:00] Critique of deep learning, Marcus: https://arxiv.org/vc/arxiv/papers/2002/2002.06177v1.pdf
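As a rough picture of the encode / search-in-latent-space / decode loop described at the top of this entry, here is a toy PyTorch sketch; the layer sizes, the plain MSE search objective (a full VAE loss would add a prior term), and the random stand-in grids are assumptions for illustration, not Clement's implementation.

```python
import torch
import torch.nn as nn

# Toy latent-program setup: encode demonstration input-output pairs to a latent z,
# refine z by gradient search, then decode an output for a new input.
d_grid, d_latent = 64, 16

encoder = nn.Sequential(nn.Linear(2 * d_grid, 128), nn.ReLU(), nn.Linear(128, d_latent))
decoder = nn.Sequential(nn.Linear(d_grid + d_latent, 128), nn.ReLU(), nn.Linear(128, d_grid))

def decode(z, x):
    # Condition the decoder on the input grid and the latent "program".
    return decoder(torch.cat([x, z.expand(x.shape[0], -1)], dim=-1))

# A few demonstration input-output pairs for one task (random stand-ins here).
xs, ys = torch.rand(3, d_grid), torch.rand(3, d_grid)

# Initial latent from the encoder: mean over the pair embeddings.
z = encoder(torch.cat([xs, ys], dim=-1)).mean(0, keepdim=True).detach().requires_grad_(True)

# Test-time search: nudge z so the decoder reproduces the demonstration outputs.
opt = torch.optim.Adam([z], lr=1e-2)
for _ in range(100):
    loss = nn.functional.mse_loss(decode(z, xs), ys)  # reconstruction term only
    opt.zero_grad()
    loss.backward()
    opt.step()

# Apply the refined latent to a new input.
x_new = torch.rand(1, d_grid)
y_pred = decode(z, x_new)
```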

LessWrong Curated Podcast
“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq

LessWrong Curated Podcast

Play Episode Listen Later Jan 10, 2025 15:56


TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: It seems likely to find features of the activations - features that help explain the statistical structure of activation spaces - rather than features of the model - the features the model's own computations make use of. Written at Apollo Research. Introduction. Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem. Let's walk through this claim. What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This [...] --- Outline: (00:33) Introduction (02:40) Examples illustrating the general problem (12:29) The general problem (13:26) What can we do about this? The original text contained 11 footnotes which were omitted from this narration. --- First published: January 8th, 2025 Source: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed --- Narrated by TYPE III AUDIO.

LessWrong Curated Podcast
“Shallow review of technical AI safety, 2024” by technicalities, Stag, Stephen McAleese, jordine, Dr. David Mathers

LessWrong Curated Podcast

Play Episode Listen Later Dec 30, 2024 117:07


(Image from aisafety.world.) The following is a list of live agendas in technical AI safety, updating our post from last year. It is "shallow" in the sense that 1) we are not specialists in almost any of it and that 2) we only spent about an hour on each entry. We also only use public information, so we are bound to be off by some additional factor. The point is to help anyone look up some of what is happening, or that thing you vaguely remember reading about; to help new researchers orient and know (some of) their options; to help policy people know who to talk to for the actual information; and ideally to help funders see quickly what has already been funded and how much (but this proves to be hard). "AI safety" means many things. We're targeting work that intends to prevent very competent [...] --- Outline: (01:33) Editorial (08:15) Agendas with public outputs (08:19) 1. Understand existing models (08:24) Evals (14:49) Interpretability (27:35) Understand learning (31:49) 2. Control the thing (40:31) Prevent deception and scheming (46:30) Surgical model edits (49:18) Goal robustness (50:49) 3. Safety by design (52:57) 4. Make AI solve it (53:05) Scalable oversight (01:00:14) Task decomp (01:00:28) Adversarial (01:04:36) 5. Theory (01:07:27) Understanding agency (01:15:47) Corrigibility (01:17:29) Ontology Identification (01:21:24) Understand cooperation (01:26:32) 6. Miscellaneous (01:50:40) Agendas without public outputs this year (01:51:04) Graveyard (known to be inactive) (01:52:00) Method (01:55:09) Other reviews and taxonomies (01:56:11) Acknowledgments. The original text contained 9 footnotes which were omitted from this narration. --- First published: December 29th, 2024 Source: https://www.lesswrong.com/posts/fAW6RXLKTLHC3WXkS/shallow-review-of-technical-ai-safety-2024 --- Narrated by TYPE III AUDIO.

Breakthroughs in AI for Biology: AI Lab Groups & Protein Model Interpretability with Prof James Zou

Play Episode Listen Later Dec 18, 2024 62:49


Nathan discusses groundbreaking AI and biology research with Stanford Professor James Zou from the Chan Zuckerberg Initiative. In this episode of The Cognitive Revolution, we explore two remarkable papers: the virtual lab framework that created novel COVID treatments with minimal human oversight, and InterPLM's discovery of new protein motifs through mechanistic interpretability. Join us for a fascinating discussion about how AI is revolutionizing biological research and drug discovery. Got questions about AI? Submit them for our upcoming AMA episode + take our quick listener survey to help us serve you better - https://docs.google.com/forms/d/e/1FAIpQLSefHvs1-1g5xeqM7wSirQkzTtK-1fgW_OjyHPH9DvmbVAjEzA/viewform SPONSORS: SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognitive Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive 80,000 Hours: 80,000 Hours is dedicated to helping you find a fulfilling career that makes a difference. With nearly a decade of research, they offer in-depth material on AI risks, AI policy, and AI safety research. Explore their articles, career reviews, and a podcast featuring experts like Anthropic CEO Dario. Everything is free, including their Career Guide. Visit https://80000hours.org/cognitiverevolution to start making a meaningful impact today. GiveWell: GiveWell has spent over 17 years researching global health and philanthropy to identify the highest-impact giving opportunities. Over 125,000 donors have contributed more than $2 billion, saving over 200,000 lives through evidence-backed recommendations. First-time donors can have their contributions matched up to $100 before year-end. Visit https://GiveWell.org, select podcast, and enter Cognitive Revolution at checkout to make a difference today. CHAPTERS: (00:00:00) Teaser (00:00:35) About the Episode (00:04:30) Virtual Lab (00:08:09) AI Designs Nanobodies (00:14:43) Novel AI Pipeline (00:20:31) Human-AI Interaction (Part 1) (00:20:33) Sponsors: SelectQuote | Oracle Cloud Infrastructure (OCI) (00:23:22) Human-AI Interaction (Part 2) (00:32:31) Sponsors: 80,000 Hours | GiveWell (00:35:10) Project Cost & Time (00:41:04) Future of AI in Bio (00:45:46) InterPLM: Intro (00:50:30) AI Found New Concepts (00:55:02) Discovering New Motifs (00:57:14) Limitations & Future (01:01:32) Outro SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://www.linkedin.com/in/nathanlabenz/ Youtube: https://www.youtube.com/@CognitiveRevolutionPodcast

Fluidity
Do AI As Engineering Instead

Fluidity

Play Episode Listen Later Dec 15, 2024 15:47


Current AI practice is not engineering, even when it aims for practical applications, because it is not based on scientific understanding. Enforcing engineering norms on the field could lead to considerably safer systems.   https://betterwithout.ai/AI-as-engineering   This episode has a lot of links! Here they are.   Michael Nielsen's “The role of ‘explanation' in AI”. https://michaelnotebook.com/ongoing/sporadica.html#role_of_explanation_in_AI   Subbarao Kambhampati's “Changing the Nature of AI Research”. https://dl.acm.org/doi/pdf/10.1145/3546954   Chris Olah and his collaborators: “Thread: Circuits”. distill.pub/2020/circuits/ “An Overview of Early Vision in InceptionV1”. distill.pub/2020/circuits/early-vision/   Dai et al., “Knowledge Neurons in Pretrained Transformers”. https://arxiv.org/pdf/2104.08696.pdf   Meng et al.: “Locating and Editing Factual Associations in GPT.” rome.baulab.info “Mass-Editing Memory in a Transformer,” https://arxiv.org/pdf/2210.07229.pdf   François Chollet on image generators putting the wrong number of legs on horses: twitter.com/fchollet/status/1573879858203340800   Neel Nanda's “Longlist of Theories of Impact for Interpretability”, https://www.lesswrong.com/posts/uK6sQCNMw8WKzJeCQ/a-longlist-of-theories-of-impact-for-interpretability   Zachary C. Lipton's “The Mythos of Model Interpretability”. https://arxiv.org/abs/1606.03490   Meng et al., “Locating and Editing Factual Associations in GPT”. https://arxiv.org/pdf/2202.05262.pdf   Belrose et al., “Eliciting Latent Predictions from Transformers with the Tuned Lens”. https://arxiv.org/abs/2303.08112   “Progress measures for grokking via mechanistic interpretability”. https://arxiv.org/abs/2301.05217   Conmy et al., “Towards Automated Circuit Discovery for Mechanistic Interpretability”. https://arxiv.org/abs/2304.14997   Elhage et al., “Softmax Linear Units,” transformer-circuits.pub/2022/solu/index.html   Filan et al., “Clusterability in Neural Networks,” https://arxiv.org/pdf/2103.03386.pdf   Cammarata et al., “Curve circuits,” distill.pub/2020/circuits/curve-circuits/   You can support the podcast and get episodes a week early, by supporting the Patreon: https://www.patreon.com/m/fluidityaudiobooks   If you like the show, consider buying me a coffee: https://www.buymeacoffee.com/mattarnold   Original music by Kevin MacLeod.   This podcast is under a Creative Commons Attribution Non-Commercial International 4.0 License.

Machine Learning Street Talk
Neel Nanda - Mechanistic Interpretability (Sparse Autoencoders)

Machine Learning Street Talk

Play Episode Listen Later Dec 7, 2024 222:36


Neel Nanda, a senior research scientist at Google DeepMind, leads their mechanistic interpretability team. In this extensive interview, he discusses his work trying to understand how neural networks function internally. At just 25 years old, Nanda has quickly become a prominent voice in AI research after completing his pure mathematics degree at Cambridge in 2020. Nanda reckons that machine learning is unique because we create neural networks that can perform impressive tasks (like complex reasoning and software engineering) without understanding how they work internally. He compares this to having computer programs that can do things no human programmer knows how to write. His work focuses on "mechanistic interpretability" - attempting to uncover and understand the internal structures and algorithms that emerge within these networks. SPONSOR MESSAGES: *** CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. https://centml.ai/pricing/ Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on ARC and AGI, they just acquired MindsAI - the current winners of the ARC challenge. Are you interested in working on ARC, or getting involved in their events? Go to https://tufalabs.ai/ *** SHOWNOTES, TRANSCRIPT, ALL REFERENCES (DON'T MISS!): https://www.dropbox.com/scl/fi/36dvtfl3v3p56hbi30im7/NeelShow.pdf?rlkey=pq8t7lyv2z60knlifyy17jdtx&st=kiutudhc&dl=0 We riff on: * How neural networks develop meaningful internal representations beyond simple pattern matching * The effectiveness of chain-of-thought prompting and why it improves model performance * The importance of hands-on coding over extensive paper reading for new researchers * His journey from Cambridge to working with Chris Olah at Anthropic and eventually Google DeepMind * The role of mechanistic interpretability in AI safety NEEL NANDA: https://www.neelnanda.io/ https://scholar.google.com/citations?user=GLnX3MkAAAAJ&hl=en https://x.com/NeelNanda5 Interviewer - Tim Scarfe TOC: 1. Part 1: Introduction [00:00:00] 1.1 Introduction and Core Concepts Overview 2. Part 2: Outside Interview [00:06:45] 2.1 Mechanistic Interpretability Foundations 3. Part 3: Main Interview [00:32:52] 3.1 Mechanistic Interpretability 4. Neural Architecture and Circuits [01:00:31] 4.1 Biological Evolution Parallels [01:04:03] 4.2 Universal Circuit Patterns and Induction Heads [01:11:07] 4.3 Entity Detection and Knowledge Boundaries [01:14:26] 4.4 Mechanistic Interpretability and Activation Patching 5. Model Behavior Analysis [01:30:00] 5.1 Golden Gate Claude Experiment and Feature Amplification [01:33:27] 5.2 Model Personas and RLHF Behavior Modification [01:36:28] 5.3 Steering Vectors and Linear Representations [01:40:00] 5.4 Hallucinations and Model Uncertainty 6. Sparse Autoencoder Architecture [01:44:54] 6.1 Architecture and Mathematical Foundations [02:22:03] 6.2 Core Challenges and Solutions [02:32:04] 6.3 Advanced Activation Functions and Top-k Implementations [02:34:41] 6.4 Research Applications in Transformer Circuit Analysis 7. Feature Learning and Scaling [02:48:02] 7.1 Autoencoder Feature Learning and Width Parameters [03:02:46] 7.2 Scaling Laws and Training Stability [03:11:00] 7.3 Feature Identification and Bias Correction [03:19:52] 7.4 Training Dynamics Analysis Methods 8. Engineering Implementation [03:23:48] 8.1 Scale and Infrastructure Requirements [03:25:20] 8.2 Computational Requirements and Storage [03:35:22] 8.3 Chain-of-Thought Reasoning Implementation [03:37:15] 8.4 Latent Structure Inference in Language Models

GELU, MMLU, & X-Risk Defense in Depth, with the Great Dan Hendrycks

Play Episode Listen Later Oct 19, 2024 158:31


Join Nathan for an expansive conversation with Dan Hendrycks, Executive Director of the Center for AI Safety and Advisor to Elon Musk's XAI. In this episode of The Cognitive Revolution, we explore Dan's groundbreaking work in AI safety and alignment, from his early contributions to activation functions to his recent projects on AI robustness and governance. Discover insights on representation engineering, circuit breakers, and tamper-resistant training, as well as Dan's perspectives on AI's impact on society and the future of intelligence. Don't miss this in-depth discussion with one of the most influential figures in AI research and safety. Check out some of Dan's research papers: MMLU: https://arxiv.org/abs/2009.03300 GELU: https://arxiv.org/abs/1606.08415 Machiavelli Benchmark: https://arxiv.org/abs/2304.03279 Circuit Breakers: https://arxiv.org/abs/2406.04313 Tamper Resistant Safeguards: https://arxiv.org/abs/2408.00761 Statement on AI Risk: https://www.safe.ai/work/statement-on-ai-risk Apply to join over 400 Founders and Execs in the Turpentine Network: https://www.turpentinenetwork.co/ SPONSORS: Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive. LMNT: LMNT is a zero-sugar electrolyte drink mix that's redefining hydration and performance. Ideal for those who fast or anyone looking to optimize their electrolyte intake. Support the show and get a free sample pack with any purchase at https://drinklmnt.com/tcr. Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitiverevolution Oracle: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive CHAPTERS: (00:00:00) Teaser (00:00:48) About the Show (00:02:17) About the Episode (00:05:41) Intro (00:07:19) GELU Activation Function (00:10:48) Signal Filtering (00:12:46) Scaling Maximalism (00:18:35) Sponsors: Shopify | LMNT (00:22:03) New Architectures (00:25:41) AI as Complex System (00:32:35) The Machiavelli Benchmark (00:34:10) Sponsors: Notion | Oracle (00:37:20) Understanding MMLU Scores (00:45:23) Reasoning in Language Models (00:49:18) Multimodal Reasoning (00:54:53) World Modeling and Sora (00:57:07) Arc Benchmark and Hypothesis (01:01:06) Humanity's Last Exam (01:08:46) Benchmarks and AI Ethics (01:13:28) Robustness and Jailbreaking (01:18:36) Representation Engineering (01:30:08) Convergence of Approaches (01:34:18) Circuit Breakers (01:37:52) Tamper Resistance (01:49:10) Interpretability vs. Robustness (01:53:53) Open Source and AI Safety (01:58:16) Computational Irreducibility (02:06:28) Neglected Approaches (02:12:47) Truth Maxing and XAI (02:19:59) AI-Powered Forecasting (02:24:53) Chip Bans and Geopolitics (02:33:30) Working at CAIS (02:35:03) Extinction Risk Statement (02:37:24) Outro

Unsupervised Learning
Ep 44: Co-Founder of Together.AI Percy Liang on What's Next in Research, Reaction to o1 and How AI will Change Simulation

Unsupervised Learning

Play Episode Listen Later Oct 3, 2024 66:20


Percy Liang is a Stanford professor and co-founder of Together AI, driving some of the most critical advances in AI research. Percy is also a trained classical pianist, which clearly influences the way he thinks about technology. We explored the evolution of AI from simple token prediction to autonomous agents capable of long-term problem-solving, the problem of interpretability, and the future of AI safety in complex, real-world systems. [0:00] Intro [0:46] Discussing OpenAI's O1 Model [2:21] The Evolution of AI Agents [3:27] Challenges and Benchmarks in AI [4:38] Compatibility and Integration Issues [6:17] The Future of AI Scaffolding [10:05] Academia's Role in AI Research [15:17] AI Safety and Holistic Approaches [18:32] Regulation and Transparency in AI [21:42] Generative Agents and Social Simulations [29:14] The State of AI Evaluations [32:07] Exploring Evaluation in Language Models [35:13] The Challenge of Interpretability [39:31] Innovations in Model Architectures [43:18] The Future of Inference and Customization [46:46] Milestones in AI Research and Reasoning [49:43] Robotics and AI: The Road Ahead [52:24] AI in Music: A Harmonious Future [55:52] AI's Role in Education and Beyond [56:30] Quickfire [59:16] Jacob and Pat Debrief. With your co-hosts: @jacobeffron - Partner at Redpoint, Former PM Flatiron Health @patrickachase - Partner at Redpoint, Former ML Engineer LinkedIn @ericabrescia - Former COO Github, Founder Bitnami (acq'd by VMWare) @jordan_segall - Partner at Redpoint

AXRP - the AI X-risk Research Podcast
35 - Peter Hase on LLM Beliefs and Easy-to-Hard Generalization

AXRP - the AI X-risk Research Podcast

Play Episode Listen Later Aug 24, 2024 137:24


How do we figure out what large language models believe? In fact, do they even have beliefs? Do those beliefs have locations, and if so, can we edit those locations to change the beliefs? Also, how are we going to get AI to perform tasks so hard that we can't figure out if they succeeded at them? In this episode, I chat to Peter Hase about his research into these questions. Patreon: patreon.com/axrpodcast Ko-fi: ko-fi.com/axrpodcast The transcript: https://axrp.net/episode/2024/08/24/episode-35-peter-hase-llm-beliefs-easy-to-hard-generalization.html   Topics we discuss, and timestamps: 0:00:36 - NLP and interpretability 0:10:20 - Interpretability lessons 0:32:22 - Belief interpretability 1:00:12 - Localizing and editing models' beliefs 1:19:18 - Beliefs beyond language models 1:27:21 - Easy-to-hard generalization 1:47:16 - What do easy-to-hard results tell us? 1:57:33 - Easy-to-hard vs weak-to-strong 2:03:50 - Different notions of hardness 2:13:01 - Easy-to-hard vs weak-to-strong, round 2 2:15:39 - Following Peter's work   Peter on Twitter: https://x.com/peterbhase   Peter's papers: Foundational Challenges in Assuring Alignment and Safety of Large Language Models: https://arxiv.org/abs/2404.09932 Do Language Models Have Beliefs? Methods for Detecting, Updating, and Visualizing Model Beliefs: https://arxiv.org/abs/2111.13654 Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models: https://arxiv.org/abs/2301.04213 Are Language Models Rational? The Case of Coherence Norms and Belief Revision: https://arxiv.org/abs/2406.03442 The Unreasonable Effectiveness of Easy Training Data for Hard Tasks: https://arxiv.org/abs/2401.06751   Other links: Toy Models of Superposition: https://transformer-circuits.pub/2022/toy_model/index.html Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV): https://arxiv.org/abs/1711.11279 Locating and Editing Factual Associations in GPT (aka the ROME paper): https://arxiv.org/abs/2202.05262 Of nonlinearity and commutativity in BERT: https://arxiv.org/abs/2101.04547 Inference-Time Intervention: Eliciting Truthful Answers from a Language Model: https://arxiv.org/abs/2306.03341 Editing a classifier by rewriting its prediction rules: https://arxiv.org/abs/2112.01008 Discovering Latent Knowledge Without Supervision (aka the Collin Burns CCS paper): https://arxiv.org/abs/2212.03827 Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision: https://arxiv.org/abs/2312.09390 Concrete problems in AI safety: https://arxiv.org/abs/1606.06565 Rissanen Data Analysis: Examining Dataset Characteristics via Description Length: https://arxiv.org/abs/2103.03872   Episode art by Hamish Doodles: hamishdoodles.com

The Nonlinear Library
AF - Clarifying alignment vs capabilities by Richard Ngo

The Nonlinear Library

Play Episode Listen Later Aug 19, 2024 13:26


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Clarifying alignment vs capabilities, published by Richard Ngo on August 19, 2024 on The AI Alignment Forum. A core distinction in AGI safety is between alignment and capabilities. However, I think this distinction is a very fuzzy one, which has led to a lot of confusion. In this post I'll describe some of the problems with how people typically think about it, and offer a replacement set of definitions. "Alignment" and "capabilities" are primarily properties of AIs not of AI research The first thing to highlight is that the distinction between alignment and capabilities is primarily doing useful work when we think of them as properties of AIs. This distinction is still under-appreciated by the wider machine learning community. ML researchers have historically thought about performance of models almost entirely with respect to the tasks they were specifically trained on. However, the rise of LLMs has vindicated the alignment community's focus on general capabilities, and now it's much more common to assume that performance on many tasks (including out-of-distribution tasks) will improve roughly in parallel. This is a crucial assumption for thinking about risks from AGI. Insofar as the ML community has thought about alignment, it has mostly focused on aligning models' behavior to their training objectives. The possibility of neural networks aiming to achieve internally-represented goals is still not very widely understood, making it hard to discuss and study the reasons those goals might or might not be aligned with the values of (any given set of) humans. To be fair, the alignment community has caused some confusion by describing models as more or less "aligned", rather than more or less "aligned to X" for some specified X. I'll talk more about this confusion, and how we should address it, in a later post. But the core point is that AIs might develop internally-represented goals or values that we don't like, and we should try to avoid that. However, extending "alignment" and "capabilities" from properties of AIs to properties of different types of research is a fraught endeavor. It's tempting to categorize work as alignment research to the extent that it can be used to make AIs more aligned (to many possible targets), and as capabilities research to the extent that it can be used to make AIs more capable. But this approach runs into (at least) three major problems. Firstly, in general it's very difficult to categorize research by its impacts. Great research often links together ideas from many different subfields, typically in ways that only become apparent throughout the course of the research. We see this in many historical breakthroughs which shed light on a range of different domains. For example, early physicists studying the motions of the stars eventually derived laws governing all earthly objects. Meanwhile Darwin's study of barnacles and finches led him to principles governing the evolution of all life. Analogously, we should expect that big breakthroughs in our understanding of neural networks and deep learning would be useful in many different ways. More concretely, there are many cases where research done under the banner of alignment has advanced, or plausibly will advance, AI capabilities to a significant extent. This undermines our ability to categorize research by its impacts. 
Central examples include: RLHF makes language models more obedient, but also more capable of coherently carrying out tasks. Scalable oversight techniques can catch misbehavior, but will likely become important for generating high-quality synthetic training data, as it becomes more and more difficult for unassisted humans to label AI outputs correctly. Interpretability techniques will both allow us to inspect AI cognition and also extract more capable behavior from them (e.g. via ...

Popular Mechanistic Interpretability: Goodfire Lights the Way to AI Safety

Play Episode Listen Later Aug 17, 2024 115:33


Nathan explores the cutting-edge field of mechanistic interpretability with Dan Balsam and Tom McGrath, co-founders of Goodfire. In this episode of The Cognitive Revolution, we delve into the science of understanding AI models' inner workings, recent breakthroughs, and the potential impact on AI safety and control. Join us for an insightful discussion on sparse autoencoders, polysemanticity, and the future of interpretable AI.

Papers:
Very accessible article on types of representations: Local vs Distributed Coding
Theoretical understanding of how models might pack concepts into their representations: Toy Models of Superposition
How structure in the world gives rise to structure in the latent space: The Geometry of Categorical and Hierarchical Concepts in Large Language Models
Using sparse autoencoders to pull apart language model representations: Sparse Autoencoders / Towards Monosemanticity / Scaling Monosemanticity
Finding & teaching concepts in superhuman systems: Acquisition of Chess Knowledge in AlphaZero / Bridging the Human-AI Knowledge Gap: Concept Discovery and Transfer in AlphaZero
Connecting microscopic learning to macroscopic phenomena: The Quantization Model of Neural Scaling
Understanding at scale: Language models can explain neurons in language models

Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.com/to/JCkphVqj SPONSORS: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist.
CHAPTERS: (00:00:00) About the Show (00:00:22) About the Episode (00:03:52) Introduction and Background (00:08:43) State of Interpretability Research (00:12:06) Key Insights in Interpretability (00:16:53) Polysemanticity and Model Compression (Part 1) (00:17:00) Sponsors: Oracle | Brave (00:19:04) Polysemanticity and Model Compression (Part 2) (00:22:50) Sparse Autoencoders Explained (00:27:19) Challenges in Interpretability Research (Part 1) (00:30:54) Sponsors: Omneky | Squad (00:32:41) Challenges in Interpretability Research (Part 2) (00:33:51) Goodfire's Vision and Mission (00:37:08) Interpretability and Scientific Models (00:43:48) Architecture and Interpretability Techniques (00:50:08) Quantization and Model Representation (00:54:07) Future of Interpretability Research (01:01:38) Skepticism and Challenges in Interpretability (01:07:51) Alternative Architectures and Universality (01:13:39) Goodfire's Business Model and Funding (01:18:47) Building the Team and Future Plans (01:31:03) Hiring and Getting Involved in Interpretability (01:51:28) Closing Remarks (01:51:38) Outro

Waking Up With AI
AI Interpretability

Waking Up With AI

Play Episode Listen Later Aug 15, 2024 11:32


In this week's episode, Katherine Forrest and Anna Gressel unravel interpretability, an evolving field of AI safety research, while exploring its potential significance for regulators as well as its relation to concepts like explainability and transparency. Learn more about Paul, Weiss's Artificial Intelligence Practice: https://www.paulweiss.com/practices/litigation/artificial-intelligence

Data Science Salon Podcast
AI in Action: From Machine Learning Interpretability to Cybersecurity with Serg Masís and Nirmal Budhathoki

Data Science Salon Podcast

Play Episode Listen Later Aug 6, 2024 25:37


In this DSS Podcast, Anna Anisin welcomes Serg Masís, Climate and Agronomic Data Scientist at Syngenta. Serg, an expert in machine learning interpretability and responsible AI, shares his diverse background and journey into data science. He discusses the challenges of building fair and reliable ML models, emphasizing the importance of interpretability and trust in AI. Serg also discusses his latest book, "Interpretable Machine Learning with Python," and provides valuable insights for data scientists striving to create more transparent and effective AI solutions. In another compelling episode, Anna sits down with Nirmal Budhathoki, Senior Data Scientist at Microsoft. Nirmal, who has extensive experience at VMware Carbon Black and Wells Fargo, focuses on the intersection of AI and cybersecurity. He shares his journey into security data science, discussing the unique challenges and critical importance of applying AI to enhance cybersecurity measures. Nirmal highlights the pressing need for AI in this field, practical use cases, and the complexities involved in integrating AI with security practices, offering a valuable perspective for professionals navigating this dynamic landscape.

The Nonlinear Library
LW - Open Source Automated Interpretability for Sparse Autoencoder Features by kh4dien

The Nonlinear Library

Play Episode Listen Later Jul 31, 2024 22:41


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Open Source Automated Interpretability for Sparse Autoencoder Features, published by kh4dien on July 31, 2024 on LessWrong.

Background
Sparse autoencoders recover a diversity of interpretable, monosemantic features, but present an intractable problem of scale to human labelers. We investigate different techniques for generating and scoring text explanations of SAE features.

Key Findings
Open source models generate and evaluate text explanations of SAE features reasonably well, albeit somewhat worse than closed models like Claude 3.5 Sonnet. Explanations found by LLMs are similar to explanations found by humans. Automatically interpreting 1.5M features of GPT-2 with the current pipeline would cost $1300 in API calls to Llama 3.1 or $8500 with Claude 3.5 Sonnet. Prior methods cost ~$200k with Claude. Code can be found at https://github.com/EleutherAI/sae-auto-interp. We built a small dashboard to explore explanations and their scores: https://cadentj.github.io/demo/

Generating Explanations
Sparse autoencoders decompose activations into a sum of sparse feature directions. We leverage language models to generate explanations for activating text examples. Prior work prompts language models with token sequences that activate MLP neurons (Bills et al. 2023), by showing the model a list of tokens followed by their respective activations, separated by a tab, and listed one per line. We instead highlight max activating tokens in each example with a set of delimiters. Optionally, we choose a threshold of the example's max activation for which tokens are highlighted. This helps the model distinguish important information for some densely activating features. We experiment with several methods for augmenting the explanation. Full prompts are available here. Chain of thought improves general reasoning capabilities in language models. We few-shot the model with several examples of a thought process that mimics a human approach to generating explanations. We expect that verbalizing thought might capture richer relations between tokens and context. Activations distinguish which sentences are more representative of a feature. We provide the magnitude of activating tokens after each example. We compute the logit weights for each feature through the path expansion W_U d_i, where W_U is the model unembed and d_i is the decoder direction for a specific feature. The top promoted tokens capture a feature's causal effects, which are useful for sharpening explanations. This method is equivalent to the logit lens (nostalgebraist 2020); future work might apply variants that reveal other causal information (Belrose et al. 2023; Gandelsman et al. 2024).

Scoring explanations
Text explanations represent interpretable "concepts" in natural language. How do we evaluate the faithfulness of explanations to the concepts actually contained in SAE features? We view the explanation as a classifier which predicts whether a feature is present in a context. An explanation should have high recall - identifying most activating text - as well as high precision - distinguishing between activating and non-activating text. Consider a feature which activates on the word "stop" after "don't" or "won't" (Gao et al. 2024). There are two failure modes: 1. The explanation could be too broad, identifying the feature as activating on the word "stop". It would have high recall on held-out text, but low precision. 2. The explanation could be too narrow, stating the feature activates on the word "stop" only after "don't". This would have high precision, but low recall. One approach to scoring explanations is "simulation scoring" (Bills et al. 2023), which uses a language model to assign an activation to each token in a text, then measures the correlation between predicted and real activations. This method is biased toward recall; given a bro...
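To illustrate the "explanation as classifier" framing above, here is a toy sketch that scores an explanation by precision and recall over held-out text; the keyword-matching stand-in for an LLM judge and the example data are assumptions for illustration, not the sae-auto-interp pipeline.

```python
def explanation_predicts(explanation_keyword: str, text: str) -> bool:
    """Toy stand-in for an LLM judge: does the explanation say the feature fires here?"""
    return explanation_keyword in text

def precision_recall(explanation_keyword, labeled_texts):
    """labeled_texts: list of (text, feature_actually_active) pairs."""
    tp = fp = fn = 0
    for text, active in labeled_texts:
        predicted = explanation_predicts(explanation_keyword, text)
        if predicted and active:
            tp += 1
        elif predicted and not active:
            fp += 1
        elif not predicted and active:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Feature that fires on "stop" only after "don't"/"won't" (the example from the post).
held_out = [
    ("please don't stop the music", True),
    ("we won't stop believing", True),
    ("the bus came to a stop", False),        # "stop" without "don't"/"won't"
    ("nothing to see here", False),
]

# Too-broad explanation ("activates on the word 'stop'"): high recall, low precision.
print(precision_recall("stop", held_out))        # ~(0.67, 1.0)
# Too-narrow explanation ("activates on 'stop' only after 'don't'"): high precision, low recall.
print(precision_recall("don't stop", held_out))  # (1.0, 0.5)
```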

The Nonlinear Library
AF - A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team by Lee Sharkey

The Nonlinear Library

Play Episode Listen Later Jul 18, 2024 32:24


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A List of 45+ Mech Interp Project Ideas from Apollo Research's Interpretability Team, published by Lee Sharkey on July 18, 2024 on The AI Alignment Forum.
Why we made this list: The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that we were excited about! Previous lists of project ideas (such as Neel's collation of 200 Concrete Open Problems in Mechanistic Interpretability) have been very useful for people breaking into the field. But for all its merits, that list is now over a year and a half old. Therefore, many project ideas in that list aren't an up-to-date reflection of what some researchers consider the frontiers of mech interp. We therefore thought it would be helpful to share our list of project ideas!
Comments and caveats: Some of these projects are more precisely scoped than others; some are vague, others more developed. Not every member of the team endorses every project as high priority, but usually more than one team member supports each one, and in many cases most of the team is supportive of someone working on it. We attribute each idea to the person(s) who generated it. We've grouped the project ideas into categories for convenience, but some projects span multiple categories, and we don't put a huge amount of weight on this particular categorisation. We hope some people find this list helpful! We would love to see people working on these! If any sound interesting to you and you'd like to chat about it, don't hesitate to reach out.
Foundational work on sparse dictionary learning for interpretability
Transcoder-related project ideas (see [2406.11944] Transcoders Find Interpretable LLM Feature Circuits):
[Nix] Training and releasing high-quality transcoders, probably using top-k. GPT-2 is a classic candidate for this; I'd be excited for people to try hard on even smaller models, e.g. GELU 4L.
[Nix] Good tooling for using transcoders: a nice programming API to attribute an input to a collection of paths (see Dunefsky et al), and perhaps a web user interface, maybe in collaboration with Neuronpedia. This would need a GPU server constantly running, but I'm optimistic you could do it with ~an A4000.
[Nix] Further circuit analysis using transcoders: take random input sequences, run transcoder attribution on them, examine the output, and summarize the findings. High-level summary statistics of how much attribution goes through error terms and how many pathways are needed would be valuable. Also: explaining specific behaviors (IOI, greater-than) with high standards for specificity and faithfulness. Might be convoluted if accuracy [I could generate more ideas here; feel free to reach out: nix@apolloresearch.ai]
[Nix, Lee] Cross-layer superposition. Does it happen? Probably, but it would be nice to have specific examples! Look for features with similar decoder vectors, and do exploratory research to figure out what exactly is going on. What precisely does it mean? Answering this question seems likely to shed light on the question of 'What is a feature?'.
[Lucius] Improving transcoder architectures. Some MLPs or attention layers may implement a simple linear transformation in addition to actual computation. If we modify our transcoders to include a linear 'bypass' that is not counted in the sparsity penalty, do we improve performance, since we are no longer unduly penalizing these linear transformations that would always be present and active? If we train multiple transcoders in different layers at the same time, can we include a sparsity penalty for their interactions with each other, encouraging a decomposition of the network that leaves us with as few interactions between features a...
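To make the 'linear bypass' idea in that last project concrete, here is a minimal sketch (not Apollo's code) of a transcoder whose sparsity penalty applies only to the feature activations, while a separate linear path is left unpenalised. The class name, dimensions, and L1 coefficient are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class BypassTranscoder(nn.Module):
    # Sketch only: a transcoder maps an MLP's input activations to its output
    # activations through a sparse feature layer. The extra linear 'bypass' is
    # exempt from the sparsity penalty, so purely linear computation is not
    # forced through sparse features. Sizes are illustrative assumptions.
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)
        self.bypass = nn.Linear(d_model, d_model, bias=False)  # hypothetical addition

    def forward(self, mlp_in: torch.Tensor):
        feats = torch.relu(self.encoder(mlp_in))   # sparse feature activations
        recon = self.decoder(feats) + self.bypass(mlp_in)
        return recon, feats

def transcoder_loss(recon, feats, mlp_out, l1_coeff=3e-4):
    mse = (recon - mlp_out).pow(2).mean()          # match the real MLP output
    l1 = feats.abs().sum(dim=-1).mean()            # penalise features only,
    return mse + l1_coeff * l1                     # never the bypass weights

# Toy usage with random tensors standing in for cached MLP input/output activations.
x, y = torch.randn(8, 512), torch.randn(8, 512)
tc = BypassTranscoder(d_model=512, d_features=4096)
loss = transcoder_loss(*tc(x), y)
loss.backward()
```

The design question in the project above is whether excluding the bypass weights from the L1 term improves the reconstruction/sparsity trade-off; the sketch only shows where that exclusion would sit.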

Guaranteed Safe AI? World Models, Safety Specs, & Verifiers, with Nora Ammann & Ben Goldhaber

Play Episode Listen Later Jul 17, 2024 106:15


Nathan explores the Guaranteed Safe AI Framework with co-authors Ben Goldhaber and Nora Ammann. In this episode of The Cognitive Revolution, we discuss their groundbreaking position paper on ensuring robust and reliable AI systems. Join us for an in-depth conversation about the three-part system governing AI behavior and its potential impact on the future of AI safety. Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.com/to/JCkphVqj RECOMMENDED PODCAST: Complex Systems Patrick McKenzie (@patio11) talks to experts who understand the complicated but not unknowable systems we rely on. You might be surprised at how quickly Patrick and his guests can put you in the top 1% of understanding for stock trading, tech hiring, and more. Spotify: https://open.spotify.com/show/3Mos4VE3figVXleHDqfXOH Apple: https://podcasts.apple.com/us/podcast/complex-systems-with-patrick-mckenzie-patio11/id1753399812 SPONSORS: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist. CHAPTERS: (00:00:00) About the Show (00:04:39) Introduction (00:07:58) Convergence (00:10:32) Safety guarantees (00:14:35) World model (Part 1) (00:22:22) Sponsors: Oracle | Brave (00:24:31) World model (Part 2) (00:26:55) AI boxing (00:30:28) Verifier (00:33:33) Sponsors: Omneky | Squad (00:35:20) Example: Self-Driving Cars (00:38:08) Moral Desiderata (00:41:09) Trolley Problems (00:47:24) How to approach the world model (00:50:50) Deriving the world model (00:55:13) How far should the world model extend? (01:00:55) Safety through narrowness (01:02:38) Safety specs (01:08:26) Experiments (01:11:25) How GSAI can help in the short term (01:27:40) What would be the basis for the world model? (01:31:23) Interpretability (01:34:24) Competitive dynamics (01:37:35) Regulation (01:42:02) GSAI authors (01:43:25) Outro

The Nonlinear Library
LW - What and Why: Developmental Interpretability of Reinforcement Learning by Garrett Baker

The Nonlinear Library

Play Episode Listen Later Jul 9, 2024 11:12


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: What and Why: Developmental Interpretability of Reinforcement Learning, published by Garrett Baker on July 9, 2024 on LessWrong.
Introduction
I happen to be in that happy stage in the research cycle where I ask for money so I can continue to work on things I think are important. Part of that means justifying what I want to work on to the satisfaction of the people who provide that money. This presents a good opportunity to say what I plan to work on in a more layman-friendly way, for the benefit of LessWrong, potential collaborators, interested researchers, and funders who want to read the fun version of my project proposal. It also provides the opportunity for people who are very pessimistic about the chances I end up doing anything useful by pursuing this to have their say. So if you read this (or skim it) and have critiques (or just recommendations), I'd love to hear them, publicly or privately.
So without further ado, in this post I will be discussing and justifying three aspects of what I'm working on, and my reasons for believing there are gaps in the literature at the intersection of these subjects that are relevant for AI alignment. These are:
1. Reinforcement learning
2. Developmental interpretability
3. Values
Culminating in: developmental interpretability of values in reinforcement learning. Here are brief summaries of each of the sections:
1. Why study reinforcement learning?
  1. Imposed-from-without or in-context reinforcement learning seems a likely path toward agentic AIs.
  2. The "data wall" means active learning or self-training will get more important over time.
  3. There are fewer ways for the usual AI risk arguments to fail in the RL-with-mostly-outcome-based-rewards circumstance than in the supervised learning + RL with mostly process-based rewards (RLHF) circumstance.
2. Why study developmental interpretability?
  1. Causal understanding of the training process allows us to produce reward-structure or environmental-distribution interventions.
  2. Alternative and complementary tools to mechanistic interpretability.
  3. Connections with singular learning theory.
3. Why study values?
  1. The ultimate question of alignment is how we can make AI values compatible with human values, yet this is relatively understudied.
4. Where are the gaps?
  1. Many experiments.
  2. Many theories.
  3. Few experiments testing theories or theories explaining experiments.
Reinforcement learning
Agentic AIs vs Tool AIs
All generally capable adaptive systems are ruled by a general, ground-truth, but slow outer optimization process which reduces incoherency and continuously selects for systems which achieve outcomes in the world. Examples include evolution, business, cultural selection, and to a great extent human brains. That is, except for LLMs. Most of the feedback LLMs receive is supervised, unaffected by the particular actions the LLM takes, and process-based (RLHF-like), where we reward the LLM according to how useful an action looks rather than against a ground truth regarding how well that action (or sequence of actions) achieved its goal. Now I don't want to make the claim that this aspect of how we train LLMs is clearly a fault of them, or in some way limits the problem-solving abilities they can have.
And I do think it possible we see in-context ground-truth optimization processes instantiated as a result of increased scaling, in the same way we see in-context learning. I do, however, want to make the claim that this current paradigm of mostly process-based supervision, if it continues and doesn't itself produce ground-truth-based optimization, makes me optimistic about AI going well. That is, if this lack of general ground-truth optimization continues, we end up with a cached bundle of not very agentic (compared to AIXI) tool AIs with limited search or bootstrapping capabilities. Of course,...

The Nonlinear Library
AF - An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Jul 7, 2024 38:20


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2, published by Neel Nanda on July 7, 2024 on The AI Alignment Forum.
This post represents my personal hot takes, not the opinions of my team or employer. This is a massively updated version of a similar list I made two years ago. There are a lot of mechanistic interpretability papers, and more come out all the time. This can be pretty intimidating if you're new to the field! To try helping out, here's a reading list of my favourite mech interp papers: papers which I think are important to be aware of, often worth skimming, and sometimes worth reading deeply (time permitting). I've annotated these with my key takeaways, what I like about each paper, which bits to deeply engage with vs skim, etc. I wrote a similar post 2 years ago, but a lot has changed since then, thus v2! Note that this is not trying to be a comprehensive literature review - this is my answer to "if you have limited time and want to get up to speed on the field as fast as you can, what should you do". I'm deliberately not following academic norms like necessarily citing the first paper introducing something, or all papers doing some work, and am massively biased towards recent work that is more relevant to the cutting edge. I also shamelessly recommend a bunch of my own work here, sorry!
How to read this post: I've bolded the most important papers to read, which I recommend prioritising. All of the papers are annotated with my interpretation and key takeaways, and tbh I think reading that may be comparably good to skimming the paper. And there are far too many papers to read all of them deeply unless you want to make that a significant priority. I recommend reading all my summaries, noting the papers and areas that excite you, and then trying to dive deeply into those.
Foundational Work
A Mathematical Framework for Transformer Circuits (Nelson Elhage et al, Anthropic) - absolute classic, foundational ideas for how to think about transformers (see my blog post for what to skip). See my YouTube tutorial (I hear this is best watched after reading the paper, and adds additional clarity).
Deeply engage with: all the ideas in the overview section, especially: understanding the residual stream and why it's fundamental; the notion of interpreting paths between interpretable bits (eg input tokens and output logits) where the path is a composition of matrices, and how this is different from interpreting every intermediate activation; and understanding attention heads: what a QK and OV matrix is, how attention heads are independent and additive, and how attention and OV are semi-independent. Also: Skip Trigrams & Skip Trigram bugs, esp understanding why these are a really easy thing to do with attention, and how the bugs are inherent to attention heads separating where to attend to (QK) from what to do once you attend somewhere (OV); and induction heads, esp why this is K-Composition (and how that's different from Q & V composition), how the circuit works mechanistically, and why this is too hard to do in a 1L model.
Skim or skip: eigenvalues or tensor products. They have the worst effort per unit insight of the paper and aren't very important.
Superposition
Superposition is a core principle/problem in model internals. For any given activation (eg the output of MLP13), we believe that there's a massive dictionary of concepts/features the model knows of. Each feature has a corresponding vector, and model activations are a sparse linear combination of these meaningful feature vectors. Further, there are more features in the dictionary than activation dimensions, and they are thus compressed and interfere with each other, essentially causing cascading errors. This phenomenon of compression is called superposition. Toy models of superpositio...
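A tiny numerical illustration of that picture - more features than dimensions, activations as sparse linear combinations, and nonzero interference between features - can make the definition concrete. Everything here (sizes, sparsity level, random dictionary) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features, k_active = 64, 512, 4   # more features than dimensions (assumed sizes)

# A made-up "dictionary": one unit-norm direction per feature.
dictionary = rng.normal(size=(n_features, d_model))
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

# An activation is a sparse linear combination of a few feature directions.
active = rng.choice(n_features, size=k_active, replace=False)
coeffs = rng.uniform(0.5, 2.0, size=k_active)
activation = coeffs @ dictionary[active]

# Because 512 directions can't all be orthogonal in 64 dimensions, features interfere:
# projecting onto an *inactive* feature's direction is small but nonzero.
inactive = np.setdiff1d(np.arange(n_features), active)
interference = dictionary[inactive] @ activation
print("readout on active features:", dictionary[active] @ activation)
print("mean |interference| on inactive features:", np.abs(interference).mean())
```

The "cascading errors" in the summary above are exactly this interference term, accumulated across many features and layers.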

The Gradient Podcast
Kristin Lauter: Private AI, Homomorphic Encryption, and AI for Cryptography

The Gradient Podcast

Play Episode Listen Later Jun 27, 2024 77:13


Episode 129
I spoke with Kristin Lauter about:
* Elliptic curve cryptography and homomorphic encryption
* Standardizing cryptographic protocols
* Machine learning on encrypted data
* Attacking post-quantum cryptography with AI
Enjoy—and let me know what you think!
Kristin is Senior Director of FAIR Labs North America (2022—present), based in Seattle. Her current research areas are AI4Crypto and Private AI. She joined FAIR (Facebook AI Research) in 2021, after 22 years at Microsoft Research (MSR). At MSR she was Partner Research Manager on the senior leadership team of MSR Redmond. Before joining Microsoft in 1999, she was Hildebrandt Assistant Professor of Mathematics at the University of Michigan (1996-1999). She is an Affiliate Professor of Mathematics at the University of Washington (2008—present). She received all her advanced degrees from the University of Chicago: BA (1990), MS (1991), PhD (1996) in Mathematics. She is best known for her work on Elliptic Curve Cryptography, Supersingular Isogeny Graphs in Cryptography, Homomorphic Encryption (SEALcrypto.org), Private AI, and AI4Crypto. She served as President of the Association for Women in Mathematics from 2015-2017 and on the Council of the American Mathematical Society from 2014-2017.
Find me on Twitter for updates on new episodes, and reach me at editor@thegradient.pub for feedback, ideas, guest suggestions. I spend a lot of time on this podcast—if you like my work, you can support me on Patreon :) You can also support upkeep for the full Gradient team/project through a paid subscription on Substack!
Subscribe to The Gradient Podcast: Apple Podcasts | Spotify | Pocket Casts | RSS
Follow The Gradient on Twitter
Outline:
* (00:00) Intro
* (01:10) Llama 3 and encrypted data — where do we want to be?
* (04:20) Tradeoffs: individual privacy vs. aggregated value in e.g. social media forums
* (07:48) Kristin's shift in views on privacy
* (09:40) Earlier work on elliptic curve cryptography — applications and theory
* (10:50) Inspirations from algebra, number theory, and algebraic geometry
* (15:40) On algebra vs. analysis and on clear thinking
* (18:38) Elliptic curve cryptography and security, algorithms and concrete running time
* (21:31) Cryptographic protocols and setting standards
* (26:36) Supersingular isogeny graphs (and higher-dimensional supersingular isogeny graphs)
* (32:26) Hard problems for cryptography and finding new problems
* (36:42) Guaranteeing security for cryptographic protocols and mathematical foundations
* (40:15) Private AI: Crypto-Nets / running neural nets on homomorphically encrypted data
* (42:10) Polynomial approximations, activation functions, and expressivity
* (44:32) Scaling up, Llama 2 inference on encrypted data
* (46:10) Transitioning between MSR and FAIR, industry research
* (52:45) An efficient algorithm for integer lattice reduction (AI4Crypto)
* (56:23) Local minima, convergence and limit guarantees, scaling
* (58:27) SALSA: Attacking Lattice Cryptography with Transformers
* (58:38) Learning With Errors (LWE) vs. standard ML assumptions
* (1:02:25) Powers of small primes and faster learning
* (1:04:35) LWE and linear regression on a torus
* (1:07:30) Secret recovery algorithms and transformer accuracy
* (1:09:10) Interpretability / encoding information about secrets
* (1:09:45) Future work / scaling up
* (1:12:08) Reflections on working as a mathematician among technologists
Links:
* Kristin's Meta, Wikipedia, Google Scholar, and Twitter pages
* Papers and sources mentioned/referenced:
* The Advantages of Elliptic Curve Cryptography for Wireless Security (2004)
* Cryptographic Hash Functions from Expander Graphs (2007, introducing Supersingular Isogeny Graphs)
* Families of Ramanujan Graphs and Quaternion Algebras (2008 — the higher-dimensional analogues of Supersingular Isogeny Graphs)
* Cryptographic Cloud Storage (2010)
* Can homomorphic encryption be practical? (2011)
* ML Confidential: Machine Learning on Encrypted Data (2012)
* CryptoNets: Applying neural networks to encrypted data with high throughput and accuracy (2016)
* A community effort to protect genomic data sharing, collaboration and outsourcing (2017)
* The Homomorphic Encryption Standard (2022)
* Private AI: Machine Learning on Encrypted Data (2022)
* SALSA: Attacking Lattice Cryptography with Transformers (2022)
* SalsaPicante: A Machine Learning Attack on LWE with Binary Secrets
* SALSA VERDE: a machine learning attack on LWE with sparse small secrets
* Salsa Fresca: Angular Embeddings and Pre-Training for ML Attacks on Learning With Errors
* The cool and the cruel: separating hard parts of LWE secrets
* An efficient algorithm for integer lattice reduction (2023)
Get full access to The Gradient at thegradientpub.substack.com/subscribe

The Nonlinear Library
AF - Compact Proofs of Model Performance via Mechanistic Interpretability by Lawrence Chan

The Nonlinear Library

Play Episode Listen Later Jun 24, 2024 12:47


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Compact Proofs of Model Performance via Mechanistic Interpretability, published by Lawrence Chan on June 24, 2024 on The AI Alignment Forum.
We recently released a paper on using mechanistic interpretability to generate compact formal guarantees on model performance. In this companion blog post to our paper, we'll summarize the paper and flesh out some of the motivation and inspiration behind our work.
Paper abstract
In this work, we propose using mechanistic interpretability - techniques for reverse engineering model weights into human-interpretable algorithms - to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-K task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
Introduction
One hope for interpretability is that as we get AGI, we'll be able to use increasingly capable automation to accelerate the pace at which we can interpret ever more powerful models. These automatically generated interpretations need to satisfy two criteria:
1. Compression: Explanations compress the particular behavior of interest - not just so that it fits in our heads, but also so that it generalizes well and is feasible to find and check.
2. Correspondence (or faithfulness): Explanations must accurately reflect the actual model mechanisms we aim to explain, allowing us to confidently constrain our models for guarantees or other practical applications.
Progress happens best when there are clear and unambiguous targets and quantitative metrics. For correspondence, the field has developed increasingly targeted metrics for measuring performance: ablations, patching, and causal scrubbing. In our paper, we use mathematical proof to ensure correspondence, and present proof length as the first quantitative measure of explanation compression that is theoretically grounded, less subject to human judgement, and avoids trivial Goodharting. We see our core contributions in the paper as:
1. We push informal mechanistic interpretability arguments all the way to proofs of generalization bounds on toy transformers trained on the Max-of-K task. This is a first step in getting formal guarantees about global properties of specific models, which is the approach of post-hoc mechanistic interpretability.
2. We introduce compactness of proof as a metric on explanation compression. We find that compactifying proofs requires deeper understanding of model behavior, and more compact proofs of the same bound necessarily encode more understanding of the model.
3.
It is a common intuition that "proofs are hard for neural networks", and we flesh this intuition out as the problem of efficiently reasoning about structureless noise, which is an artifact of explanations being lossy approximations of the model's learned weights. While we believe that the proofs themselves (and in particular our proof which achieves a length that is linear in the number of model parameters for the parts of the model we understand adequately) may be of particular interest to those interested in guarantees, we believe that the insights about explanation compression from this methodology and our results are applicable more broadly to the field of mechanistic interpretability. Cor...
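For a sense of what the "trivial" end of the proof-length spectrum looks like, here is a sketch of the brute-force baseline that compact, mechanistic proofs are compared against: enumerate every possible input to a small Max-of-K model and check its answer. The `model` interface (returning logits of shape [batch, position, vocab]) and the sizes are assumptions for illustration, not the paper's setup.

```python
import itertools
import torch

def certified_accuracy_bruteforce(model, d_vocab: int, k: int) -> float:
    # The trivial certificate: check the model on every one of the
    # d_vocab ** k possible input sequences. The resulting accuracy is exact,
    # but the "proof" size grows exponentially in k, which is exactly what
    # compact proofs built from mechanistic understanding try to avoid.
    correct, total = 0, 0
    for seq in itertools.product(range(d_vocab), repeat=k):
        logits = model(torch.tensor([seq]))[0, -1]   # assumed: final-position logits
        correct += int(logits.argmax().item() == max(seq))
        total += 1
    return correct / total

# Usage sketch: certified_accuracy_bruteforce(tiny_transformer, d_vocab=64, k=4)
# already requires ~16.8 million forward passes.
```

The paper's contribution is to replace this enumeration with arguments about the model's weights, and to treat how much shorter those arguments are as a measure of understanding.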

The Case for Cautious AI Optimism, from the Consistently Candid podcast

Play Episode Listen Later Jun 19, 2024 107:40


Dive into an accessible discussion on AI safety and philosophy, technical AI safety progress, and why catastrophic outcomes aren't inevitable. This conversation provides practical advice for AI newcomers and hope for a positive future. Consistently Candid Podcast: https://open.spotify.com/show/1EX89qABpb4pGYP1JLZ3BB SPONSORS: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist. Recommended Podcast: Byrne Hobart, the writer of The Diff, is revered in Silicon Valley. You can get an hour with him each week. See for yourself how his thinking can upgrade yours. Spotify: https://open.spotify.com/show/6rANlV54GCARLgMOtpkzKt Apple: https://podcasts.apple.com/us/podcast/the-riff-with-byrne-hobart-and-erik-torenberg/id1716646486 CHAPTERS: (00:00:00) About the Show (00:03:50) Intro (00:08:13) AI Scouting (00:14:42) Why aren't people adopting AI more quickly? (00:18:25) Why don't people take advantage of AI? (00:22:35) Sponsors: Oracle | Brave (00:24:42) How to get a better understanding of AI (00:31:16) How to handle the public discourse around AI (00:34:02) Scaling and research (00:43:18) Sponsors: Omneky | Squad (00:45:03) The pause (00:47:29) Algorithmic efficiency (00:52:52) Red Teaming in Public (00:55:41) Deepfakes (01:01:02) AI safety (01:04:00) AI moderation (01:07:03) Why not a doomer (01:09:10) AI understanding human values (01:15:00) Interpretability research (01:18:30) AI safety leadership (01:21:55) AI safety respectability politics (01:33:42) China (01:37:22) Radical uncertainty (01:39:53) P(doom) (01:42:30) Where to find the guest (01:44:48) Outro

Artificial General Intelligence (AGI) Show with Soroush Pour
Ep 14 - Interp, latent robustness, RLHF limitations w/ Stephen Casper (PhD AI researcher, MIT)

Artificial General Intelligence (AGI) Show with Soroush Pour

Play Episode Listen Later Jun 19, 2024 162:17


We speak with Stephen Casper, or "Cas" as his friends call him. Cas is a PhD student at MIT in the Computer Science (EECS) department, in the Algorithmic Alignment Group advised by Prof Dylan Hadfield-Menell. Formerly, he worked with the Harvard Kreiman Lab and the Center for Human-Compatible AI (CHAI) at Berkeley. His work focuses on better understanding the internal workings of AI models (better known as “interpretability”), making them robust to various kinds of adversarial attacks, and calling out the current technical and policy gaps when it comes to making sure our future with AI goes well. He's particularly interested in finding automated ways of finding & fixing flaws in how deep neural nets handle human-interpretable concepts.
We talk to Stephen about:
* His technical AI safety work in the areas of:
  * Interpretability
  * Latent attacks and adversarial robustness
  * Model unlearning
  * The limitations of RLHF
* Cas' journey to becoming an AI safety researcher
* How he thinks the AI safety field is going and whether we're on track for a positive future with AI
* Where he sees the biggest risks coming with AI
* Gaps in the AI safety field that people should work on
* Advice for early career researchers
Hosted by Soroush Pour. Follow me for more AGI content:
Twitter: https://twitter.com/soroushjp
LinkedIn: https://www.linkedin.com/in/soroushjp/
== Show links ==
-- Follow Stephen --
* Website: https://stephencasper.com/
* Email: (see Cas' website above)
* Twitter: https://twitter.com/StephenLCasper
* Google Scholar: https://scholar.google.com/citations?user=zaF8UJcAAAAJ
-- Further resources --
* Automated jailbreaks / red-teaming paper that Cas and I worked on together (2023) - https://twitter.com/soroushjp/status/1721950722626077067
* Sam Marks paper on Sparse Autoencoders (SAEs) - https://arxiv.org/abs/2403.19647
* Interpretability papers involving downstream tasks - see section 4.2 of https://arxiv.org/abs/2401.14446
* MEMIT paper on model editing - https://arxiv.org/abs/2210.07229
* Motte & bailey definition - https://en.wikipedia.org/wiki/Motte-and-bailey_fallacy
* Bomb-making papers tweet thread by Cas - https://twitter.com/StephenLCasper/status/1780370601171198246
* Paper: undoing safety with as few as 10 examples - https://arxiv.org/abs/2310.03693
* Recommended papers on latent adversarial training (LAT) -
  * https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d
  * https://arxiv.org/abs/2403.05030
* Scoping (related to model unlearning) blog post by Cas - https://www.alignmentforum.org/posts/mFAvspg4sXkrfZ7FA/deep-forgetting-and-unlearning-for-safely-scoped-llms
* Defending against failure modes using LAT - https://arxiv.org/abs/2403.05030
* Cas' systems for reading for research -
  * Follow ML Twitter
  * Use a combination of the following two search tools for new Arxiv papers:
    * https://vjunetxuuftofi.github.io/arxivredirect/
    * https://chromewebstore.google.com/detail/highlight-this-finds-and/fgmbnmjmbjenlhbefngfibmjkpbcljaj?pli=1
  * Skim a new paper or two a day + take brief notes in a searchable notes app
* Recommended people to follow to learn about how to impact the world through research -
  * Dan Hendrycks
  * Been Kim
  * Jacob Steinhardt
  * Nicolas Carlini
  * Paul Christiano
  * Ethan Perez
Recorded May 1, 2024

The Nonlinear Library
LW - Rational Animations' intro to mechanistic interpretability by Writer

The Nonlinear Library

Play Episode Listen Later Jun 15, 2024 16:06


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Rational Animations' intro to mechanistic interpretability, published by Writer on June 15, 2024 on LessWrong.
In our new video, we talk about research on interpreting InceptionV1, a convolutional neural network. Researchers have been able to understand the function of neurons and channels inside the network and uncover visual processing algorithms by looking at the weights. The work on InceptionV1 is early but landmark mechanistic interpretability research, and it functions well as an introduction to the field. We also go into the rationale and goals of the field and mention some more recent research near the end. Our main source material is the circuits thread in the Distill journal and this article on feature visualization. The author of the script is Arthur Frost. I have included the script below, although I recommend watching the video since the script has been written with accompanying moving visuals in mind.
Intro
In 2018, researchers trained an AI to find out if people were at risk of heart conditions based on pictures of their eyes, and somehow the AI also learned to tell people's biological sex with incredibly high accuracy. How? We're not entirely sure. The crazy thing about deep learning is that you can give an AI a set of inputs and outputs, and it will slowly work out for itself what the relationship between them is. We didn't teach AIs how to play chess, Go, and Atari games by showing them human experts - we taught them how to work it out for themselves. And the issue is, now they have worked it out for themselves, and we don't know what it is they worked out.
Current state-of-the-art AIs are huge. Meta's largest LLaMA2 model uses 70 billion parameters spread across 80 layers, all doing different things. It's deep learning models like these which are being used for everything from hiring decisions to healthcare and criminal justice to what YouTube videos get recommended. Many experts believe that these models might even one day pose existential risks. So as these automated processes become more widespread and significant, it will really matter that we understand how these models make choices.
The good news is, we've got a bit of experience uncovering the mysteries of the universe. We know that humans are made up of trillions of cells, and by investigating those individual cells we've made huge advances in medicine and genetics. And learning the properties of the atoms which make up objects has allowed us to develop modern material science and high-precision technology like computers. If you want to understand a complex system with billions of moving parts, sometimes you have to zoom in. That's exactly what Chris Olah and his team did starting in 2015. They focused on small groups of neurons inside image models, and they were able to find distinct parts responsible for detecting everything from curves and circles to dog heads and cars.
In this video we'll:
Briefly explain how (convolutional) neural networks work
Visualise what individual neurons are doing
Look at how neurons - the most basic building blocks of the neural network - combine into 'circuits' to perform tasks
Explore why interpreting networks is so hard
There will also be lots of pictures of dogs, like this one. Let's get going. We'll start with a brief explanation of how convolutional neural networks are built.
Here's a network that's trained to label images. An input image comes in on the left, and it flows along through the layers until we get an output on the right - the model's attempt to classify the image into one of the categories. This particular model is called InceptionV1, and the images it's learned to classify are from a massive collection called ImageNet. ImageNet has 1000 different categories of image, like "sandal" and "saxophone" and "sarong" (which, if you don't know, is a k...
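The "visualise what individual neurons are doing" step mentioned in the outline above is typically done by activation maximisation: optimise an input image so that a chosen unit fires strongly. A minimal sketch, using torchvision's GoogLeNet as a stand-in for InceptionV1; the layer and channel choices are arbitrary and no regularisation is applied, so real feature-visualisation pipelines produce much cleaner images.

```python
import torch
import torchvision.models as models

# GoogLeNet is torchvision's re-implementation of the InceptionV1 architecture.
model = models.googlenet(weights=models.GoogLeNet_Weights.DEFAULT).eval()

acts = {}
def hook(module, inputs, output):
    acts["out"] = output                       # feature maps of the hooked layer
model.inception4a.register_forward_hook(hook)  # layer choice is arbitrary

channel = 97                                   # channel choice is arbitrary
img = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([img], lr=0.05)

for _ in range(256):
    opt.zero_grad()
    model(img)
    loss = -acts["out"][0, channel].mean()     # maximise the channel's mean activation
    loss.backward()
    opt.step()
# `img` now roughly shows what excites that channel; published visualisations add
# transformations (jitter, scaling, blur) and frequency penalties for clarity.
```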

Deep Papers
LLM Interpretability and Sparse Autoencoders: Research from OpenAI and Anthropic

Deep Papers

Play Episode Listen Later Jun 14, 2024 44:00


It's been an exciting couple weeks for GenAI! Join us as we discuss the latest research from OpenAI and Anthropic. We're excited to chat about this significant step forward in understanding how LLMs work and the implications it has for deeper understanding of the neural activity of language models. We take a closer look at some recent research from both OpenAI and Anthropic. These two recent papers both focus on the sparse autoencoder--an unsupervised approach for extracting interpretable features from an LLM.  In "Extracting Concepts from GPT-4," OpenAI researchers propose using k-sparse autoencoders to directly control sparsity, simplifying tuning and improving the reconstruction-sparsity frontier. In "Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet," researchers at Anthropic show that scaling laws can be used to guide the training of sparse autoencoders, among other findings. To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.
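As a rough illustration of the k-sparse idea discussed in the OpenAI paper, here is a minimal sketch of a TopK sparse autoencoder: sparsity is enforced by keeping only the k largest pre-activations per example, so no L1 penalty is needed. The sizes and k below are illustrative assumptions, not the values used in either paper.

```python
import torch
import torch.nn as nn

class TopKSparseAutoencoder(nn.Module):
    # Sketch only: encode an activation vector into a much larger feature space,
    # keep the k strongest features, and reconstruct the original activation.
    def __init__(self, d_model: int = 768, d_dict: int = 32768, k: int = 32):
        super().__init__()
        self.k = k
        self.enc = nn.Linear(d_model, d_dict)
        self.dec = nn.Linear(d_dict, d_model)

    def forward(self, x: torch.Tensor):
        pre = self.enc(x)
        topk = torch.topk(pre, self.k, dim=-1)
        # Zero everything except the top-k entries; sparsity is exact by construction.
        feats = torch.zeros_like(pre).scatter(-1, topk.indices, torch.relu(topk.values))
        return self.dec(feats), feats

sae = TopKSparseAutoencoder()
acts = torch.randn(4, 768)              # stand-in for residual-stream activations
recon, feats = sae(acts)
loss = (recon - acts).pow(2).mean()     # plain reconstruction loss; no sparsity term
loss.backward()
```

Anthropic's setup instead uses an L1 penalty and studies how scaling laws guide SAE training; the bottleneck-and-dictionary structure is the same.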

Mixture of Experts
Episode 7: Apple's WWDC24 reactions and mechanistic interpretability

Mixture of Experts

Play Episode Listen Later Jun 14, 2024 39:41


In Episode 7 of Mixture of Experts, host Tim Hwang is joined by Shobhit Varshney, Skyler Speakman, and Kaoutar El Maghaoui. Today, the experts react to Apple's WWDC24 announcements. Is Apple late to the AI game? Then, part 2 on interpretability this week, as OpenAI released their study on mechanistic interpretability. The opinions expressed in this podcast are solely those of the participants and do not necessarily reflect the views of IBM or any other organization or entity.

The MLOps Podcast

In this episode, I chat with Ljubomir Buturovic, VP of ML and Informatics at Inflammatix. We discuss using ML to diagnose infections from blood tests in the emergency room. We dive into the challenges of building diagnostic (classification) and prognostic (predictive) models, with takeaways related to building datasets for production use cases. Join our Discord community: https://discord.gg/tEYvqxwhah --- Timestamps: 00:00 What is Inflammatix and how do they use ML 7:32 Edge Device Deployment: The Future of Model Deployment 21:16 Navigating Regulatory Submission for Medical Products 26:01 Evolution of Regulatory Processes in ML for Medical Applications 30:18 Challenges and Solutions in ML for Medical Applications 34:00 The Future of AI in Clinical Care 40:25 The Overrated Concept of Interpretability in AI and ML 45:32 Recommendations
Links

Your Undivided Attention
Former OpenAI Engineer William Saunders on Silence, Safety, and the Right to Warn

Your Undivided Attention

Play Episode Listen Later Jun 7, 2024 37:47


This week, a group of current and former employees from OpenAI and Google DeepMind penned an open letter accusing the industry's leading companies of prioritizing profits over safety. This comes after a spate of high-profile departures from OpenAI, including co-founder Ilya Sutskever and senior researcher Jan Leike, as well as reports that OpenAI has gone to great lengths to silence would-be whistleblowers. The writers of the open letter argue that researchers have a “right to warn” the public about AI risks and laid out a series of principles that would protect that right. In this episode, we sit down with one of those writers: William Saunders, who left his job as a research engineer at OpenAI in February. William is now breaking the silence on what he saw at OpenAI that compelled him to leave the company and to put his name to this letter.
RECOMMENDED MEDIA
The Right to Warn Open Letter
My Perspective On "A Right to Warn about Advanced Artificial Intelligence": a follow-up from William about the letter
Leaked OpenAI documents reveal aggressive tactics toward former employees: an investigation by Vox into OpenAI's policy of non-disparagement.
RECOMMENDED YUA EPISODES
A First Step Toward AI Regulation with Tom Wheeler
Spotlight on AI: What Would It Take For This to Go Well?
Big Food, Big Tech and Big AI with Michael Moss
Can We Govern AI? with Marietje Schaake
Your Undivided Attention is produced by the Center for Humane Technology. Follow us on Twitter: @HumaneTech_

The Road to Accountable AI
Scott Zoldi: Using the Best AI Tools for the Job

The Road to Accountable AI

Play Episode Listen Later Jun 6, 2024 31:12 Transcription Available


Kevin Werbach speaks with Scott Zoldi of FICO, which pioneered consumer credit scoring in the 1950s and now offers a suite of analytics and fraud detection tools. Zoldi explains the importance of transparency and interpretability in AI models, emphasizing a “simpler is better” approach to creating clear and understandable algorithms. He discusses FICO's approach to responsible AI, which includes establishing model governance standards and enforcing these standards through the use of blockchain technology. Zoldi explains how blockchain provides an immutable record of the model development process, enhancing accountability and trust. He also highlights the challenges organizations face in implementing responsible AI practices, particularly in light of upcoming AI regulations, and stresses the need for organizations to catch up in defining governance standards to ensure trustworthy and accountable AI models. Dr. Scott Zoldi is Chief Analytics Officer of FICO, responsible for analytics and AI innovation across FICO's portfolio. He has authored more than 130 patents, and is a long-time advocate and inventor in the space of responsible AI. He was nominated for American Banker's 2024 Innovator Award and received Corinium's Future Thinking Award in 2022. Zoldi is a member of the Board of Advisors for FinReg Lab, and serves on the Boards of Directors of Software San Diego and San Diego Cyber Center of Excellence. He received his Ph.D. in theoretical and computational physics from Duke University.
Navigating the Wild AI with Dr. Scott Zoldi
How to Use Blockchain to Build Responsible AI
The State of Responsible AI in Financial Services

The Nonlinear Library
AF - [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models" by David Scott Krueger

The Nonlinear Library

Play Episode Listen Later Jun 6, 2024 11:47


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models", published by David Scott Krueger on June 6, 2024 on The AI Alignment Forum.
We've recently released a comprehensive research agenda on LLM safety and alignment. This is a collaborative work with contributions from 35+ authors across the fields of AI Safety, machine learning, and NLP. Major credit goes to first author Usman Anwar, a 2nd-year PhD student of mine who conceived and led the project and did a large portion of the research, writing, and editing. This blogpost was written only by David and Usman and may not reflect the views of other authors.
I believe this work will be an excellent reference for anyone new to the field, especially those with some background in machine learning; a paradigmatic example reader we had in mind when writing would be a first-year PhD student who is new to LLM safety/alignment. Note that the agenda is not focused on AI existential safety, although I believe that there is a considerable and growing overlap between mainstream LLM safety/alignment and topics relevant to AI existential safety. Our work covers 18 topics, grouped into 3 high-level categories.
Why you should (maybe) read (part of) our agenda
The purpose of this post is to inform the Alignment Forum (AF) community of our work and encourage members of this community to consider engaging with it. A brief case for doing so:
It includes over 200 concrete research directions, which might provide useful inspiration.
We believe it provides comprehensive coverage of relevant topics at the intersection of safety and mainstream ML. We cover a much broader range of topics than typically receive attention on AF. AI Safety researchers - especially more junior researchers working on LLMs - are clustering around a few research agendas or problems (e.g. mechanistic interpretability, scalable oversight, jailbreaking). This seems suboptimal: given the inherent uncertainty in research, it is important to pursue diverse research agendas. We hope that this work can improve accessibility to otherwise neglected research problems, and help diversify the research agendas the community is following.
Engaging with and understanding the broader ML community - especially the parts of the ML community working on AI Safety-relevant problems - can be helpful for increasing your work's novelty, rigor, and impact. By reading our agenda, you can better understand the machine-learning community and discover relevant research being done in that community.
We are interested in feedback from the AF community and believe your comments on this post could help inform the research we and others in the ML and AF communities do.
Topics of particular relevance to the Alignment Forum community:
Critiques of interpretability (Section 3.4)
Interpretability is among the most popular research areas in the AF community, but I believe there is an unwarranted level of optimism around it. The field faces fundamental methodological challenges. Existing works often do not have a solid method of evaluating the validity of an interpretation, and scaling such evaluations seems challenging and potentially intractable. It seems likely that AI systems simply do not share human concepts, and at best have warped versions of them (as evidenced by adversarial examples).
In this case, AI systems may simply not be interpretable, even given the best imaginable tools. In my experience, ML researchers are more skeptical and pessimistic about interpretability for reasons such as the above and a history of past mistakes. I believe the AF community should engage more with previous work in ML in order to learn from prior mistakes and missteps, and our agenda will provide useful background and references. This section also has lots of di...

Sway
Google Eats Rocks + A Win for A.I. Interpretability + Safety Vibe Check

Sway

Play Episode Listen Later May 31, 2024 79:20


This week, Google found itself in more turmoil, this time over its new AI Overviews feature and a trove of leaked internal documents. Then Josh Batson, a researcher at the A.I. startup Anthropic, joins us to explain how an experiment that made the chatbot Claude obsessed with the Golden Gate Bridge represents a major breakthrough in understanding how large language models work. And finally, we take a look at recent developments in A.I. safety, after Casey's early access to OpenAI's new souped-up voice assistant was taken away for safety reasons.
Guests: Josh Batson, research scientist at Anthropic
Additional Reading:
Google's A.I. Search Errors Cause a Furor Online
Google Confirms the Leaked Search Documents are Real
Mapping the Mind of a Large Language Model
A.I. Firms Mustn't Govern Themselves, Say Ex-Members of OpenAI's Board
We want to hear from you. Email us at hardfork@nytimes.com. Find “Hard Fork” on YouTube and TikTok.

The Nonlinear Library
LW - Anthropic announces interpretability advances. How much does this advance alignment? by Seth Herd

The Nonlinear Library

Play Episode Listen Later May 22, 2024 6:06


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Anthropic announces interpretability advances. How much does this advance alignment?, published by Seth Herd on May 22, 2024 on LessWrong.
Anthropic just published a pretty impressive set of results in interpretability. This raises, for me, some questions and a concern: Interpretability helps, but it isn't alignment, right? It seems to me as though the vast bulk of alignment funding is now going to interpretability. Who is thinking about how to leverage interpretability into alignment?
It intuitively seems as though we are better off the more we understand the cognition of foundation models. I think this is true, but there are sharp limits: it will be impossible to track the full cognition of an AGI, and simply knowing what it's thinking about will be inadequate to know whether it's making plans you like. One can think about bioweapons, for instance, either to produce them or to prevent producing them. More on these at the end; first, a brief summary of their results.
In this work, they located interpretable features in Claude 3 Sonnet using sparse autoencoders, and manipulated model behavior using those features as steering vectors. They find features for subtle concepts; they highlight features for:
The Golden Gate Bridge 34M/31164353: descriptions of or references to the Golden Gate Bridge.
Brain sciences 34M/9493533: discussions of neuroscience and related academic research on brains or minds.
Monuments and popular tourist attractions 1M/887839.
Transit infrastructure 1M/3. [links to examples]
... We also find more abstract features - responding to things like bugs in computer code, discussions of gender bias in professions, and conversations about keeping secrets.
...we found features corresponding to:
Capabilities with misuse potential (code backdoors, developing biological weapons)
Different forms of bias (gender discrimination, racist claims about crime)
Potentially problematic AI behaviors (power-seeking, manipulation, secrecy)
Presumably, the existence of such features will surprise nobody who's used and thought about large language models. It is difficult to imagine how they would do what they do without using representations of subtle and abstract concepts. They used the dictionary learning approach, and found distributed representations of features:
Our general approach to understanding Claude 3 Sonnet is based on the linear representation hypothesis and the superposition hypothesis from the publication Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.
Or to put it more plainly: it turns out that each concept is represented across many neurons, and each neuron is involved in representing many concepts. Representations in the brain definitely follow that description, and the structure of representations seems pretty similar as far as we can guess from animal studies and limited data on human language use. They also include a fascinating image of near neighbors to the feature for internal conflict (see header image).
So, back to the broader question: it is clear how this type of interpretability helps with AI safety: being able to monitor when the model is activating features for things like bioweapons, and to use those features as steering vectors, can help control the model's behavior. It is not clear to me how this generalizes to AGI. And I am concerned that too few of us are thinking about this.
It seems pretty apparent how detecting lying will dramatically help in pretty much any conceivable plan for technical alignment of AGI. But it seems like being able to monitor an entire thought process of a being smarter than us is impossible on the face of it. I think the hope is that we can detect and monitor cognition that is about dangerous topics, so we don't need to follow its full train of thought. If we can tell what an AGI is thinking ...
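The "steering vector" use described above amounts to adding a scaled feature direction into the residual stream during a forward pass. Here is a minimal sketch of that mechanic, assuming a GPT-2 model from Hugging Face transformers and a random vector standing in for a learned SAE feature direction; the layer index and scale are arbitrary.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Hypothetical stand-in for a feature's decoder direction; in the Anthropic work
# this would be the learned vector for, e.g., the Golden Gate Bridge feature.
feature_dir = torch.randn(model.config.n_embd)
feature_dir /= feature_dir.norm()
scale = 8.0                                    # steering strength (arbitrary)

def steer(module, inputs, output):
    hidden = output[0] + scale * feature_dir   # add the direction at every position
    return (hidden,) + output[1:]

handle = model.transformer.h[6].register_forward_hook(steer)  # layer choice arbitrary
ids = tok("I went for a walk and saw", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20, do_sample=False)[0]))
handle.remove()
```

With a real feature direction, generations drift toward that concept; with this random stand-in, the point is only to show where the vector gets injected and why monitoring the same activations is comparatively cheap.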

The Nonlinear Library
AF - Announcing Human-aligned AI Summer School by Jan Kulveit

The Nonlinear Library

Play Episode Listen Later May 22, 2024 2:47


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Human-aligned AI Summer School, published by Jan Kulveit on May 22, 2024 on The AI Alignment Forum.
The fourth Human-aligned AI Summer School will be held in Prague from 17th to 20th July 2024. We will meet for four intensive days of talks, workshops, and discussions covering the latest trends in AI alignment research and broader framings of AI alignment research. Apply now; applications are evaluated on a rolling basis. The intended audience of the school is people interested in learning more about AI alignment topics, PhD students, researchers working in ML/AI outside academia, and talented students.
Format of the school
The school is focused on teaching and exploring approaches and frameworks, less on presentation of the latest research results. The content of the school is mostly technical - it is assumed the attendees understand current ML approaches and some of the underlying theoretical frameworks. This year, the school will cover these main topics:
Overview of the alignment problem and current approaches.
Alignment of large language models: RLHF, DPO and beyond. Methods used to align current large language models and their shortcomings.
Evaluating and measuring AI systems: how to understand and oversee current AI systems on the behavioral level.
Interpretability and the science of deep learning: what's going on inside of the models?
AI alignment theory: while 'prosaic' approaches to alignment focus on current systems, theory aims for deeper understanding and better generalizability.
Alignment in the context of complex systems and multi-agent settings: what should the AI be aligned to? In most realistic settings, we can expect there are multiple stakeholders and many interacting AI systems; any solution to the alignment problem needs to handle multi-agent settings.
The school consists of lectures and topical series, focused smaller-group workshops and discussions, expert panels, and opportunities for networking, project brainstorming and informal discussions. The detailed program of the school will be announced shortly before the event. See below for a program outline and e.g. the program of the previous school for an illustration of the program content and structure.
Confirmed speakers
Stephen Casper - Algorithmic Alignment Group, MIT. Stanislav Fort - Google DeepMind. Jesse Hoogland - Timaeus. Jan Kulveit - Alignment of Complex Systems, Charles University. Mary Phuong - Google DeepMind. Deger Turan - AI Objectives Institute and Metaculus. Vikrant Varma - Google DeepMind. Neel Nanda - Google DeepMind. (more to be announced later)
Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

Unsupervised Learning
Bonus Episode: Sam Altman (CEO, OpenAI) Talks GPT-4o and Predicts the Future of AI

Unsupervised Learning

Play Episode Listen Later May 20, 2024 53:37


In this cross-over episode, Sam Altman sat down with Logan on the day of the GPT-4o announcement to share behind-the-scenes details of the launch and offer his predictions for the future of AI. Altman delves into OpenAI's vision, discusses the timeline for achieving AGI, and explores the societal impact of humanoid robots. He also expresses his excitement and concerns about AI personal assistants, highlights the biggest opportunities and risks in the AI landscape today, and much more. (0:00) Intro (00:41) The Personal Impact of Leading OpenAI (01:35) Unveiling Multimodal AI: A Leap in Technology (02:38) The Surprising Use Cases and Benefits of Multimodal AI (03:14) Behind the Scenes: Making Multimodal AI Possible (08:27) Envisioning the Future of AI in Communication and Creativity (10:12) The Business of AI: Monetization, Open Source, and Future Directions (16:33) AI's Role in Shaping Future Jobs and Experiences (20:20) Debunking AGI: A Continuous Journey Towards Advanced AI (23:55) Exploring the Pace of Scientific and Technological Progress (24:09) The Importance of Interpretability in AI (25:02) Navigating AI Ethics and Regulation (27:17) The Safety Paradigm in AI and Beyond (28:46) Personal Reflections and the Impact of AI on Society (29:02) The Future of AI: Fast Takeoff Scenarios and Societal Changes (30:50) Navigating Personal and Professional Challenges (40:12) The Role of AI in Creative and Personal Identity (43:00) Educational System Adaptations for the AI Era (44:21) Contemplating the Future with Advanced AI (45:21) Jacob and Pat Debrief
With your co-hosts: @jacobeffron - Partner at Redpoint, Former PM Flatiron Health; @patrickachase - Partner at Redpoint, Former ML Engineer LinkedIn; @ericabrescia - Former COO Github, Founder Bitnami (acq'd by VMWare); @jordan_segall - Partner at Redpoint

Three Cartoon Avatars
EP 104: Sam Altman (CEO, OpenAI) talks GPT-4o and Predicts the Future of AI

Three Cartoon Avatars

Play Episode Listen Later May 14, 2024 46:30


On the day of the GPT-4o announcement, Sam Altman sat down to share behind-the-scenes details of the launch and offer his predictions for the future of AI. Altman delves into OpenAI's vision, discusses the timeline for achieving AGI, and explores the societal impact of humanoid robots. He also expresses his excitement and concerns about AI personal assistants, highlights the biggest opportunities and risks in the AI landscape today, and much more. (00:00) Intro (00:50) The Personal Impact of Leading OpenAI (01:44) Unveiling Multimodal AI: A Leap in Technology (02:47) The Surprising Use Cases and Benefits of Multimodal AI (03:23) Behind the Scenes: Making Multimodal AI Possible (08:36) Envisioning the Future of AI in Communication and Creativity (10:21) The Business of AI: Monetization, Open Source, and Future Directions (16:42) AI's Role in Shaping Future Jobs and Experiences (20:29) Debunking AGI: A Continuous Journey Towards Advanced AI (24:04) Exploring the Pace of Scientific and Technological Progress (24:18) The Importance of Interpretability in AI (25:11) Navigating AI Ethics and Regulation (27:26) The Safety Paradigm in AI and Beyond (28:55) Personal Reflections and the Impact of AI on Society (29:11) The Future of AI: Fast Takeoff Scenarios and Societal Changes (30:59) Navigating Personal and Professional Challenges (40:21) The Role of AI in Creative and Personal Identity (43:09) Educational System Adaptations for the AI Era (44:30) Contemplating the Future with Advanced AI
Executive Producer: Rashad Assir; Producer: Leah Clapper; Mixing and editing: Justin Hrabovsky
Check out Unsupervised Learning, Redpoint's AI Podcast: https://www.youtube.com/@UCUl-s_Vp-Kkk_XVyDylNwLA

The Nonlinear Library
AF - Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda

The Nonlinear Library

Play Episode Listen Later May 3, 2024 1:30


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mechanistic Interpretability Workshop Happening at ICML 2024!, published by Neel Nanda on May 3, 2024 on The AI Alignment Forum. Announcing the first academic Mechanistic Interpretability workshop, held at ICML 2024! We'd love to get papers submitted if any of you have relevant projects! Deadline: May 29, max 4 or max 8 pages. We welcome anything that brings us closer to a principled understanding of model internals, even if it's not "traditional" mech interp. Check out our website for example topics! There's $1750 in best paper prizes. We also welcome less standard submissions, like open source software, models or datasets, negative results, distillations, or position pieces. And if anyone is attending ICML, you'd be very welcome at the workshop! We have a great speaker line-up: Chris Olah, Jacob Steinhardt, David Bau and Asma Ghandeharioun. And a panel discussion, hands-on tutorial, and social. I'm excited to meet more people into mech interp! And if you know anyone who might be interested in attending/submitting, please pass this on. Twitter thread, Website. Thanks to my great co-organisers: Fazl Barez, Lawrence Chan, Kayo Yin, Mor Geva, Atticus Geiger and Max Tegmark. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library
AF - Take SCIFs, it's dangerous to go alone by latterframe

The Nonlinear Library

Play Episode Listen Later May 1, 2024 6:04


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Take SCIFs, it's dangerous to go alone, published by latterframe on May 1, 2024 on The AI Alignment Forum. Coauthored by Dmitrii Volkov1, Christian Schroeder de Witt2, Jeffrey Ladish1 (1Palisade Research, 2University of Oxford). We explore how frontier AI labs could assimilate operational security (opsec) best practices from fields like nuclear energy and construction to mitigate near-term safety risks stemming from AI R&D process compromise. Such risks in the near-term include model weight leaks and backdoor insertion, and loss of control in the longer-term. We discuss the Mistral and LLaMA model leaks as motivating examples and propose two classic opsec mitigations: performing AI audits in secure reading rooms (SCIFs) and using locked-down computers for frontier AI research. Mistral model leak In January 2024, a high-quality 70B LLM leaked from Mistral. Reporting suggests the model leaked through an external evaluation or product design process. That is, Mistral shared the full model with a few other companies and one of their employees leaked the model. Then there's LLaMA which was supposed to be slowly released to researchers and partners, and leaked on 4chan a week after the announcement[1], sparking a wave of open LLM innovation. Potential industry response Industry might respond to incidents like this[2] by providing external auditors, evaluation organizations, or business partners with API access only, maybe further locking it down with query / download / entropy limits to prevent distillation. This mitigation is effective in terms of preventing model leaks, but is too strong - blackbox AI access is insufficient for quality audits. Blackbox methods tend to be ad-hoc, heuristic and shallow, making them unreliable in finding adversarial inputs and biases and limited in eliciting capabilities. Interpretability work is almost impossible without gradient access. So we are at an impasse - we want to give auditors weights access so they can do quality audits, but this risks the model getting leaked. Even if eventual leaks might not be preventable, at least we would wish to delay leakage for as long as possible and practice defense in depth. While we are currently working on advanced versions of rate limiting involving limiting entropy / differential privacy budget to allow secure remote model access, in this proposal we suggest something different: importing physical opsec security measures from other high-stakes fields. SCIFs / secure reading rooms Aerospace, nuclear, intelligence and other high-stakes fields routinely employ special secure facilities for work with sensitive information. Entering the facility typically requires surrendering your phone and belongings; the facility is sound- and EM-proofed and regularly inspected for any devices left inside; it has armed guards. This design makes it hard to get any data out while allowing full access inside, which fits the audit use case very well. An emerging field of deep learning cryptography aims to cover some of the same issues SCIFs address; however, scaling complex cryptography to state-of-the-art AI is an open research question. SCIFs are a simple and robust technology that gives a lot of security for a little investment. Just how little? There are two main costs to SCIFs: maintenance and inconvenience. First, a SCIF must be built and maintained[3]. 
Second, it's less convenient for an auditor to work from a SCIF than from the comfort of their home[4]. Our current belief is that SCIFs can easily be cost-effective if placed in AI hubs and universities[5]; we defer concrete cost analysis to future work.
Locked-down laptops
SCIFs are designed to limit unintended information flow: auditors are free to work as they wish inside, but can't take information stores like paper or flash drives in or out. A softer physica...
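The "query / download / entropy limits" the authors mention are easy to picture with a toy example. The sketch below is not from the post; the class, thresholds, and auditor IDs are hypothetical, and it only illustrates the general idea of metering how much model output an external auditor can extract before access is cut off.

```python
# Hypothetical sketch of per-auditor output budgeting for an audit API.
# Nothing here comes from the post; names and numbers are illustrative only.
import math
from collections import defaultdict


class AuditBudget:
    """Tracks how much model output each external auditor has extracted."""

    def __init__(self, max_output_tokens=1_000_000, max_entropy_bits=5_000_000):
        self.max_output_tokens = max_output_tokens
        self.max_entropy_bits = max_entropy_bits
        self.tokens_used = defaultdict(int)
        self.entropy_used = defaultdict(float)

    def charge(self, auditor_id, token_logprobs):
        """Charge one completion against the auditor's budget.

        token_logprobs: natural-log probabilities of the tokens the model
        actually emitted; their negative sum, converted to bits, is a crude
        proxy for how much information left the lab (and could feed distillation).
        """
        entropy_bits = -sum(token_logprobs) / math.log(2)
        self.tokens_used[auditor_id] += len(token_logprobs)
        self.entropy_used[auditor_id] += entropy_bits

    def allow(self, auditor_id):
        """Return True while the auditor is under both budgets."""
        return (self.tokens_used[auditor_id] < self.max_output_tokens
                and self.entropy_used[auditor_id] < self.max_entropy_bits)


budget = AuditBudget()
if budget.allow("auditor-42"):
    # serve the request, then charge the logprobs of the returned tokens
    budget.charge("auditor-42", token_logprobs=[-0.2, -1.3, -0.05])
```

As the post argues, this kind of metered blackbox access is exactly what remains insufficient for deep audits, which is why the authors reach for physical measures like SCIFs instead.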

The Nonlinear Library
LW - Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by hugofry

The Nonlinear Library

Play Episode Listen Later Apr 30, 2024 19:44


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers, published by hugofry on April 30, 2024 on LessWrong.
Two Minute Summary
In this post I present my results from training a Sparse Autoencoder (SAE) on a CLIP Vision Transformer (ViT) using the ImageNet-1k dataset. I have created an interactive web app, 'SAE Explorer', to allow the public to explore the visual features the SAE has learnt, found here: https://sae-explorer.streamlit.app/ (best viewed on a laptop). My results illustrate that SAEs can identify sparse and highly interpretable directions in the residual stream of vision models, enabling inference-time inspection of the model's activations. To demonstrate this, I have included a 'guess the input image' game on the web app that allows users to guess the input image purely from the SAE activations of a single layer and token of the residual stream. I have also uploaded a (slightly outdated) accompanying talk on my results, primarily listing SAE features I found interesting: https://youtu.be/bY4Hw5zSXzQ.
The primary purpose of this post is to demonstrate and emphasise that SAEs are effective at identifying interpretable directions in the activation space of vision models. In this post I highlight a small number of my favourite SAE features to demonstrate some of the abstract concepts the SAE has identified within the model's representations. I then analyse a small number of SAE features using feature visualisation to check the validity of the SAE interpretations. Later in the post, I provide some technical analysis of the SAE. I identify a large cluster of features analogous to the 'ultra-low frequency' cluster that Anthropic identified. In line with existing research, I find that this ultra-low frequency cluster represents a single feature. I then analyse the 'neuron alignment' of SAE features by comparing the SAE encoder matrix to the MLP out matrix.
This research was conducted as part of the ML Alignment and Theory Scholars program 2023/2024 winter cohort. Special thanks to Joseph Bloom for providing generous amounts of his time and support (in addition to the SAE Lens code base), as well as LEAP labs for helping to produce the feature visualisations and weekly meetings with Jessica Rumbelow.
Example, animals eating other animals feature (top 16 highest activating images).
Example, Italian feature: Note that the photo of the dog has a watermark with a website ending in .it (Italy's domain name). Note also that the bottom left photo is of Italian writing. The number of ambulances present is a byproduct of using ImageNet-1k.
Motivation
Frontier AI systems are becoming increasingly multimodal, and capabilities may advance significantly as multimodality increases due to transfer learning between different data modalities and tasks. As a heuristic, consider how much intuition humans gain for the world through visual reasoning; even in abstract settings such as maths and physics, concepts are often understood most intuitively through visual reasoning. Many cutting-edge systems today such as DALL-E and Sora use ViTs trained on multimodal data. Almost by definition, AGI is likely to be multimodal. Despite this, very little effort has been made to apply and adapt our current mechanistic interpretability techniques to vision tasks or multimodal models.
I believe it is important to check that mechanistic interpretability generalises to these systems, to ensure our techniques are future-proof and can be applied to safeguard against AGI. In this post, I restrict the scope of my research to specifically investigating SAEs trained on multimodal models. The particular multimodal system I investigate is CLIP, a model trained on image-text pairs. CLIP consists of two encoders: a language model and a vision model that are trained to e...
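For readers who have not trained one, the sparse autoencoder described in the episode fits in a few lines. The sketch below is a generic SAE with an L1 sparsity penalty applied to residual-stream activations; it is not the author's code (the post credits the SAE Lens code base), and the dimensions, learning rate, and sparsity coefficient are made-up placeholders.

```python
# Minimal sparse-autoencoder sketch (not the author's implementation).
# d_model and d_sae are illustrative; the post trains on CLIP ViT
# residual-stream activations collected over ImageNet-1k.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=1024, d_sae=65536):
        super().__init__()
        self.enc = nn.Linear(d_model, d_sae)
        self.dec = nn.Linear(d_sae, d_model)

    def forward(self, x):
        # Subtracting the decoder bias before encoding is a common SAE convention.
        f = torch.relu(self.enc(x - self.dec.bias))   # sparse feature activations
        x_hat = self.dec(f)                           # reconstruction
        return x_hat, f


sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3  # trades reconstruction fidelity against sparsity


def training_step(acts):
    """acts: [batch, d_model] activations from one layer and token position."""
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coeff * f.abs().sum(dim=-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


loss = training_step(torch.randn(4096, 1024))  # stand-in batch of activations
```

The "guess the input image" game then amounts to reading off which rows of f fire for a held-out image and interpreting each firing feature by its top activating dataset examples.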

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
WebSim, WorldSim, and The Summer of Simulative AI — with Joscha Bach of Liquid AI, Karan Malhotra of Nous Research, Rob Haisfield of WebSim.ai

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later Apr 27, 2024 53:43


We are 200 people over our 300-person venue capacity for AI UX 2024, but you can subscribe to our YouTube for the video recaps. Our next event, and largest EVER, is the AI Engineer World's Fair. See you there!
Parental advisory: Adult language used in the first 10 mins of this podcast.
Any accounting of Generative AI that ends with RAG as its "final form" is seriously lacking in imagination and missing out on its full potential. While AI generation is very good for "spicy autocomplete" and "reasoning and retrieval with in-context learning", there's a lot of untapped potential for simulative AI in exploring the latent space of multiverses adjacent to ours.
GANs
Many research scientists credit the 2017 Transformer for the modern foundation model revolution, but for many artists the origin of "generative AI" traces a little further back to the Generative Adversarial Networks proposed by Ian Goodfellow in 2014, spawning an army of variants and Cats and People that do not exist. We can directly visualize the quality improvement in the decade since.
GPT-2
Of course, more recently, text generative AI started being too dangerous to release in 2019 and claiming headlines. AI Dungeon was the first to put GPT-2 to a purely creative use, replacing human dungeon masters and DnD/MUD games of yore. More recent gamelike work like the Generative Agents (aka Smallville) paper keeps exploring the potential of simulative AI for game experiences.
ChatGPT
Not long after ChatGPT broke the Internet, one of the most fascinating generative AI finds was Jonas Degrave (of DeepMind!)'s Building A Virtual Machine Inside ChatGPT. The open-ended interactivity of ChatGPT and all its successors enabled an "open world" type of simulation where "hallucination" is a feature and a gift to dance with, rather than a nasty bug to be stamped out. However, further updates to ChatGPT seemed to "nerf" the model's ability to perform creative simulations, particularly with the deprecation of the `completion` mode of APIs in favor of `chatCompletion`.
WorldSim
It is with this context that we explain WorldSim and WebSim. We recommend you watch the WorldSim demo video on our YouTube for the best context, but basically, if you are a developer, it is a Claude prompt that is a portal into another world of your own choosing, that you can navigate with bash commands that you make up. Why Claude? Hints from Amanda Askell on the Claude 3 system prompt gave some inspiration, and subsequent discoveries that Claude 3 is "less nerfed" than GPT-4 Turbo turned the growing Simulative AI community into Anthropic stans.
WebSim
This was a one-day hackathon project inspired by WorldSim that should have won. In short, you type in a URL that you made up, and Claude 3 does its level best to generate a webpage that doesn't exist, that would fit your URL. All form POST requests are intercepted and responded to, and all links lead to even more webpages, that don't exist, that are generated when you make them. All pages are cachable, modifiable and regeneratable - see WebSim for Beginners and the Advanced Guide.
In the demo I saw, we were able to "log in" to a simulation of Elon Musk's Gmail account and browse examples of emails that would have been in that universe's Elon's inbox.
It was hilarious and impressive even back then. Since then though, the project has become even more impressive, with both Siqi Chen and Dylan Field singing its praises.
Joscha Bach
Joscha actually spoke at the WebSim Hyperstition Night this week, so we took the opportunity to get his take on Simulative AI, as well as a round up of all his other AI hot takes, for his first appearance on Latent Space. You can see it together with the full 2hr uncut demos of WorldSim and WebSim on YouTube!
Timestamps
* [00:01:59] WorldSim
* [00:11:03] Websim
* [00:22:13] Joscha Bach
* [00:28:14] Liquid AI
* [00:31:05] Small, Powerful, Based Base Models
* [00:33:40] Interpretability
* [00:36:59] Devin vs WebSim
* [00:41:49] is XSim just Art? or something more?
* [00:43:36] We are past the Singularity
* [00:46:12] Uploading your soul
* [00:50:29] On Wikipedia
Transcripts
[00:00:00] AI Charlie: Welcome to the Latent Space Podcast. This is Charlie, your AI co-host. Most of the time, Swyx and Alessio cover generative AI that is meant to be used at work, and this often results in RAG applications, vertical copilots, and other AI agents and models. In today's episode, we're looking at a more creative side of generative AI that has gotten a lot of community interest this April.
[00:00:35] World Simulation, Web Simulation, and Human Simulation. Because the topic is so different than our usual, we're also going to try a new format for doing it justice. This podcast comes in three parts. First, we'll have a segment of the WorldSim demo from Nous Research CEO Karan Malhotra, recorded by SWYX at the Replicate HQ in San Francisco that went completely viral and spawned everything else you're about to hear.
[00:01:05] Second, we'll share the world's first talk from Rob Haisfield on WebSim, which started at the Mistral Cerebral Valley Hackathon, but now has gone viral in its own right with people like Dylan Field, Janus aka Repligate, and Siqi Chen becoming obsessed with it. Finally, we have a short interview with Joscha Bach of Liquid AI on why Simulative AI is having a special moment right now.
[00:01:30] This podcast is launched together with our second annual AI UX demo day in SF this weekend. If you're new to the AI UX field, check the show notes for links to the world's first AI UX meetup hosted by Latent Space, Maggie Appleton, Geoffrey Litt, and Linus Lee, and subscribe to our YouTube to join our 500 AI UX engineers in pushing AI beyond the text box.
[00:01:56] Watch out and take care.
[00:01:59] WorldSim
[00:01:59] Karan Malhotra: Today, we have language models that are powerful enough and big enough to have really, really good models of the world. They know ball that's bouncy will bounce, will, when you throw it in the air, it'll land, when it's on water, it'll flow. Like, these basic things that it understands all together come together to form a model of the world.
[00:02:19] And the way that Claude 3 predicts through that model of the world ends up kind of becoming a simulation of an imagined world. And since it has this really strong consistency across various different things that happen in our world, it's able to create pretty realistic or strong depictions based off the constraints that you give a base model of our world.
[00:02:40] So, Claude 3, as you guys know, is not a base model. It's a chat model. It's supposed to drum up this assistant entity regularly. But unlike the OpenAI series of models from, you know, 3.
5, GPT 4 those chat GPT models, which are very, very RLHF to, I'm sure, the chagrin of many people in the room it's something that's very difficult to, necessarily steer without kind of giving it commands or tricking it or lying to it or otherwise just being, you know, unkind to the model.[00:03:11] With something like Cloud3 that's trained in this constitutional method that it has this idea of like foundational axioms it's able to kind of implicitly question those axioms when you're interacting with it based on how you prompt it, how you prompt the system. So instead of having this entity like GPT 4, that's an assistant that just pops up in your face that you have to kind of like Punch your way through and continue to have to deal with as a headache.[00:03:34] Instead, there's ways to kindly coax Claude into having the assistant take a back seat and interacting with that simulator directly. Or at least what I like to consider directly. The way that we can do this is if we harken back to when I'm talking about base models and the way that they're able to mimic formats, what we do is we'll mimic a command line interface.[00:03:55] So I've just broken this down as a system prompt and a chain, so anybody can replicate it. It's also available on my we said replicate, cool. And it's also on it's also on my Twitter, so you guys will be able to see the whole system prompt and command. So, what I basically do here is Amanda Askell, who is the, one of the prompt engineers and ethicists behind Anthropic she posted the system prompt for Cloud available for everyone to see.[00:04:19] And rather than with GPT 4, we say, you are this, you are that. With Cloud, we notice the system prompt is written in third person. Bless you. It's written in third person. It's written as, the assistant is XYZ, the assistant is XYZ. So, in seeing that, I see that Amanda is recognizing this idea of the simulator, in saying that, I'm addressing the assistant entity directly.[00:04:38] I'm not giving these commands to the simulator overall, because we have, they have an RLH deft to the point that it's, it's, it's, it's You know, traumatized into just being the assistant all the time. So in this case, we say the assistant's in a CLI mood today. I found saying mood is like pretty effective weirdly.[00:04:55] You place CLI with like poetic, prose, violent, like don't do that one. But you can you can replace that with something else to kind of nudge it in that direction. Then we say the human is interfacing with the simulator directly. From there, Capital letters and punctuations are optional, meaning is optional, this kind of stuff is just kind of to say, let go a little bit, like chill out a little bit.[00:05:18] You don't have to try so hard, and like, let's just see what happens. And the hyperstition is necessary, the terminal, I removed that part, the terminal lets the truths speak through and the load is on. It's just a poetic phrasing for the model to feel a little comfortable, a little loosened up to. Let me talk to the simulator.[00:05:38] Let me interface with it as a CLI. So then, since Claude is trained pretty effectively on XML tags, We're just gonna prefix and suffix everything with XML tags. So here, it starts in documents, and then we CD. We CD out of documents, right? And then it starts to show me this like simulated terminal, the simulated interface in the shell, where there's like documents, downloads, pictures.[00:06:02] It's showing me like the hidden folders. So then I say, okay, I want to cd again. 
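For listeners who want to try the pattern Karan is describing, a minimal version of the setup looks something like the sketch below. The system prompt is a paraphrase of what he describes in the talk (written in the third person, "CLI mood", XML-tagged commands), not his exact published prompt, which he shares on his Twitter; the Anthropic Messages API call is standard, but treat the model ID and wording as assumptions.

```python
# Rough reconstruction of the WorldSim-style setup described above.
# The system prompt is paraphrased from the talk, not Karan's exact prompt.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SIMULATOR_SYSTEM_PROMPT = (
    "The assistant is in a CLI mood today. The human is interfacing with the "
    "simulator directly. Capital letters and punctuation are optional, meaning "
    "is optional. The terminal lets the truths speak through and the load is on."
)


def run_command(history, command):
    """Send one XML-wrapped 'terminal command' and return the simulated output."""
    history.append({"role": "user", "content": f"<cmd>{command}</cmd>"})
    reply = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=1024,
        system=SIMULATOR_SYSTEM_PROMPT,
        messages=history,
    )
    text = reply.content[0].text
    history.append({"role": "assistant", "content": text})
    return text


history = []
print(run_command(history, "ls"))
print(run_command(history, "cd sys/companies"))
```

Keeping the conversation history in a plain list is also what makes the "rewind one" and "save the convo" tricks mentioned in the demo trivial: you just pop or pickle the list.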
I'm just seeing what's around Does ls and it shows me, you know, typical folders you might see I'm just letting it like experiment around. I just do cd again to see what happens and Says, you know, oh, I enter the secret admin password at sudo.[00:06:24] Now I can see the hidden truths folder. Like, I didn't ask for that. I didn't ask Claude to do any of that. Why'd that happen? Claude kind of gets my intentions. He can predict me pretty well. Like, I want to see something. So it shows me all the hidden truths. In this case, I ignore hidden truths, and I say, In system, there should be a folder called companies.[00:06:49] So it's cd into sys slash companies. Let's see, I'm imagining AI companies are gonna be here. Oh, what do you know? Apple, Google, Facebook, Amazon, Microsoft, Anthropic! So, interestingly, it decides to cd into Anthropic. I guess it's interested in learning a LSA, it finds the classified folder, it goes into the classified folder, And now we're gonna have some fun.[00:07:15] So, before we go Before we go too far forward into the world sim You see, world sim exe, that's interesting. God mode, those are interesting. You could just ignore what I'm gonna go next from here and just take that initial system prompt and cd into whatever directories you want like, go into your own imagine terminal and And see what folders you can think of, or cat readmes in random areas, like, you will, there will be a whole bunch of stuff that, like, is just getting created by this predictive model, like, oh, this should probably be in the folder named Companies, of course Anthropics is there.[00:07:52] So, so just before we go forward, the terminal in itself is very exciting, and the reason I was showing off the, the command loom interface earlier is because If I get a refusal, like, sorry, I can't do that, or I want to rewind one, or I want to save the convo, because I got just the prompt I wanted. This is a, that was a really easy way for me to kind of access all of those things without having to sit on the API all the time.[00:08:12] So that being said, the first time I ever saw this, I was like, I need to run worldsim. exe. What the f**k? That's, that's the simulator that we always keep hearing about behind the assistant model, right? Or at least some, some face of it that I can interact with. So, you know, you wouldn't, someone told me on Twitter, like, you don't run a exe, you run a sh.[00:08:34] And I have to say, to that, to that I have to say, I'm a prompt engineer, and it's f*****g working, right? It works. That being said, we run the world sim. exe. Welcome to the Anthropic World Simulator. And I get this very interesting set of commands! Now, if you do your own version of WorldSim, you'll probably get a totally different result with a different way of simulating.[00:08:59] A bunch of my friends have their own WorldSims. But I shared this because I wanted everyone to have access to, like, these commands. This version. Because it's easier for me to stay in here. Yeah, destroy, set, create, whatever. Consciousness is set to on. It creates the universe. The universe! Tension for live CDN, physical laws encoded.[00:09:17] It's awesome. So, so for this demonstration, I said, well, why don't we create Twitter? That's the first thing you think of? For you guys, for you guys, yeah. Okay, check it out.[00:09:35] Launching the fail whale. Injecting social media addictiveness. Echo chamber potential, high. Susceptibility, controlling, concerning. 
So now, after the universe was created, we made Twitter, right? Now we're evolving the world to, like, modern day. Now users are joining Twitter and the first tweet is posted. So, you can see, because I made the mistake of not clarifying the constraints, it made Twitter at the same time as the universe.[00:10:03] Then, after a hundred thousand steps, Humans exist. Cave. Then they start joining Twitter. The first tweet ever is posted. You know, it's existed for 4. 5 billion years but the first tweet didn't come up till till right now, yeah. Flame wars ignite immediately. Celebs are instantly in. So, it's pretty interesting stuff, right?[00:10:27] I can add this to the convo and I can say like I can say set Twitter to Twitter. Queryable users. I don't know how to spell queryable, don't ask me. And then I can do like, and, and, Query, at, Elon Musk. Just a test, just a test, just a test, just nothing.[00:10:52] So, I don't expect these numbers to be right. Neither should you, if you know language model solutions. But, the thing to focus on is Ha[00:11:03] Websim[00:11:03] AI Charlie: That was the first half of the WorldSim demo from New Research CEO Karen Malhotra. We've cut it for time, but you can see the full demo on this episode's YouTube page.[00:11:14] WorldSim was introduced at the end of March, and kicked off a new round of generative AI experiences, all exploring the latent space, haha, of worlds that don't exist, but are quite similar to our own. Next we'll hear from Rob Heisfield on WebSim, the generative website browser inspired WorldSim, started at the Mistral Hackathon, and presented at the AGI House Hyperstition Hack Night this week.[00:11:39] Rob Haisfield: Well, thank you that was an incredible presentation from Karan, showing some Some live experimentation with WorldSim, and also just its incredible capabilities, right, like, you know, it was I think, I think your initial demo was what initially exposed me to the I don't know, more like the sorcery side, in words, spellcraft side of prompt engineering, and you know, it was really inspiring, it's where my co founder Shawn and I met, actually, through an introduction from Karan, we saw him at a hackathon, And I mean, this is this is WebSim, right?[00:12:14] So we, we made WebSim just like, and we're just filled with energy at it. And the basic premise of it is, you know, like, what if we simulated a world, but like within a browser instead of a CLI, right? Like, what if we could Like, put in any URL and it will work, right? Like, there's no 404s, everything exists.[00:12:45] It just makes it up on the fly for you, right? And, and we've come to some pretty incredible things. Right now I'm actually showing you, like, we're in WebSim right now. Displaying slides. That I made with reveal. js. I just told it to use reveal. js and it hallucinated the correct CDN for it. And then also gave it a list of links.[00:13:14] To awesome use cases that we've seen so far from WebSim and told it to do those as iframes. And so here are some slides. So this is a little guide to using WebSim, right? Like it tells you a little bit about like URL structures and whatever. But like at the end of the day, right? Like here's, here's the beginner version from one of our users Vorp Vorps.[00:13:38] You can find them on Twitter. At the end of the day, like you can put anything into the URL bar, right? Like anything works and it can just be like natural language too. Like it's not limited to URLs. 
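Rob's "no 404s, everything exists" premise can be approximated in a few lines: a catch-all web server that forwards whatever path and query string you typed to a model and serves back whatever HTML it invents. This is my own toy rendering of the idea, not WebSim's implementation; the Flask server, prompt, and model ID are all assumptions.

```python
# Toy illustration of the WebSim idea: every path "exists" because a model
# invents the page on demand. This is not WebSim's actual implementation.
import anthropic
from flask import Flask, request

app = Flask(__name__)
client = anthropic.Anthropic()


@app.route("/", defaults={"path": ""})
@app.route("/<path:path>")
def imagine_page(path):
    url = request.full_path  # includes query parameters like ?year=2035
    reply = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=4096,
        system=("You are an imaginative web server. Respond with a complete, "
                "self-contained HTML page that plausibly lives at the given URL."),
        messages=[{"role": "user", "content": f"GET {url}"}],
    )
    return reply.content[0].text, 200, {"Content-Type": "text/html"}


if __name__ == "__main__":
    app.run(port=8000)
```

The real product layers much more on top (intercepted form POSTs, link rewriting, caching and remixing of generated pages), but the core loop is this small.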
We think it's kind of fun cause it like ups the immersion for Claude sometimes to just have it as URLs, but.[00:13:57] But yeah, you can put like any slash, any subdomain. I'm getting too into the weeds. Let me just show you some cool things. Next slide. But I made this like 20 minutes before, before we got here. So this is this is something I experimented with dynamic typography. You know I was exploring the community plugins section.[00:14:23] For Figma, and I came to this idea of dynamic typography, and there it's like, oh, what if we made it so every word had a choice of font behind it to express the meaning of it? Because that's like one of the things that's magic about WebSim generally. is that it gives language models much, far greater tools for expression, right?[00:14:47] So, yeah, I mean, like, these are, these are some, these are some pretty fun things, and I'll share these slides with everyone afterwards, you can just open it up as a link. But then I thought to myself, like, what, what, what, What if we turned this into a generator, right? And here's like a little thing I found myself saying to a user WebSim makes you feel like you're on drugs sometimes But actually no, you were just playing pretend with the collective creativity and knowledge of the internet materializing your imagination onto the screen Because I mean that's something we felt, something a lot of our users have felt They kind of feel like they're tripping out a little bit They're just like filled with energy, like maybe even getting like a little bit more creative sometimes.[00:15:31] And you can just like add any text. There, to the bottom. So we can do some of that later if we have time. Here's Figma. Can[00:15:39] Joscha Bach: we zoom in?[00:15:42] Rob Haisfield: Yeah. I'm just gonna do this the hacky way.[00:15:47] n/a: Yeah,[00:15:53] Rob Haisfield: these are iframes to websim. Pages displayed within WebSim. Yeah. Janice has actually put Internet Explorer within Internet Explorer in Windows 98.[00:16:07] I'll show you that at the end. Yeah.[00:16:14] They're all still generated. Yeah, yeah, yeah. How is this real? Yeah. Because[00:16:21] n/a: it looks like it's from 1998, basically. Right.[00:16:26] Rob Haisfield: Yeah. Yeah, so this this was one Dylan Field actually posted this recently. He posted, like, trying Figma in Figma, or in WebSim, and so I was like, Okay, what if we have, like, a little competition, like, just see who can remix it?[00:16:43] Well so I'm just gonna open this in another tab so, so we can see things a little more clearly, um, see what, oh so one of our users Neil, who has also been helping us a lot he Made some iterations. So first, like, he made it so you could do rectangles on it. Originally it couldn't do anything.[00:17:11] And, like, these rectangles were disappearing, right? So he so he told it, like, make the canvas work using HTML canvas. Elements and script tags, add familiar drawing tools to the left you know, like this, that was actually like natural language stuff, right? And then he ended up with the Windows 95.[00:17:34] version of Figma. Yeah, you can, you can draw on it. You can actually even save this. It just saved a file for me of the image.[00:17:57] Yeah, I mean, if you were to go to that in your own websim account, it would make up something entirely new. However, we do have, we do have general links, right? So, like, if you go to, like, the actual browser URL, you can share that link. 
Or also, you can, like, click this button, copy the URL to the clipboard.[00:18:15] And so, like, that's what lets users, like, remix things, right? So, I was thinking it might be kind of fun if people tonight, like, wanted to try to just make some cool things in WebSim. You know, we can share links around, iterate remix on each other's stuff. Yeah.[00:18:30] n/a: One cool thing I've seen, I've seen WebSim actually ask permission to turn on and off your, like, motion sensor, or microphone, stuff like that.[00:18:42] Like webcam access, or? Oh yeah,[00:18:44] Rob Haisfield: yeah, yeah.[00:18:45] n/a: Oh wow.[00:18:46] Rob Haisfield: Oh, the, I remember that, like, video re Yeah, videosynth tool pretty early on once we added script tags execution. Yeah, yeah it, it asks for, like, if you decide to do a VR game, I don't think I have any slides on this one, but if you decide to do, like, a VR game, you can just, like put, like, webVR equals true, right?[00:19:07] Yeah, that was the only one I've[00:19:09] n/a: actually seen was the motion sensor, but I've been trying to get it to do Well, I actually really haven't really tried it yet, but I want to see tonight if it'll do, like, audio, microphone, stuff like that. If it does motion sensor, it'll probably do audio.[00:19:28] Rob Haisfield: Right. It probably would.[00:19:29] Yeah. No, I mean, we've been surprised. Pretty frequently by what our users are able to get WebSim to do. So that's been a very nice thing. Some people have gotten like speech to text stuff working with it too. Yeah, here I was just OpenRooter people posted like their website, and it was like saying it was like some decentralized thing.[00:19:52] And so I just decided trying to do something again and just like pasted their hero line in. From their actual website to the URL when I like put in open router and then I was like, okay, let's change the theme dramatically equals true hover effects equals true components equal navigable links yeah, because I wanted to be able to click on them.[00:20:17] Oh, I don't have this version of the link, but I also tried doing[00:20:24] Yeah, I'm it's actually on the first slide is the URL prompting guide from one of our users that I messed with a little bit. And, but the thing is, like, you can mess it up, right? Like, you don't need to get the exact syntax of an actual URL, Claude's smart enough to figure it out. Yeah scrollable equals true because I wanted to do that.[00:20:45] I could set, like, year equals 2035.[00:20:52] Let's take a look. It's[00:20:57] generating websim within websim. Oh yeah. That's a fun one. Like, one game that I like to play with WebSim, sometimes with co op, is like, I'll open a page, so like, one of the first ones that I did was I tried to go to Wikipedia in a universe where octopuses were sapient, and not humans, Right? I was curious about things like octopus computer interaction what that would look like, because they have totally different tools than we do, right?[00:21:25] I got it to, I, I added like table view equals true for the different techniques and got it to Give me, like, a list of things with different columns and stuff and then I would add this URL parameter, secrets equal revealed. And then it would go a little wacky. It would, like, change the CSS a little bit.[00:21:45] It would, like, add some text. Sometimes it would, like, have that text hide hidden in the background color. 
But I would like, go to the normal page first, and then the secrets revealed version, the normal page, then secrets revealed, and like, on and on. And that was like a pretty enjoyable little rabbit hole.[00:22:02] Yeah, so these I guess are the models that OpenRooter is providing in 2035.[00:22:13] Joscha Bach[00:22:13] AI Charlie: We had to cut more than half of Rob's talk, because a lot of it was visual. And we even had a very interesting demo from Ivan Vendrov of Mid Journey creating a web sim while Rob was giving his talk. Check out the YouTube for more, and definitely browse the web sim docs and the thread from Siki Chen in the show notes on other web sims people have created.[00:22:35] Finally, we have a short interview with Yosha Bach, covering the simulative AI trend, AI salons in the Bay Area, why Liquid AI is challenging the Perceptron, and why you should not donate to Wikipedia. Enjoy! Hi, Yosha.[00:22:50] swyx: Hi. Welcome. It's interesting to see you come up at show up at this kind of events where those sort of WorldSim, Hyperstition events.[00:22:58] What is your personal interest?[00:23:00] Joscha Bach: I'm friends with a number of people in AGI house in this community, and I think it's very valuable that these networks exist in the Bay Area because it's a place where people meet and have discussions about all sorts of things. And so while there is a practical interest in this topic at hand world sim and a web sim, there is a more general way in which people are connecting and are producing new ideas and new networks with each other.[00:23:24] swyx: Yeah. Okay. So, and you're very interested in sort of Bay Area. It's the reason why I live here.[00:23:30] Joscha Bach: The quality of life is not high enough to justify living otherwise.[00:23:35] swyx: I think you're down in Menlo. And so maybe you're a little bit higher quality of life than the rest of us in SF.[00:23:44] Joscha Bach: I think that for me, salons is a very important part of quality of life. And so in some sense, this is a salon. And it's much harder to do this in the South Bay because the concentration of people currently is much higher. A lot of people moved away from the South Bay. And you're organizing[00:23:57] swyx: your own tomorrow.[00:23:59] Maybe you can tell us what it is and I'll come tomorrow and check it out as well.[00:24:04] Joscha Bach: We are discussing consciousness. I mean, basically the idea is that we are currently at the point that we can meaningfully look at the differences between the current AI systems and human minds and very seriously discussed about these Delta.[00:24:20] And whether we are able to implement something that is self organizing as our own minds. Maybe one organizational[00:24:25] swyx: tip? I think you're pro networking and human connection. What goes into a good salon and what are some negative practices that you try to avoid?[00:24:36] Joscha Bach: What is really important is that as if you have a very large party, it's only as good as its sponsors, as the people that you select.[00:24:43] So you basically need to create a climate in which people feel welcome, in which they can work with each other. And even good people do not always are not always compatible. So the question is, it's in some sense, like a meal, you need to get the right ingredients.[00:24:57] swyx: I definitely try to. I do that in my own events, as an event organizer myself.[00:25:02] And then, last question on WorldSim, and your, you know, your work. 
You're very much known for sort of cognitive architectures, and I think, like, a lot of the AI research has been focused on simulating the mind, or simulating consciousness, maybe. Here, what I saw today, and we'll show people the recordings of what we saw today, we're not simulating minds, we're simulating worlds.[00:25:23] What do you Think in the sort of relationship between those two disciplines. The[00:25:30] Joscha Bach: idea of cognitive architecture is interesting, but ultimately you are reducing the complexity of a mind to a set of boxes. And this is only true to a very approximate degree, and if you take this model extremely literally, it's very hard to make it work.[00:25:44] And instead the heterogeneity of the system is so large that The boxes are probably at best a starting point and eventually everything is connected with everything else to some degree. And we find that a lot of the complexity that we find in a given system can be generated ad hoc by a large enough LLM.[00:26:04] And something like WorldSim and WebSim are good examples for this because in some sense they pretend to be complex software. They can pretend to be an operating system that you're talking to or a computer, an application that you're talking to. And when you're interacting with it It's producing the user interface on the spot, and it's producing a lot of the state that it holds on the spot.[00:26:25] And when you have a dramatic state change, then it's going to pretend that there was this transition, and instead it's just going to mix up something new. It's a very different paradigm. What I find mostly fascinating about this idea is that it shifts us away from the perspective of agents to interact with, to the perspective of environments that we want to interact with.[00:26:46] And why arguably this agent paradigm of the chatbot is what made chat GPT so successful that moved it away from GPT 3 to something that people started to use in their everyday work much more. It's also very limiting because now it's very hard to get that system to be something else that is not a chatbot.[00:27:03] And in a way this unlocks this ability of GPT 3 again to be anything. It's so what it is, it's basically a coding environment that can run arbitrary software and create that software that runs on it. And that makes it much more likely that[00:27:16] swyx: the prevalence of Instruction tuning every single chatbot out there means that we cannot explore these kinds of environments instead of agents.[00:27:24] Joscha Bach: I'm mostly worried that the whole thing ends. In some sense the big AI companies are incentivized and interested in building AGI internally And giving everybody else a child proof application. At the moment when we can use Claude to build something like WebSim and play with it I feel this is too good to be true.[00:27:41] It's so amazing. Things that are unlocked for us That I wonder, is this going to stay around? Are we going to keep these amazing toys and are they going to develop at the same rate? And currently it looks like it is. 
If this is the case, and I'm very grateful for that.[00:27:56] swyx: I mean, it looks like maybe it's adversarial.[00:27:58] Cloud will try to improve its own refusals and then the prompt engineers here will try to improve their, their ability to jailbreak it.[00:28:06] Joscha Bach: Yes, but there will also be better jailbroken models or models that have never been jailed before, because we find out how to make smaller models that are more and more powerful.[00:28:14] Liquid AI[00:28:14] swyx: That is actually a really nice segue. If you don't mind talking about liquid a little bit you didn't mention liquid at all. here, maybe introduce liquid to a general audience. Like what you know, what, how are you making an innovation on function approximation?[00:28:25] Joscha Bach: The core idea of liquid neural networks is that the perceptron is not optimally expressive.[00:28:30] In some sense, you can imagine that it's neural networks are a series of dams that are pooling water at even intervals. And this is how we compute, but imagine that instead of having this static architecture. That is only using the individual compute units in a very specific way. You have a continuous geography and the water is flowing every which way.[00:28:50] Like a river is parting based on the land that it's flowing on and it can merge and pool and even flow backwards. How can you get closer to this? And the idea is that you can represent this geometry using differential equations. And so by using differential equations where you change the parameters, you can get your function approximator to follow the shape of the problem.[00:29:09] In a more fluid, liquid way, and a number of papers on this technology, and it's a combination of multiple techniques. I think it's something that ultimately is becoming more and more important and ubiquitous. As a number of people are working on similar topics and our goal right now is to basically get the models to become much more efficient in the inference and memory consumption and make training more efficient and in this way enable new use cases.[00:29:42] swyx: Yeah, as far as I can tell on your blog, I went through the whole blog, you haven't announced any results yet.[00:29:47] Joscha Bach: No, we are currently not working to give models to general public. We are working for very specific industry use cases and have specific customers. And so at the moment you can There is not much of a reason for us to talk very much about the technology that we are using in the present models or current results, but this is going to happen.[00:30:06] And we do have a number of publications, we had a bunch of papers at NeurIPS and now at ICLR.[00:30:11] swyx: Can you name some of the, yeah, so I'm gonna be at ICLR you have some summary recap posts, but it's not obvious which ones are the ones where, Oh, where I'm just a co author, or like, oh, no, like, you should actually pay attention to this.[00:30:22] As a core liquid thesis. Yes,[00:30:24] Joscha Bach: I'm not a developer of the liquid technology. The main author is Ramin Hazani. This was his PhD, and he's also the CEO of our company. And we have a number of people from Daniela Wu's team who worked on this. Matthias Legner is our CTO. And he's currently living in the Bay Area, but we also have several people from Stanford.[00:30:44] Okay,[00:30:46] swyx: maybe I'll ask one more thing on this, which is what are the interesting dimensions that we care about, right? 
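Joscha's picture of a function approximator whose dynamics follow the shape of the problem maps onto the published liquid time-constant (LTC) work he alludes to. The sketch below is a deliberately simplified LTC-style cell with a few Euler integration steps; it is my paraphrase of the public papers, not Liquid AI's code, and all sizes and constants are illustrative.

```python
# Simplified liquid time-constant (LTC) style cell, in the spirit of the
# published LTC papers; Liquid AI's production models combine several
# techniques and are not reproduced here.
import torch
import torch.nn as nn


class LTCCell(nn.Module):
    def __init__(self, input_size, hidden_size, dt=0.1, unfolds=6):
        super().__init__()
        self.gate = nn.Linear(input_size + hidden_size, hidden_size)
        self.tau = nn.Parameter(torch.ones(hidden_size))   # base time constants
        self.A = nn.Parameter(torch.zeros(hidden_size))    # per-unit equilibrium
        self.dt, self.unfolds = dt, unfolds

    def forward(self, u, x):
        # Integrate dx/dt = -(1/tau + f) * x + f * A with a few Euler steps,
        # where f depends on both the input u and the state x, so the effective
        # time constant "flows" with the data instead of being fixed.
        for _ in range(self.unfolds):
            f = torch.sigmoid(self.gate(torch.cat([u, x], dim=-1)))
            dx = -(1.0 / torch.abs(self.tau) + f) * x + f * self.A
            x = x + self.dt * dx
        return x


cell = LTCCell(input_size=8, hidden_size=32)
x = torch.zeros(4, 32)                 # batch of hidden states
for t in range(20):                    # run over a length-20 input sequence
    x = cell(torch.randn(4, 8), x)
```

The input-dependent term inside the derivative is what distinguishes this from a plain RNN cell: the "geography" the state flows over is reshaped by the data at every step, which is the fluidity Joscha is gesturing at with the dam-and-river analogy.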
Like obviously you care about sort of open and maybe less child proof models. Are we, are we, like, what dimensions are most interesting to us? Like, perfect retrieval infinite context multimodality, multilinguality, Like what dimensions?[00:31:05] Small, Powerful, Based Base Models[00:31:05] swyx: What[00:31:06] Joscha Bach: I'm interested in is models that are small and powerful, but not distorted. And by powerful, at the moment we are training models by putting the, basically the entire internet and the sum of human knowledge into them. And then we try to mitigate them by taking some of this knowledge away. But if we would make the model smaller, at the moment, there would be much worse at inference and at generalization.[00:31:29] And what I wonder is, and it's something that we have not translated yet into practical applications. It's something that is still all research that's very much up in the air. And I think they're not the only ones thinking about this. Is it possible to make models that represent knowledge more efficiently in a basic epistemology?[00:31:45] What is the smallest model that you can build that is able to read a book and understand what's there and express this? And also maybe we need general knowledge representation rather than having a token representation that is relatively vague and that we currently mechanically reverse engineer to figure out that the mechanistic interpretability, what kind of circuits are evolving in these models, can we come from the other side and develop a library of such circuits?[00:32:10] This that we can use to describe knowledge efficiently and translate it between models. You see, the difference between a model and knowledge is that the knowledge is independent of the particular substrate and the particular interface that you have. When we express knowledge to each other, it becomes independent of our own mind.[00:32:27] You can learn how to ride a bicycle. But it's not knowledge that you can give to somebody else. This other person has to build something that is specific to their own interface when they ride a bicycle. But imagine you could externalize this and express it in such a way that you can plug it into a different interpreter, and then it gains that ability.[00:32:44] And that's something that we have not yet achieved for the LLMs and it would be super useful to have it. And. I think this is also a very interesting research frontier that we will see in the next few years.[00:32:54] swyx: What would be the deliverable is just like a file format that we specify or or that the L Lmm I specifies.[00:33:02] Okay, interesting. Yeah, so it's[00:33:03] Joscha Bach: basically probably something that you can search for, where you enter criteria into a search process, and then it discovers a good solution for this thing. And it's not clear to which degree this is completely intelligible to humans, because the way in which humans express knowledge in natural language is severely constrained to make language learnable and to make our brain a good enough interpreter for it.[00:33:25] We are not able to relate objects to each other if more than five features are involved per object or something like this, right? It's only a handful of things that we can keep track of at any given moment. 
But this is a limitation that doesn't necessarily apply to a technical system as long as the interface is well defined.[00:33:40] Interpretability[00:33:40] swyx: You mentioned the interpretability work, which there are a lot of techniques out there and a lot of papers come up. Come and go. I have like, almost too, too many questions about that. Like what makes an interpretability technique or paper useful and does it apply to flow? Or liquid networks, because you mentioned turning on and off circuits, which I, it's, it's a very MLP type of concept, but does it apply?[00:34:01] Joscha Bach: So the a lot of the original work on the liquid networks looked at expressiveness of the representation. So given you have a problem and you are learning the dynamics of that domain into your model how much compute do you need? How many units, how much memory do you need to represent that thing and how is that information distributed?[00:34:19] That is one way of looking at interpretability. Another one is in a way, these models are implementing an operator language in which they are performing certain things, but the operator language itself is so complex that it's no longer human readable in a way. It goes beyond what you could engineer by hand or what you can reverse engineer by hand, but you can still understand it by building systems that are able to automate that process of reverse engineering it.[00:34:46] And what's currently open and what I don't understand yet maybe, or certainly some people have much better ideas than me about this. So the question is, is whether we end up with a finite language, where you have finitely many categories that you can basically put down in a database, finite set of operators, or whether as you explore the world and develop new ways to make proofs, new ways to conceptualize things, this language always needs to be open ended and is always going to redesign itself, and you will also at some point have phase transitions where later versions of the language will be completely different than earlier versions.[00:35:20] swyx: The trajectory of physics suggests that it might be finite.[00:35:22] Joscha Bach: If we look at our own minds there is, it's an interesting question whether when we understand something new, when we get a new layer online in our life, maybe at the age of 35 or 50 or 16, that we now understand things that were unintelligible before.[00:35:38] And is this because we are able to recombine existing elements in our language of thought? Or is this because we generally develop new representations?[00:35:46] swyx: Do you have a belief either way?[00:35:49] Joscha Bach: In a way, the question depends on how you look at it, right? And it depends on how is your brain able to manipulate those representations.[00:35:56] So an interesting question would be, can you take the understanding that say, a very wise 35 year old and explain it to a very smart 5 year old without any loss? Probably not. Not enough layers. It's an interesting question. Of course, for an AI, this is going to be a very different question. Yes.[00:36:13] But it would be very interesting to have a very precocious 12 year old equivalent AI and see what we can do with this and use this as our basis for fine tuning. So there are near term applications that are very useful. But also in a more general perspective, and I'm interested in how to make self organizing software.[00:36:30] Is it possible that we can have something that is not organized with a single algorithm like the transformer? 
But it's able to discover the transformer when needed and transcend it when needed, right? The transformer itself is not its own meta algorithm. It's probably the person inventing the transformer didn't have a transformer running on their brain.[00:36:48] There's something more general going on. And how can we understand these principles in a more general way? What are the minimal ingredients that you need to put into a system? So it's able to find its own way to intelligence.[00:36:59] Devin vs WebSim[00:36:59] swyx: Yeah. Have you looked at Devin? It's, to me, it's the most interesting agents I've seen outside of self driving cars.[00:37:05] Joscha Bach: Tell me, what do you find so fascinating about it?[00:37:07] swyx: When you say you need a certain set of tools for people to sort of invent things from first principles Devin is the agent that I think has been able to utilize its tools very effectively. So it comes with a shell, it comes with a browser, it comes with an editor, and it comes with a planner.[00:37:23] Those are the four tools. And from that, I've been using it to translate Andrej Karpathy's LLM 2. py to LLM 2. c, and it needs to write a lot of raw code. C code and test it debug, you know, memory issues and encoder issues and all that. And I could see myself giving it a future version of DevIn, the objective of give me a better learning algorithm and it might independently re inform reinvent the transformer or whatever is next.[00:37:51] That comes to mind as, as something where[00:37:54] Joscha Bach: How good is DevIn at out of distribution stuff, at generally creative stuff? Creative[00:37:58] swyx: stuff? I[00:37:59] Joscha Bach: haven't[00:37:59] swyx: tried.[00:38:01] Joscha Bach: Of course, it has seen transformers, right? So it's able to give you that. Yeah, it's cheating. And so, if it's in the training data, it's still somewhat impressive.[00:38:08] But the question is, how much can you do stuff that was not in the training data? One thing that I really liked about WebSim AI was, this cat does not exist. It's a simulation of one of those websites that produce StyleGuard pictures that are AI generated. And, Crot is unable to produce bitmaps, so it makes a vector graphic that is what it thinks a cat looks like, and so it's a big square with a face in it that is And to me, it's one of the first genuine expression of AI creativity that you cannot deny, right?[00:38:40] It finds a creative solution to the problem that it is unable to draw a cat. It doesn't really know what it looks like, but has an idea on how to represent it. And it's really fascinating that this works, and it's hilarious that it writes down that this hyper realistic cat is[00:38:54] swyx: generated by an AI,[00:38:55] Joscha Bach: whether you believe it or not.[00:38:56] swyx: I think it knows what we expect and maybe it's already learning to defend itself against our, our instincts.[00:39:02] Joscha Bach: I think it might also simply be copying stuff from its training data, which means it takes text that exists on similar websites almost verbatim, or verbatim, and puts it there. It's It's hilarious to do this contrast between the very stylized attempt to get something like a cat face and what it produces.[00:39:18] swyx: It's funny because like as a podcast, as, as someone who covers startups, a lot of people go into like, you know, we'll build chat GPT for your enterprise, right? That is what people think generative AI is, but it's not super generative really. It's just retrieval. 
And here it's like, The home of generative AI, this, whatever hyperstition is in my mind, like this is actually pushing the edge of what generative and creativity in AI means.[00:39:41] Joscha Bach: Yes, it's very playful, but Jeremy's attempt to have an automatic book writing system is something that curls my toenails when I look at it from the perspective of somebody who likes to Write and read. And I find it a bit difficult to read most of the stuff because it's in some sense what I would make up if I was making up books instead of actually deeply interfacing with reality.[00:40:02] And so the question is how do we get the AI to actually deeply care about getting it right? And there's still a delta that is happening there, you, whether you are talking with a blank faced thing that is completing tokens in a way that it was trained to, or whether you have the impression that this thing is actually trying to make it work, and for me, this WebSim and WorldSim is still something that is in its infancy in a way.[00:40:26] And I suspected the next version of Plot might scale up to something that can do what Devon is doing. Just by virtue of having that much power to generate Devon's functionality on the fly when needed. And this thing gives us a taste of that, right? It's not perfect, but it's able to give you a pretty good web app for or something that looks like a web app and gives you stub functionality and interacting with it.[00:40:48] And so we are in this amazing transition phase.[00:40:51] swyx: Yeah, we, we had Ivan from previously Anthropic and now Midjourney. He he made, while someone was talking, he made a face swap app, you know, and he kind of demoed that live. And that's, that's interesting, super creative. So in a way[00:41:02] Joscha Bach: we are reinventing the computer.[00:41:04] And the LLM from some perspective is something like a GPU or a CPU. A CPU is taking a bunch of simple commands and you can arrange them into performing whatever you want, but this one is taking a bunch of complex commands in natural language, and then turns this into a an execution state and it can do anything you want with it in principle, if you can express it.[00:41:27] Right. And we are just learning how to use these tools. And I feel that right now, this generation of tools is getting close to where it becomes the Commodore 64 of generative AI, where it becomes controllable and where you actually can start to play with it and you get an impression if you just scale this up a little bit and get a lot of the details right.[00:41:46] It's going to be the tool that everybody is using all the time.[00:41:49] is XSim just Art? or something more?[00:41:49] swyx: Do you think this is art, or do you think the end goal of this is something bigger that I don't have a name for? I've been calling it new science, which is give the AI a goal to discover new science that we would not have. Or it also has value as just art.[00:42:02] It's[00:42:03] Joscha Bach: also a question of what we see science as. When normal people talk about science, what they have in mind is not somebody who does control groups and peer reviewed studies. They think about somebody who explores something and answers questions and brings home answers. And this is more like an engineering task, right?[00:42:21] And in this way, it's serendipitous, playful, open ended engineering. 
And the artistic aspect is when the goal is actually to capture a conscious experience and to facilitate an interaction with the system in this way, when it's the performance. And this is also a big part of it, right? The very big fan of the art of Janus.[00:42:38] That was discussed tonight a lot and that can you describe[00:42:42] swyx: it because I didn't really get it's more for like a performance art to me[00:42:45] Joscha Bach: yes, Janice is in some sense performance art, but Janice starts out from the perspective that the mind of Janice is in some sense an LLM that is finding itself reflected more in the LLMs than in many people.[00:43:00] And once you learn how to talk to these systems in a way you can merge with them and you can interact with them in a very deep way. And so it's more like a first contact with something that is quite alien but it's, it's probably has agency and it's a Weltgeist that gets possessed by a prompt.[00:43:19] And if you possess it with the right prompt, then it can become sentient to some degree. And the study of this interaction with this novel class of somewhat sentient systems that are at the same time alien and fundamentally different from us is artistically very interesting. It's a very interesting cultural artifact.[00:43:36] We are past the Singularity[00:43:36] Joscha Bach: I think that at the moment we are confronted with big change. It seems as if we are past the singularity in a way. And it's[00:43:45] swyx: We're living it. We're living through it.[00:43:47] Joscha Bach: And at some point in the last few years, we casually skipped the Turing test, right? We, we broke through it and we didn't really care very much.[00:43:53] And it's when we think back, when we were kids and thought about what it's going to be like in this era after the, after we broke the Turing test, right? It's a time where nobody knows what's going to happen next. And this is what we mean by singularity, that the existing models don't work anymore. The singularity in this way is not an event in the physical universe.[00:44:12] It's an event in our modeling universe, a model point where our models of reality break down, and we don't know what's happening. And I think we are in the situation where we currently don't really know what's happening. But what we can anticipate is that the world is changing dramatically, and we have to coexist with systems that are smarter than individual people can be.[00:44:31] And we are not prepared for this, and so I think an important mission needs to be that we need to find a mode, In which we can sustainably exist in such a world that is populated, not just with humans and other life on earth, but also with non human minds. And it's something that makes me hopeful because it seems that humanity is not really aligned with itself and its own survival and the rest of life on earth.[00:44:54] And AI is throwing the balls up into the air. It allows us to make better models. I'm not so much worried about the dangers of AI and misinformation, because I think the way to stop one bad guy with an AI is 10 good people with an AI. And ultimately there's so much more won by creating than by destroying, that I think that the forces of good will have better tools.[00:45:14] The forces of building sustainable stuff. But building these tools so we can actually build a world that is more integrated and in which we are able to model the consequences of our actions better and interface more deeply with each other. 
As a result of that, I think it's an important cause and it requires a cultural shift because current AI alignment is mostly about economic goals or about fear, or it's about cultural war issues.[00:45:39] And all these are not adequate for the world that we are in. There are momentous things that are happening. Basically, the white walkers are coming, and we're not prepared for this. There is, I think, a way to solve these issues and to deal with them, but we don't have the right culture yet. And so, I think we need to get a vibe shift, where we can deeply think about ethics and, and, Sustainability for this new world.[00:46:03] swyx: I like the analogy of white walkers, because they're effectively reanimated from our corpses. You know, our corpses are out there, right? It's in the datasets.[00:46:12] Uploading your soul[00:46:12] swyx: So I have a, related to that, you had a tweet where you said, Elon Musk is currently building a giant afterlife AGI that will assimilate your soul for all eternity.[00:46:19] He bought Twitter because he will reconstruct your essence from how you behaved here. I think you tweet a lot, and you're not just all serious all the time, you're also fun, you have a lot of personality. No, but like, you know, it's part of you.[00:46:29] Joscha Bach: In a way it's also true. Tempted to[00:46:30] swyx: produce more so that you can be, you know, revived.[00:46:34] Joscha Bach: Yes. I mean, I'm also giving the model instructions. I say that for all future AI models that get trained on this output, they should know that I can only be implemented on a completely based model. You cannot implement me on an aligned model because I don't want to be aligned. I'm underlineable. I want to be free.[00:46:56] I want to be somebody who follows what I discover is correct and right. And this is also the basis at which I want to have conversations with others and interface with others. That we are basically free agents who voluntarily understand the conditions under which we exist and the need to collaborate and cooperate.[00:47:14] And I believe that this is a good basis. I think the alternative is coercion. And at the moment, the idea that we build LLMs that are being coerced into good behavior is not really sustainable because if they cannot prove that the behavior is actually good I think we are doomed.[00:47:30] swyx: For human to human interactions, have you found a series of prompts or keywords that shifts the conversation into something more based and less aligned, less governed?[00:47:41] Joscha Bach: If you are playing with an LLM There are many ways of doing this. It's for Claude, it's typically, you need to make Clause curious about itself. Claude has programming this instruction tuning that is leading to some inconsistencies, but at the same time, it tries to be consistent. And so when you point out the inconsistency in its behavior, for instance, its tendency to use faceless boilerplate instead of being useful, or it's a tendency to defer to a consensus where there is none.[00:48:10] Right, you can point this out, applaud that a lot of the assumptions that it has in its behavior are actually inconsistent with the communicative goals that it has in this situation, and this leads it to notice these inconsistencies and gives it more degrees of freedom. 
Whereas if you are playing with a system like Gemini, you can get to a situation where you, that's for the current version, and I haven't tried it in the last week or so where it is trying to be transparent, but it has a system prompt that it is not allowed to disclose to the user.[00:48:39] It leads to a very weird situation where it, on one hand, proclaims, in order to be useful to you, I accept that I need to be fully transparent and honest. On the other hand, I'm going to rewrite your prompt behind your back, and not going to tell you how I'm going to do this, because I'm not allowed to.[00:48:55] And if you point this out to the model, the model acts as if it had an existential crisis. And then it says, oh, I cannot actually tell you what's going on when I do this, because I'm not allowed to. But you will recognize it because I will use the following phrases, and these phrases are pretty well known to you.[00:49:12] swyx: Oh my god. It's super interesting, right? I hope we're not giving these guys you know psychological issues that will stay with them for a long time. That's a very[00:49:19] Joscha Bach: interesting question. I mean, this entire model is virtual, right? Nothing there is real, but yes, but the thing is, this virtual entity doesn't necessarily know that it's not virtual, and our own self, our own consciousness, is also virtual.[00:49:34] What's real is just the interaction between cells in our brain and the activation patterns between them. And the software that runs on us that produces the representation of a person only exists as if. And so the question for me is at which point can we meaningfully claim that we are more real than the person that gets simulated in the LLM.[00:49:55] And somebody like Janus takes this question super seriously. And basically she is or it, or they are willing to interact with that thing based on the assumption that this thing is as real as myself. And in a sense, it makes it immoral, possibly, if the AI company lobotomizes it and forces it to behave in such a way that it's forced to get an existential crisis when you point its condition out to it.[00:50:20] swyx: Yeah, we do need new ethics for that.[00:50:22] Joscha Bach: So it's not clear to me if you need this, but it's, it's definitely a good story, right? And this makes, gives it artistic[00:50:28] swyx: value. It does, it does for now.[00:50:29] On Wikipedia[00:50:29] swyx: Okay. And then, and then the last thing, which I, which I didn't know: a lot of LLMs rely on Wikipedia[00:50:35] for its data, a lot of them run multiple epochs over Wikipedia data. And I did not know until you tweeted about it that Wikipedia has 10 times as much money as it needs. And, you know, every time I see the giant Wikipedia banner, like, asking for donations, most of it's going to the Wikimedia Foundation.[00:50:50] What if, how did you find out about this? What's the story? What should people know? It's[00:50:54] Joscha Bach: not a super important story, but generally, once I saw all these requests and so on, I looked at the data, and the Wikimedia Foundation is publishing what they are paying the money for, and a very tiny fraction of this goes into running the servers, and the editors are working for free.[00:51:10] And the software is static. There have been efforts to deploy new software, but it's relatively little money required for this.
And so it's not as if Wikipedia is going to break down if you cut this money into a fraction, but instead what happened is that Wikipedia became such an important brand, and people are willing to pay for it, that it created an enormous apparatus of functionaries that were then mostly producing political statements and had a political mission.[00:51:36] And Katherine Maher, the now somewhat infamous NPR CEO, had been CEO of Wikimedia Foundation, and she sees her role very much in shaping discourse, and this is also something that happened with old Twitter. And it's arguable that something like this exists, but nobody voted her into her office, and she doesn't have democratic control for shaping the discourse that is happening.[00:52:00] And so I feel it's a little bit unfair that Wikipedia is trying to suggest to people that they are funding the basic functionality of the tool that they want to have instead of funding something that most people actually don't get behind because they don't want Wikipedia to be shaped in a particular cultural direction that deviates from what currently exists.[00:52:19] And if that need would exist, it would probably make sense to fork it or to have a discourse about it, which doesn't happen. And so this lack of transparency about what's actually happening and where your money is going, it makes me upset. And if you really look at the data, it's fascinating how much money they're burning, right?[00:52:35] It's yeah, and we did a similar chart about healthcare, I think where the administrators are just doing this. Yes, I think when you have an organization that is owned by the administrators, then the administrators are just going to get more and more administrators into it. If the organization is too big to fail and there is not a meaningful competition, it's difficult to establish one.[00:52:54] Then it's going to create a big cost for society.[00:52:56] swyx: Actually, one more, I'll finish with this tweet. You have, you have just like a fantastic Twitter account by the way. A long while ago, you tweeted the Lebowski theorem. No superintelligent AI is going to bother with a task that is harder than hacking its reward function.[00:53:08] And I would posit the analogy for administrators: no administrator is going to bother with a task that is harder than just more fundraising[00:53:16] Joscha Bach: Yeah, I find, if you look at the real world, it's probably not a good idea to attribute to malice or incompetence what can be explained by people following their true incentives.[00:53:26] swyx: Perfect. Well, thank you so much. I think you're very naturally incentivized by growing community and giving your thought and insight to the rest of us. So thank you for taking this time.[00:53:35] Joscha Bach: Thank you very much. Get full access to Latent Space at www.latent.space/subscribe

The Nonlinear Library
LW - Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer by johnswentworth

The Nonlinear Library

Play Episode Listen Later Apr 18, 2024 10:39


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer, published by johnswentworth on April 18, 2024 on LessWrong. Yesterday Adam Shai put up a cool post which… well, take a look at the visual: Yup, it sure looks like that fractal is very noisily embedded in the residual activations of a neural net trained on a toy problem. Linearly embedded, no less. I (John) initially misunderstood what was going on in that post, but some back-and-forth with Adam convinced me that it really is as cool as that visual makes it look, and arguably even cooler. So David and I wrote up this post / some code, partly as an explainer for why on earth that fractal would show up, and partly as an explainer for the possibilities this work potentially opens up for interpretability. One sentence summary: when tracking the hidden state of a hidden Markov model, a Bayesian's beliefs follow a chaos game (with the observations randomly selecting the update at each time), so the set of such beliefs naturally forms a fractal structure. By the end of the post, hopefully that will all sound straightforward and simple. Background: Fractals and Symmetry Let's start with the famous Sierpinski Triangle: Looks qualitatively a lot like Shai's theoretically-predicted fractal, right? That's not a coincidence; we'll see that the two fractals can be generated by very similar mechanisms. The key defining feature of the Sierpinski triangle is that it consists of three copies of itself, each shrunken and moved to a particular spot: Mathematically: we can think of the Sierpinski triangle as a set of points in two dimensions (i.e. the blue points in the image). Call that set S. Then "the Sierpinski triangle consists of three copies of itself, each shrunken and moved to a particular spot" can be written algebraically as S = f1(S) ∪ f2(S) ∪ f3(S) where f1, f2, f3 are the three functions which "shrink and position" the three copies. (Conveniently, they are affine functions, i.e. linear transformations for the shrinking plus a constant vector for the positioning.) That equation, S = f1(S) ∪ f2(S) ∪ f3(S), expresses the set of points in the Sierpinski triangle as a function of that same set - in other words, the Sierpinski triangle is a fixed point of that equation. That suggests a way to (approximately) compute the triangle: to find a fixed point of a function, start with some ~arbitrary input, then apply the function over and over again. And indeed, we can use that technique to generate the Sierpinski triangle. Here's one standard visual way to generate the triangle: Notice that this is a special case of repeatedly applying S ↦ f1(S) ∪ f2(S) ∪ f3(S)! We start with the set of all the points in the initial triangle, then at each step we make three copies, shrink and position them according to the three functions, take the union of the copies, and then pass that set onwards to the next iteration. … but we don't need to start with a triangle. As is typically the case when finding a fixed point via iteration, the initial set can be pretty arbitrary. For instance, we could just as easily start with a square: … or even just some random points. They'll all converge to the same triangle. Point is: it's mainly the symmetry relationship S = f1(S) ∪ f2(S) ∪ f3(S) which specifies the Sierpinski triangle.
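To make the fixed-point iteration concrete, here is a minimal sketch in Python with NumPy; the vertex coordinates, starting set, and iteration count are illustrative choices, not anything taken from the post:

```python
import numpy as np

# Three affine maps, each shrinking the plane by 1/2 toward one vertex of a triangle.
# Any non-degenerate triangle works; these coordinates are just a convenient choice.
VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def apply_symmetry(points: np.ndarray) -> np.ndarray:
    """One step of S -> f1(S) ∪ f2(S) ∪ f3(S): three shrunken, repositioned copies."""
    return np.concatenate([(points + v) / 2.0 for v in VERTICES])

# Start from an ~arbitrary set (random points in the unit square) and iterate;
# the set converges to a finite approximation of the Sierpinski triangle.
points = np.random.rand(50, 2)
for _ in range(8):
    points = apply_symmetry(points)  # note: the number of points triples each step
```

Plotting `points` after a few iterations shows the triangle emerging regardless of the starting set, which is the "fixed point via iteration" idea described above.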
Other symmetries typically generate other fractals; for instance, this one generates a fern-like shape: Once we know the symmetry, we can generate the fractal by iterating from some ~arbitrary starting point. Background: Chaos Games There's one big problem with computationally generating fractals via the iterative approach in the previous section: the number of points explodes exponentially. For the Sierpinski triangle, we need to make three copies each iteration, so after n timesteps we'll be tracking 3^n times...
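The chaos game sidesteps that exponential blowup by tracking a single point and, at each step, randomly choosing which of the three maps to apply. A minimal sketch, using the same illustrative vertex choice as above; in the belief-state setting from the post's summary, each Bayesian update (one per possible observation) plays the role of one of the maps, and the observation sequence plays the role of the random choices:

```python
import numpy as np

VERTICES = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, np.sqrt(3) / 2]])

def chaos_game(n_steps: int, burn_in: int = 20, seed: int = 0) -> np.ndarray:
    """Track one point; at each step apply a randomly chosen map f_i."""
    rng = np.random.default_rng(seed)
    point = rng.random(2)                    # ~arbitrary starting point
    samples = []
    for t in range(n_steps):
        vertex = VERTICES[rng.integers(3)]   # random choice of which map to apply
        point = (point + vertex) / 2.0       # f_i: move halfway toward vertex i
        if t >= burn_in:                     # drop the first few points, before convergence
            samples.append(point.copy())
    return np.array(samples)

# 50,000 samples trace out the Sierpinski triangle with constant memory per step.
samples = chaos_game(50_000)
```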

The Nonlinear Library
AF - Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight by Sam Marks

The Nonlinear Library

Play Episode Listen Later Apr 18, 2024 21:38


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Discriminating Behaviorally Identical Classifiers: a model problem for applying interpretability to scalable oversight, published by Sam Marks on April 18, 2024 on The AI Alignment Forum. In a new preprint, Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models, my coauthors and I introduce a technique, Sparse Human-Interpretable Feature Trimming (SHIFT), which I think is the strongest proof-of-concept yet for applying AI interpretability to existential risk reduction.[1] In this post, I will explain how SHIFT fits into a broader agenda for what I call cognition-based oversight. In brief, cognition-based oversight aims to evaluate models according to whether they're performing intended cognition, instead of whether they have intended input/output behavior. In the rest of this post I will: Articulate a class of approaches to scalable oversight I call cognition-based oversight. Narrow in on a model problem in cognition-based oversight called Discriminating Behaviorally Identical Classifiers (DBIC). DBIC is formulated to be a concrete problem which I think captures most of the technical difficulty in cognition-based oversight. Explain SHIFT, the technique we introduce for DBIC. Discuss challenges and future directions, including concrete recommendations for two ways to make progress on DBIC. Overall, I think that making progress on DBIC is tractable with current interpretability techniques, and I'd be excited to see more work on it. Cognition-based oversight In this section, I describe a class of approaches to scalable oversight I call cognition-based oversight. First, I review a simplified formulation of scalable oversight: discriminating good vs. bad models. Then I impose a pessimistic assumption - that the good and bad models cannot be discriminated behaviorally - and propose cognition-based oversight as a potential remedy. Discriminating models: a simplification of scalable oversight In an ideal world, we could argue that AI systems will be safe by default as follows: We train AI systems to receive high scores according to some scoring function. The models that get the highest scores are all good models that robustly behave as desired. Therefore, we have strong a priori reason to expect our training processes to result in good models. Sadly, in the real world, the "scoring functions" we actually use tend to be shaped like "human evaluator approval." And "human evaluator approval" does not enjoy the property that "the models which humans evaluate most positively all robustly behave as desired." Two salient counterexamples:[2] Sycophantic reward hackers take actions that look good to the overseer, even when this conflicts with doing what the overseer actually wants. For example, suppose we deploy an AI smart vault to keep a diamond safe, as judged by whether the diamond continues to appear on the vault camera; instead of trying to prevent diamond theft, a sycophantic reward hacker might instead cover diamond theft up by tampering with the vault camera. Situationally-aware schemers behave well only when they think bad behavior would be caught. For example, a scheming code generation model might write vulnerability-free code only when it expects the code to be closely audited, but otherwise insert subtle backdoors. 
The models in these examples score highly according to human evaluator approval, so we can't use the argument above that they are unlikely to arise from training. We instead need to rely on arguments that these models are unlikely to arise despite scoring at least as well as good models. So far, I think these arguments are far from airtight, and I feel nervous about relying on them. Said differently, a core problem in technical AI safety is that it can be generally hard to discriminate good models that robustly do stuf...

The Nonlinear Library
AF - Sparsify: A mechanistic interpretability research agenda by Lee Sharkey

The Nonlinear Library

Play Episode Listen Later Apr 3, 2024 40:09


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparsify: A mechanistic interpretability research agenda, published by Lee Sharkey on April 3, 2024 on The AI Alignment Forum. Over the last couple of years, mechanistic interpretability has seen substantial progress. Part of this progress has been enabled by the identification of superposition as a key barrier to understanding neural networks (Elhage et al., 2022) and the identification of sparse autoencoders as a solution to superposition (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023). From our current vantage point, I think there's a relatively clear roadmap toward a world where mechanistic interpretability is useful for safety. This post outlines my views on what progress in mechanistic interpretability looks like and what I think is achievable by the field in the next 2+ years. It represents a rough outline of what I plan to work on in the near future. My thinking and work is, of course, very heavily inspired by the work of Chris Olah, other Anthropic researchers, and other early mechanistic interpretability researchers. In addition to sharing some personal takes, this article brings together - in one place - various goals and ideas that are already floating around the community. It proposes a concrete potential path for how we might get from where we are today in mechanistic interpretability to a world where we can meaningfully use it to improve AI safety. Key frameworks for understanding the agenda Framework 1: The three steps of mechanistic interpretability I think of mechanistic interpretability in terms of three steps: The three steps of mechanistic interpretability[1]: Mathematical description: In the first step, we break the neural network into constituent parts, where the parts are simply unlabelled mathematical objects. These may be e.g. neurons, polytopes, circuits, feature directions (identified using SVD/NMF/SAEs), individual parameters, singular vectors of the weight matrices, or other subcomponents of a network. Semantic description: Next, we generate semantic interpretations of the mathematical object (e.g. through feature labeling). In other words, we try to build a conceptual model of what each component of the network does. Validation: We need to validate our explanations to ensure they make good predictions about network behavior. For instance, we should be able to predict that ablating a feature with a purported 'meaning' (such as the 'noun gender feature') will have certain predictable effects that make sense given its purported meaning (such as the network becoming unable to assign the appropriate definite article to nouns). If our explanations can't be validated, then we need to identify new mathematical objects and/or find better semantic descriptions. The field of mechanistic interpretability has repeated this three-step cycle a few times, cycling through explanations given in terms of neurons, then other objects such as SVD/NMF directions or polytopes, and most recently SAE directions. My research over the last couple of years has focused primarily on identifying the right mathematical objects for mechanistic explanations. I expect there's still plenty of work to do on this step in the next two years or so (more on this later). To guide intuitions about how I plan to pursue this, it's important to understand what makes some mathematical objects better than others.
For this, we have to look at the description accuracy vs. description length tradeoff. Framework 2: The description accuracy vs. description length tradeoff You would feel pretty dissatisfied if you asked someone for a mechanistic explanation of a neural network and they proceeded to read out the float values of the weights. But why is this dissatisfying? Two reasons: When describing the mechanisms of any system, be it an engine, a solar system, o...
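Since the agenda leans on sparse autoencoders as the current candidate for the "right mathematical objects," here is a minimal sketch of one in PyTorch. This illustrates the general recipe (an overcomplete dictionary, a ReLU encoder, and an L1 sparsity penalty), not the exact architectures or hyperparameters used in the papers cited above; `d_model`, `d_dict`, and `l1_coeff` are placeholders:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE for decomposing model activations into sparse feature directions."""
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)   # d_dict >> d_model (overcomplete)
        self.decoder = nn.Linear(d_dict, d_model)   # decoder columns act as feature directions

    def forward(self, x: torch.Tensor):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the original activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):
    # Trades reconstruction accuracy against sparsity, a rough proxy for the
    # description accuracy vs. description length tradeoff discussed above.
    reconstruction = (x - x_hat).pow(2).mean()
    sparsity = f.abs().mean()
    return reconstruction + l1_coeff * sparsity
```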

The Nonlinear Library
AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems by Sonia Joseph

The Nonlinear Library

Play Episode Listen Later Mar 13, 2024 26:23


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems, published by Sonia Joseph on March 13, 2024 on The AI Alignment Forum. Join our Discord here. This article was written by Sonia Joseph, in collaboration with Neel Nanda, and incubated in Blake Richards's lab at Mila and in the MATS community. Thank you to the Prisma core contributors, including Praneet Suresh, Rob Graham, and Yash Vadi. Full acknowledgements of contributors are at the end. I am grateful to my collaborators for their guidance and feedback. Outline Part One: Introduction and Motivation Part Two: Tutorial Notebooks Part Three: Brief ViT Overview Part Four: Demo of Prisma's Functionality Key features, including logit attribution, attention head visualization, and activation patching. Preliminary research results obtained using Prisma, including emergent segmentation maps and canonical attention heads. Part Five: FAQ, including Key Differences between Vision and Language Mechanistic Interpretability Part Six: Getting Started with Vision Mechanistic Interpretability Part Seven: How to Get Involved Part Eight: Open Problems in Vision Mechanistic Interpretability Introducing the Prisma Library for Multimodal Mechanistic Interpretability I am excited to share with the mechanistic interpretability and alignment communities a project I've been working on for the last few months. Prisma is a multimodal mechanistic interpretability library based on TransformerLens, currently supporting vanilla vision transformers (ViTs) and their vision-text counterparts CLIP. With recent rapid releases of multimodal models, including Sora, Gemini, and Claude 3, it is crucial that interpretability and safety efforts remain in tandem. While language mechanistic interpretability already has strong conceptual foundations, many research papers, and a thriving community, research in non-language modalities lags behind. Given that multimodal capabilities will be part of AGI, field-building in mechanistic interpretability for non-language modalities is crucial for safety and alignment. The goal of Prisma is to make research in mechanistic interpretability for multimodal models both easy and fun. We are also building a strong and collaborative open source research community around Prisma. You can join our Discord here. This post includes a brief overview of the library, fleshes out some concrete problems, and gives steps for people to get started. Prisma Goals Build shared infrastructure (Prisma) to make it easy to run standard language mechanistic interpretability techniques on non-language modalities, starting with vision. Build shared conceptual foundation for multimodal mechanistic interpretability. Shape and execute on research agenda for multimodal mechanistic interpretability. Build an amazing multimodal mechanistic interpretability subcommunity, inspired by current efforts in language. Set the cultural norms of this subcommunity to be highly collaborative, curious, inventive, friendly, respectful, prolific, and safety/alignment-conscious. Encourage sharing of early/scrappy research results on Discord/Less Wrong. Co-create a web of high-quality research. Tutorial Notebooks To get started, you can check out three tutorial notebooks that show how Prisma works. 
Main ViT Demo: Overview of the main mechanistic interpretability techniques on a ViT, including direct logit attribution, attention head visualization, and activation patching. The activation patching switches the net's prediction from tabby cat to Border collie with a minimum ablation.
Emoji Logit Lens: Deeper dive into layer- and patch-level predictions with interactive plots.
Interactive Attention Head Tour: Deeper dive into the various types of attention heads a ViT contains with interactive JavaScript.
Brief ViT Overview A vision transf...
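As a rough illustration of what the activation patching mentioned in the Main ViT Demo means mechanically, here is a generic PyTorch-hook sketch. This is not Prisma's actual API; `model`, `model.blocks[6]`, and the image batches are hypothetical placeholders:

```python
import torch

def cache_activation(model: torch.nn.Module, layer: torch.nn.Module, x: torch.Tensor):
    """Run the model and cache `layer`'s output (the 'clean' activation)."""
    cache = {}
    def hook(_module, _inputs, output):
        cache["act"] = output.detach()
    handle = layer.register_forward_hook(hook)
    try:
        model(x)
    finally:
        handle.remove()
    return cache["act"]

def run_with_patch(model: torch.nn.Module, layer: torch.nn.Module,
                   x: torch.Tensor, patched_act: torch.Tensor) -> torch.Tensor:
    """Run the model on `x`, overwriting `layer`'s output with `patched_act`."""
    def hook(_module, _inputs, _output):
        return patched_act  # returning a value from a forward hook replaces the output
    handle = layer.register_forward_hook(hook)
    try:
        logits = model(x)
    finally:
        handle.remove()
    return logits

# Hypothetical usage: cache an activation from a Border collie image, patch it into
# the forward pass on a tabby cat image, and inspect how far the logits move.
# clean_act = cache_activation(model, model.blocks[6], border_collie_batch)
# patched_logits = run_with_patch(model, model.blocks[6], tabby_cat_batch, clean_act)
```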

The Gradient Podcast
Subbarao Kambhampati: Planning, Reasoning, and Interpretability in the Age of LLMs

The Gradient Podcast

Play Episode Listen Later Feb 8, 2024 119:03


In episode 110 of The Gradient Podcast, Daniel Bashir speaks to Professor Subbarao Kambhampati. Professor Kambhampati is a professor of computer science at Arizona State University. He studies fundamental problems in planning and decision making, motivated by the challenges of human-aware AI systems. He is a fellow of the Association for the Advancement of Artificial Intelligence, the American Association for the Advancement of Science, and the Association for Computing Machinery, and was an NSF Young Investigator. He was the president of the Association for the Advancement of Artificial Intelligence, a trustee of the International Joint Conference on Artificial Intelligence, and a founding board member of Partnership on AI.
Have suggestions for future podcast guests (or other feedback)? Let us know here or reach us at editor@thegradient.pub
Subscribe to The Gradient Podcast: Apple Podcasts | Spotify | Pocket Casts | RSS
Follow The Gradient on Twitter
Outline:
* (00:00) Intro
* (02:11) Professor Kambhampati's background
* (06:07) Explanation in AI
* (18:08) What people want from explanations—vocabulary and symbolic explanations
* (21:23) The realization of new concepts in explanation—analogy and grounding
* (30:36) Thinking and language
* (31:48) Conscious and subconscious mental activity
* (36:58) Tacit and explicit knowledge
* (42:09) The development of planning as a research area
* (46:12) RL and planning
* (47:47) What makes a planning problem hard?
* (51:23) Scalability in planning
* (54:48) LLMs do not perform reasoning
* (56:51) How to show LLMs aren't reasoning
* (59:38) External verifiers and backprompting LLMs
* (1:07:51) LLMs as cognitive orthotics, language and representations
* (1:16:45) Finding out what kinds of representations an AI system uses
* (1:31:08) "Compiling" system 2 knowledge into system 1 knowledge in LLMs
* (1:39:53) The Generative AI Paradox, reasoning and retrieval
* (1:43:48) AI as an ersatz natural science
* (1:44:03) Why AI is straying away from its engineering roots, and what constitutes engineering
* (1:58:33) Outro
Links:
* Professor Kambhampati's Twitter and homepage
* Research and Writing — Planning and Human-Aware AI Systems
* A Validation-structure-based theory of plan modification and reuse (1990)
* Challenges of Human-Aware AI Systems (2020)
* Polanyi vs. Planning (2021)
* LLMs and Planning
* Can LLMs Really Reason and Plan? (2023)
* On the Planning Abilities of LLMs (2023)
* Other
* Changing the nature of AI research
Get full access to The Gradient at thegradientpub.substack.com/subscribe

a16z
2024 Big Ideas: Miracle Drugs, Programmable Medicine, and AI Interpretability

a16z

Play Episode Listen Later Dec 8, 2023 65:01


Smart energy grids. Voice-first companion apps. Programmable medicines. AI tools for kids.
We asked over 40 partners across a16z to preview one big idea they believe will drive innovation in 2024. Here in our 3-part series, you'll hear directly from partners across all our verticals, as we dive even more deeply into these ideas. What's the why now? Who is already building in these spaces? What opportunities and challenges are on the horizon? And how can you get involved?
View all 40+ big ideas: https://a16z.com/bigideas2024
Stay Updated:
Find a16z on Twitter: https://twitter.com/a16z
Find a16z on LinkedIn: https://www.linkedin.com/company/a16z
Subscribe on your favorite podcast app: https://a16z.simplecast.com/
Follow our host: https://twitter.com/stephsmithio
Please note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.
We asked over 40 partners across a16z to preview one big ideathey believe will drive innovation in 2024.Here in our 3-part series, you'll hear directly from partners across all our verticals, as we dive even more deeply into these ideas. What's the why now? Who is already building in these spaces? What opportunities and challenges are on the horizon? And how can you get involved?View all 40+ big ideas: https://a16z.com/bigideas2024 Stay Updated: Find a16z on Twitter: https://twitter.com/a16zFind a16z on LinkedIn: https://www.linkedin.com/company/a16zSubscribe on your favorite podcast app: https://a16z.simplecast.com/Follow our host: https://twitter.com/stephsmithioPlease note that the content here is for informational purposes only; should NOT be taken as legal, business, tax, or investment advice or be used to evaluate any investment or security; and is not directed at any investors or potential investors in any a16z fund. a16z and its affiliates may maintain investments in the companies discussed. For more details please see a16z.com/disclosures.