Podcasts about SAEs

  • 112 podcasts
  • 322 episodes
  • 37m avg. duration
  • 1 monthly new episode
  • Latest: May 15, 2025

Popularity (chart, 2017–2024)


Best podcasts about SAEs

Latest podcast episodes about SAEs

Take A T.O. With Turner And O'Neill
The DMV Hoops Podcast | "SAES Lions With Coach Kevin Jones" | 5.15.2025

May 15, 2025 | 46:07


Welcome to Episode 3 of The DMV Hoops Podcast. This week, we welcome Coach Kevin Jones, the Head Coach of St. Andrew's Episcopal High School boys basketball, to the podcast! Coach Jones talks about the coaching transition from public to private high school hoops, the people he surrounds himself with in the pursuit of his success, and the development of their scholar athletes.

In this episode:
  • Providing the right resources on & off the court
  • The coaching camaraderie in the DMV
  • Having "basketball people" in his village

Listen to all of this & more in this week's episode of "The DMV Hoops Podcast." Kurt Cross - Producer & Host | Adam Crain - On Air Talent | IG @dmvhoopspodcast

High 5 Adventure - The Podcast
Future Farmers of America (FFA) | Dr. Travis Park

Apr 15, 2025 | 25:34


"Learning to do, doing to learn"   Phil, alongside guest host Jamie Thibodeau, is joined by Dr. Travis Park to explore the National FFA Organization's mission and its connection to experiential education. Travis discusses the importance of agricultural education in developing leadership, personal growth, and career success among students. The discussion highlights the role of experiential learning in FFA programs, the leadership development opportunities available to students, and the empowerment of youth through peer leadership. The conversation concludes with insights on collaboration between FFA and experiential education organizations. FFA is an agricultural leadership organization for students. The mission of FFA is to develop leadership and career success. Experiential education is integral to FFA's teaching methods. Students engage in supervised agricultural experiences (SAEs). Peer leadership is a key component of FFA's structure. FFA chapters empower students to lead their peers. Leadership development occurs through conferences and workshops. FFA provides opportunities for networking and mentorship. Agriculture teachers play a crucial role in student development. Collaboration between FFA and experiential education can enhance learning. Learn more about the FFA - https://www.ffa.org/ Connect with Phil; Email - podcast@high5adventure.org Instagram - https://www.instagram.com/verticalplaypen/ Donate to the podcast - verticalplaypen.org Music and sound effects - epidemicsound.com  

LessWrong Curated Podcast
“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

Apr 12, 2025 | 57:32


Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda (* = equal contribution)

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn't consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.

TL;DR: To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts [...]

Outline: (01:08) TL;DR; (02:38) Introduction; (02:41) Motivation; (06:09) Our Task; (08:35) Conclusions and Strategic Updates; (13:59) Comparing different ways to train Chat SAEs; (18:30) Using SAEs for OOD Probing; (20:21) Technical Setup; (20:24) Datasets; (24:16) Probing; (26:48) Results; (30:36) Related Work and Discussion; (34:01) Is it surprising that SAEs didn't work?; (39:54) Dataset debugging with SAEs; (42:02) Autointerp and high frequency latents; (44:16) Removing High Frequency Latents from JumpReLU SAEs; (45:04) Method; (45:07) Motivation; (47:29) Modifying the sparsity penalty; (48:48) How we evaluated interpretability; (50:36) Results; (51:18) Reconstruction loss at fixed sparsity; (52:10) Frequency histograms; (52:52) Latent interpretability; (54:23) Conclusions; (56:43) Appendix

The original text contained 7 footnotes which were omitted from this narration. First published: March 26th, 2025. Source: https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
Narrated by TYPE III AUDIO.
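Since the update centres on using SAE latents for out-of-distribution probing, here is a minimal, hypothetical sketch of that kind of comparison: fit a linear probe in-distribution and score it on an OOD split, once on raw residual-stream activations and once on SAE latents. The function names, the stand-in encoder, and the synthetic data are illustrative assumptions, not the GDM team's actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ood_probe_score(train_x, train_y, ood_x, ood_y):
    """Fit a logistic-regression probe in-distribution, score it on an OOD split."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(train_x, train_y)
    return probe.score(ood_x, ood_y)

def compare_raw_vs_sae(encode, acts_train, y_train, acts_ood, y_ood):
    """`encode` maps activations [n, d_model] -> SAE latents [n, d_sae]."""
    return {
        "raw_activations": ood_probe_score(acts_train, y_train, acts_ood, y_ood),
        "sae_latents": ood_probe_score(encode(acts_train), y_train, encode(acts_ood), y_ood),
    }

# Tiny synthetic demo (random data, random stand-in encoder) just to show the call shape.
rng = np.random.default_rng(0)
acts_tr, acts_ood = rng.normal(size=(200, 64)), rng.normal(size=(100, 64))
y_tr, y_ood = rng.integers(0, 2, 200), rng.integers(0, 2, 100)
W = rng.normal(size=(64, 256))
fake_encode = lambda a: np.maximum(a @ W, 0.0)   # hypothetical SAE encoder
print(compare_raw_vs_sae(fake_encode, acts_tr, y_tr, acts_ood, y_ood))
```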

Proactive - Interviews for investors
Clinical study SKNJCT-003 shows positive trends as company advances skin cancer research in UAE

Mar 6, 2025 | 7:27


Medicus Pharma CEO Dr. Raza Bokhari joined Steve Darling from Proactive to share updates on the ongoing SKNJCT-003 clinical study, which is being conducted at nine sites across the United States and aims to randomize 60 patients. An interim analysis conducted after enrolling more than half of the targeted participants has shown encouraging results. The study reports a clinical clearance rate exceeding 60%, with the investigational therapy demonstrating strong tolerability across both tested dosage levels. Notably, no dose-limiting toxicities (DLTs) or serious adverse events (SAEs) have been observed, and there have been no systemic effects or clinically significant abnormalities in laboratory results, ECGs, vital signs, or physical exams. These promising findings will be submitted to the U.S. Food and Drug Administration (FDA) as part of a Type C meeting request in Q2 2025. In parallel, the company is advancing its clinical initiatives for non-invasive basal cell carcinoma treatment with a newly proposed study. This randomized, double-blind, placebo-controlled, multi-center trial has been submitted for approval to the UAE Department of Health. The study aims to enroll up to 36 patients at prominent medical institutions, including Cleveland Clinic Abu Dhabi, Sheikh Shakbout Medical City, Burjeel Medical City, and the American Hospital of Dubai. Participants will be assigned in a 1:1:1 ratio to receive either a placebo or one of two dosage levels of the experimental treatment. This expansion into the UAE represents a significant step forward in the company's mission to develop innovative, non-invasive therapies for skin cancer, broadening its global clinical footprint and reinforcing its commitment to advancing patient care. #proactiveinvestors #nasdaq #mdcx #tsxv #mdcx #pharma #Biotech #CancerTreatment #ClinicalTrials #FDAApproval #SkinCancer #HealthcareInnovation #Investing #MedicalResearch

LessWrong Curated Podcast
“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq

Jan 10, 2025 | 15:56


TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neural networks by decomposing their individual activation spaces in isolation: it seems likely to find features of the activations (features that help explain the statistical structure of activation spaces) rather than features of the model (the features the model's own computations make use of).

Written at Apollo Research.

Introduction. Claim: Activation space interpretability is likely to give us features of the activations, not features of the model, and this is a problem. Let's walk through this claim. What do we mean by activation space interpretability? Interpretability work that attempts to understand neural networks by explaining the inputs and outputs of their layers in isolation. In this post, we focus in particular on the problem of decomposing activations, via techniques such as sparse autoencoders (SAEs), PCA, or just by looking at individual neurons. This [...]

Outline: (00:33) Introduction; (02:40) Examples illustrating the general problem; (12:29) The general problem; (13:26) What can we do about this?

The original text contained 11 footnotes which were omitted from this narration. First published: January 8th, 2025. Source: https://www.lesswrong.com/posts/gYfpPbww3wQRaxAFD/activation-space-interpretability-may-be-doomed
Narrated by TYPE III AUDIO.

Owens Recovery Science
69 Intermittent Claudication

Oct 29, 2024 | 60:07


Chief paper discussed:
  • T Parkington, T Maden-Wilkinson, D Broom, S Nawaz... (2023). Low-Intensity Resistance Exercise with Blood Flow Restriction for Patients with Claudication: A Randomised Controlled Feasibility Trial. Vascular Medicine.
Position statement on managing PAD:
  • Askew, C. D., Parmenter, B., Leicht, A. S., Walker, P. J., & Golledge, J. (2014). Exercise & Sports Science Australia (ESSA) position statement on exercise prescription for patients with peripheral arterial disease and intermittent claudication. Journal of Science and Medicine in Sport / Sports Medicine Australia, 17(6), 623–629.
Additional papers referenced:
  • Bentzen, A., Nisgaard, L. B., Mikkelsen, R. B. L., Høgh, A., Mechlenburg, I., & Jørgensen, S. L. (2023). Blood flow restricted walking in patients suffering from intermittent claudication: a case series feasibility and safety study. Annals of Medicine and Surgery (2012), 85(5), 1430–1435.
  • Saes, G. F., Zerati, A. E., Wolosker, N., Ragazzo, L., Rosoky, R. M. A., Ritti-Dias, R. M., Cucato, G. G., Chehuen, M., Farah, B. Q., & Puech-Leão, P. (2013). Remote ischemic preconditioning in patients with intermittent claudication. Clinics, 68(4), 495–499.
  • Ahmed, K. M., Hernon, S., Mohamed, S., Tubassum, M., Newell, M., & Walsh, S. R. (2018). Remote ischemic preconditioning in the management of intermittent claudication: a pilot randomized controlled trial. Annals of Vascular Surgery. https://doi.org/10.1016/j.avsg.2018.07.046
Podcast w/ Jamie Burr we referenced: https://owensrecoveryscience.com/podcasts/owens-recovery-science-podcast-bfr-ipc-for-performance-rehab-and-health-w-jamie-burr-phd

Owl Pellets: Tips for Ag Teachers
Implementing Middle School SAEs

Sep 24, 2024 | 19:31


Middle schoolers are developmentally different, which requires us to think about agricultural education program implementation a little differently as well. Join the team as we chat with Chris Eck from Oklahoma State University to learn more about the opportunity and responsibility to integrate an intercurricular program (and especially SAEs) for middle schoolers.

Journal article: https://jae-online.org/index.php/jae/article/view/158

The Nonlinear Library
AF - Showing SAE Latents Are Not Atomic Using Meta-SAEs by Bart Bussmann

Aug 24, 2024 | 35:53


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Showing SAE Latents Are Not Atomic Using Meta-SAEs, published by Bart Bussmann on August 24, 2024 on The AI Alignment Forum.

Bart, Michael and Patrick are joint first authors. Research conducted as part of MATS 6.0 in Lee Sharkey and Neel Nanda's streams. Thanks to Mckenna Fitzgerald and Robert Krzyzanowski for their feedback!

TL;DR: Sparse Autoencoder (SAE) latents have been shown to typically be monosemantic (i.e. correspond to an interpretable property of the input). It is sometimes implicitly assumed that they are therefore atomic, i.e. simple, irreducible units that make up the model's computation. We provide evidence against this assumption by finding sparse, interpretable decompositions of SAE decoder directions into seemingly more atomic latents, e.g. Einstein -> science + famous + German + astronomy + energy + starts with E. We do this by training meta-SAEs, an SAE trained to reconstruct the decoder directions of a normal SAE. We argue that, conceptually, there's no reason to expect SAE latents to be atomic - when the model is thinking about Albert Einstein, it likely also thinks about Germanness, physicists, etc. Because Einstein always entails those things, the sparsest solution is to have the Albert Einstein latent also boost them.

Key results:
  • SAE latents can be decomposed into more atomic, interpretable meta-latents.
  • We show that when latents in a larger SAE have split out from latents in a smaller SAE, a meta-SAE trained on the larger SAE often recovers this structure.
  • We demonstrate that meta-latents allow for more precise causal interventions on model behavior than SAE latents on a targeted knowledge-editing task.
  • We believe that the alternate, interpretable decomposition using meta-SAEs casts doubt on the implicit assumption that SAE latents are atomic.
  • We show preliminary results that meta-SAE latents have significant overlap with latents in a normal SAE of the same size, but may relate differently to the larger SAEs used in meta-SAE training.
  • We made a dashboard that lets you explore meta-SAE latents.

Terminology: Throughout this post we use "latents" to describe the concrete components of the SAE's dictionary, whereas "feature" refers to the abstract concepts, following Lieberum et al.

Introduction: Mechanistic interpretability (mech interp) attempts to understand neural networks by breaking down their computation into interpretable components. One of the key challenges of this line of research is the polysemanticity of neurons, meaning they respond to seemingly unrelated inputs. Sparse autoencoders (SAEs) have been proposed as a method for decomposing model activations into sparse linear sums of latents. Ideally, these latents should be monosemantic, i.e. respond to inputs that clearly share a similar meaning (implicitly, from the perspective of a human interpreter). That is, a human should be able to reason about the latents both in relation to the features to which they are associated, and also use the latents to better understand the model's overall behavior. There is a popular notion, both implicitly in related work on SAEs within mech interp and explicitly by the use of the term "atom" in sparse dictionary learning as a whole, that SAE features are atomic or can be "true features". However, monosemanticity does not imply atomicity. Consider the example of shapes of different colors: the set of shapes is [circle, triangle, square], and the set of colors is [white, red, green, black], each of which is represented with a linear direction. 'Red triangle' represents a monosemantic feature, but not an atomic feature, as it can be decomposed into red and triangle. It has been shown that sufficiently wide SAEs on toy models will learn 'red triangle', rather than representing 'red' and 'triangle' with separate latents. Furthermore, whilst one may naively re...
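The core move the post describes (training a second, "meta" SAE whose inputs are the decoder directions of a base SAE) can be sketched roughly as below. The ReLU-plus-L1 architecture, the dimensions, and the randomly initialised stand-in for a trained base SAE are illustrative assumptions rather than the authors' exact setup.

```python
import torch
import torch.nn as nn

class TinySAE(nn.Module):
    """A bare-bones ReLU SAE: encode, sparsify via an L1 penalty during training, decode."""
    def __init__(self, d_in: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_in, d_latent)
        self.dec = nn.Linear(d_latent, d_in, bias=False)

    def forward(self, x):
        z = torch.relu(self.enc(x))          # latent activations
        return self.dec(z), z

d_model, d_sae, d_meta = 768, 24576, 2304
base_sae = TinySAE(d_model, d_sae)           # stand-in for an already-trained SAE
meta_sae = TinySAE(d_model, d_meta)

# Dataset for the meta-SAE: one row per base-SAE latent, namely its (normalised) decoder direction.
decoder_dirs = base_sae.dec.weight.T.detach()                   # [d_sae, d_model]
decoder_dirs = decoder_dirs / decoder_dirs.norm(dim=-1, keepdim=True)

opt = torch.optim.Adam(meta_sae.parameters(), lr=3e-4)
for step in range(200):
    batch = decoder_dirs[torch.randint(0, d_sae, (4096,))]
    recon, z = meta_sae(batch)
    loss = (recon - batch).pow(2).mean() + 1e-3 * z.abs().mean()  # MSE + L1 sparsity
    opt.zero_grad()
    loss.backward()
    opt.step()
```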

The SPARC Podcast
E79: Mike Saes, BRIDGE RUNNERS

Aug 21, 2024 | 47:30


In this episode we've got MIKE SAES on with us – one of the most important figures in run culture! Mike's the founder of BRIDGE RUNNERS and Bridge The Gap, and pretty much the Godfather of Run Crews. Follow @mikesaes and @bridgerunners

The Nonlinear Library
AF - Calendar feature geometry in GPT-2 layer 8 residual stream SAEs by Patrick Leask

Aug 17, 2024 | 7:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Calendar feature geometry in GPT-2 layer 8 residual stream SAEs, published by Patrick Leask on August 17, 2024 on The AI Alignment Forum.

TL;DR: We demonstrate that the decoder directions of GPT-2 SAEs are highly structured by finding a historical date direction onto which projecting non-date related features lets us read off their historical time period by comparison to year features.

Calendar years are linear: there are as many years between 2000 and 2024 as there are between 1800 and 1824. Linear probes can be used to predict years of particular events from the activations of language models. Since calendar years are linear, one might think the same of other time-based features such as weekday features; however, weekday activations in sparse autoencoders (SAEs) were recently found to be arranged in a circular configuration in their top principal components. Inspired by this, we looked into weekdays, months, and most interestingly calendar years from the perspective of SAE feature decoder similarity. For each group of calendar features, we found interesting patterns of feature splitting between sparse autoencoders of different sizes. For calendar years, we found a timeline direction that meaningfully ordered events, individuals, and concepts with respect to their historical period, which furthermore does not correspond to a principal component of the decoder directions. Finally, we introduce a simple method for finding some of these interpretable directions.

Features at different scales: We started by replicating the weekday results by performing PCA on the decoder directions of features that had high activations when prompted with days of the week, using the same GPT-2 SAEs as in this post, ranging from 768 to 98304 features. In the 768 feature SAE, we found a single weekday feature that activated strongly on all days of the week. In the largest SAE, we found 10 weekday features, 3 of which activated on all days of the week, with the remaining 7 activating on a single day of the week each. We found a group of features that activate primarily on specific days of the week by taking the top 20 activating samples for each feature and checking that the max activating token in each of these samples was the specific weekday. We found the first two principal components for this set of features, and projected the features that activate on any day or number of days from all SAEs onto these directions. The labeled features are those that activate on a single day across all SAEs, with the multi-day features unlabeled to maintain legibility. The smallest SAE (blue) has a single feature that activates on all weekday tokens, and lies near the mean of all the weekday features. The largest SAEs learn features for each day of the week, plus additional multi-day features. Across SAE sizes, the single day features form clusters. In each of these examples, the smallest SAE has a single feature that splits into many specific features that seem of roughly the same importance.

With calendar years, however, the situation is more complex. The same method of finding the principal components for single year features between 1900 and 2020 only succeeds for a few 21st century features, and nothing from the 20th century. There is also a group of single year features in a smaller SAE in the center of the plot, suggesting these principal components do not explain variance in them. The plot below shows the years for which each of the features is active, with the x-axis being years from 1950 to 2020, the y-axis being separate features, and the colored bars indicating the periods of years for which that feature is active. Only in the largest SAEs do you see more than a few single calendar year features, with most of the features activating on ranges of years, or other patterns such as the start and end...
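The PCA-on-decoder-directions analysis described above can be sketched as follows. The placeholder data, dimensions, and helper name are assumptions; in the real analysis the rows would be decoder directions of latents whose max-activating token is a specific weekday.

```python
import numpy as np

def top_principal_components(decoder_dirs: np.ndarray, k: int = 2) -> np.ndarray:
    """decoder_dirs: [n_latents, d_model]; returns the top-k principal components as [k, d_model]."""
    centered = decoder_dirs - decoder_dirs.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]

# `weekday_dirs` should hold decoder rows of weekday-activating latents; random data stands in here.
rng = np.random.default_rng(0)
weekday_dirs = rng.normal(size=(10, 768))
pcs = top_principal_components(weekday_dirs, k=2)
coords = weekday_dirs @ pcs.T        # 2-D coordinates to plot, one point per latent
print(coords.shape)                  # (10, 2)
```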

The Nonlinear Library
AF - Extracting SAE task features for ICL by Dmitrii Kharlapenko

Aug 12, 2024 | 17:20


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Extracting SAE task features for ICL, published by Dmitrii Kharlapenko on August 12, 2024 on The AI Alignment Forum.

TL;DR: We try to study task vectors in the SAE basis. This is challenging because there is no canonical way to convert an arbitrary vector in the residual stream to a linear combination of SAE features - you can't just pass an arbitrary vector through the encoder without going off distribution. We explored the algorithm of gradient pursuit suggested in Smith et al, but it didn't work for us without modifications. Our approach is to apply the SAE encoder to the task vector, and then apply a gradient-based cleanup. This exploits the fact that task vectors have a differentiable objective. We find that this gives a sparser and cleaner reconstruction, which is also highly interpretable, and also serves as a better task vector due to directly optimizing for log likelihood. This takes us from ~100 active features to ~10. Using our algorithm, we find two classes of SAE features involved in ICL. One of them recognizes the exact tasks or output formats from the examples, and another one encodes the tasks for execution by the model later on. We show that steering with these features has causal effects similar to task vectors. This work was produced as part of the ML Alignment & Theory Scholars Program - Summer 24 Cohort, under mentorship from Neel Nanda and Arthur Conmy.

Prior work: Task or function vectors are internal representations of some task that LLMs form while processing an ICL prompt. They can be extracted from a model running on a few-shot prompt and then be used to make it complete the same task without having any prior context or task description. Several papers (Function vectors in large language models, In-Context Learning Creates Task Vectors) have proposed different ways to extract those task vectors. They all center around feeding ICL examples to a model in the form "input output, … " and averaging the residuals on the "separator" token over a batch. This approach can reconstruct some part of the ICL performance but does not admit a straightforward conversion to the SAE basis. ITO with gradient pursuit can be used to do a sparse coding of a residual vector using SAE features. The post suggests using this algorithm for steering vector SAE decomposition. Since task vectors can be thought of as steering vectors, ITO may provide some insight into the ways they operate.

Initial Phi-3 experiments - direct SAE task vector reconstruction: In our study we trained a set of gated SAEs for Phi-3 Mini 3.8B using a model-generated synthetic instruction dataset. While offering a sparse dictionary decomposition of residuals, SAEs tend to introduce a reconstruction error that impacts the performance of the model. They also have no guarantee of being able to decompose out-of-distribution vectors, and task vectors, being a product of averaging activations across prompts and tokens, may be such vectors. Thus, we first studied the performance of SAE reconstructions of task vectors in transferring the definition of two tasks: 1) antonym generation and 2) English to Spanish word translation. These and other tasks used to study task vectors were taken from the ICL task vectors paper github repository.

These charts show the NLL loss of the model on the evaluation set of zero-shot prompts for both of the tasks depending on the layer of extraction/insertion:
  • TV: the original task vector performance;
  • Recon of TV: using the SAE reconstruction of the task vector instead of the task vector;
  • TV on recon: first doing an SAE reconstruction of the residuals and then collecting a task vector on them;
  • ITO: the ITO algorithm with a target L0 of 40.
It can be seen from the charts that SAE reconstruction significantly decrea...
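As a rough illustration of the first step described above (averaging separator-token residuals into a task vector and naively encoding it with the SAE), here is a hedged sketch. The gradient-based cleanup against the model's log likelihood, which the post relies on, is omitted, and every module and tensor below is a placeholder rather than the authors' code.

```python
import torch
import torch.nn as nn

d_model, d_sae = 3072, 16384
sae_encoder = nn.Linear(d_model, d_sae)          # stand-in for a trained SAE encoder

# residuals_at_separator: [n_prompts, d_model] activations cached at the
# "separator" token of each few-shot prompt (random placeholder data here).
residuals_at_separator = torch.randn(64, d_model)
task_vector = residuals_at_separator.mean(dim=0)       # average over prompts

latents = torch.relu(sae_encoder(task_vector))         # naive (noisy) SAE decomposition
top_vals, top_idx = latents.topk(10)                   # keep the ~10 strongest latents
print(list(zip(top_idx.tolist(), [round(v, 3) for v in top_vals.tolist()])))
```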

The Nonlinear Library
LW - You can remove GPT2's LayerNorm by fine-tuning for an hour by StefanHex

Aug 8, 2024 | 19:02


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You can remove GPT2's LayerNorm by fine-tuning for an hour, published by StefanHex on August 8, 2024 on LessWrong.

This work was produced at Apollo Research, based on initial research done at MATS. LayerNorm is annoying for mechanistic interpretability research ("[...] reason #78 for why interpretability researchers hate LayerNorm" - Anthropic, 2023). Here's a Hugging Face link to a GPT2-small model without any LayerNorm. The final model is only slightly worse than a GPT2 with LayerNorm[1]:

Dataset | Original GPT2 | Fine-tuned GPT2 with LayerNorm | Fine-tuned GPT2 without LayerNorm
OpenWebText (ce_loss) | 3.095 | 2.989 | 3.014 (+0.025)
ThePile (ce_loss) | 2.856 | 2.880 | 2.926 (+0.046)
HellaSwag (accuracy) | 29.56% | 29.82% | 29.54%

I fine-tuned GPT2-small on OpenWebText while slowly removing its LayerNorm layers, waiting for the loss to go back down after each removal.

Introduction: LayerNorm (LN) is a component in Transformer models that normalizes embedding vectors to have constant length; specifically it divides the embeddings by their standard deviation taken over the hidden dimension. It was originally introduced to stabilize and speed up training of models (as a replacement for batch normalization). It is active during training and inference. The equation includes the standard-deviation term √(Var[x]+ε), which makes it a non-linear operation. This hinders interpretability in a variety of ways, from annoyances and inaccuracies when attributing residual stream directions to logit effects (e.g. SAE features, direct logit attribution),[2] to being annoying to deal with in Attribution Patching, to being difficult to handle in Apollo's LIB method. In the Docstring circuit analysis we seriously considered whether the model might be using LN in its algorithm. This post even shows that LN can be used as the sole non-linearity to solve non-linear classification problems (see also this related work). Recently, with progress in Sparse Dictionary Learning, agendas (e.g. this one) imagine decomposing networks into sets of sparsely connected components (SAEs, Transcoders, etc.). A core difficulty to "putting it all together" is that the interactions between different components often route through LayerNorm, whose effect we do not understand.

Motivation: It would be pretty neat to have an LLM that still works (speaks English etc.) with fewer or no LN layers. One option would be to train a model without LN from scratch (done for tiny models, e.g. TinyModel), but this is very hard or impossible for larger models (hearsay is that you need a low learning rate and to be very careful). Taking an existing model and removing the LN layers, however, seems doable if LN isn't implementing some important computation.[3] That is, LN "does its thing" and the model has learned to "deal with it", but it's not irreplaceable. A reason to be optimistic is that the spread of standard deviations across different samples isn't that large, so maybe replacing the LN-computed standard deviation with a fixed number might kinda work.

Method: I take GPT2-small, fine-tune it on OpenWebText, and remove LNs one-by-one while fine-tuning. The only non-linear operation in an LN layer is the division by the standard deviation (std) of the embedding vectors; the remaining operations can be absorbed into later weight matrices (see the fold_ln option in TransformerLens; also discussed in this appendix). Thus I mainly focus on the std part here. My general strategy is to "remove" an LN layer (this makes the loss go up), and then to train the model for some time (on the original training data) until the loss is back near the baseline. For this "remove" step I do the following:
  • Calculate the average std on the dataset (I used a quite small sample, 16 prompts), separately for position 0 and position > 0
  • Replace the std calculation with the average std...
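A minimal sketch of the std-replacement step described above, under clearly stated assumptions: given a LayerNorm module and a small sample of cached activations, measure the dataset-average std and swap the per-input std for that constant (after which the division is linear and could later be folded into adjacent weights). Module and variable names are hypothetical, and the post's separate handling of position 0 versus positions > 0 is skipped for brevity.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fixed_std_layernorm(ln: nn.LayerNorm, sample_acts: torch.Tensor) -> nn.Module:
    """Return a module mimicking `ln`, but dividing by a dataset-average std."""
    avg_std = sample_acts.std(dim=-1, unbiased=False).mean()

    class FixedStdLN(nn.Module):
        def forward(self, x):
            x = x - x.mean(dim=-1, keepdim=True)
            # Fixed divisor: this is now a linear map, foldable into later weights.
            return x / avg_std * ln.weight + ln.bias

    return FixedStdLN()

# Usage sketch: patch one LN using a small sample of cached activations.
ln = nn.LayerNorm(768)
sample = torch.randn(16, 128, 768)      # placeholder for 16 prompts of residual activations
patched = fixed_std_layernorm(ln, sample)
out = patched(sample)                   # same shape, fixed-std "LayerNorm"
```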

The Nonlinear Library
AF - You can remove GPT2's LayerNorm by fine-tuning for an hour by Stefan Heimersheim

Aug 8, 2024 | 19:03


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: You can remove GPT2's LayerNorm by fine-tuning for an hour, published by Stefan Heimersheim on August 8, 2024 on The AI Alignment Forum.

This work was produced at Apollo Research, based on initial research done at MATS. LayerNorm is annoying for mechanistic interpretability research ("[...] reason #78 for why interpretability researchers hate LayerNorm" - Anthropic, 2023). Here's a Hugging Face link to a GPT2-small model without any LayerNorm. The final model is only slightly worse than a GPT2 with LayerNorm[1]:

Dataset | Original GPT2 | Fine-tuned GPT2 with LayerNorm | Fine-tuned GPT2 without LayerNorm
OpenWebText (ce_loss) | 3.095 | 2.989 | 3.014 (+0.025)
ThePile (ce_loss) | 2.856 | 2.880 | 2.926 (+0.046)
HellaSwag (accuracy) | 29.56% | 29.82% | 29.54%

I fine-tuned GPT2-small on OpenWebText while slowly removing its LayerNorm layers, waiting for the loss to go back down after each removal.

Introduction: LayerNorm (LN) is a component in Transformer models that normalizes embedding vectors to have constant length; specifically it divides the embeddings by their standard deviation taken over the hidden dimension. It was originally introduced to stabilize and speed up training of models (as a replacement for batch normalization). It is active during training and inference. The equation includes the standard-deviation term √(Var[x]+ε), which makes it a non-linear operation. This hinders interpretability in a variety of ways, from annoyances and inaccuracies when attributing residual stream directions to logit effects (e.g. SAE features, direct logit attribution),[2] to being annoying to deal with in Attribution Patching, to being difficult to handle in Apollo's LIB method. In the Docstring circuit analysis we seriously considered whether the model might be using LN in its algorithm. This post even shows that LN can be used as the sole non-linearity to solve non-linear classification problems (see also this related work). Recently, with progress in Sparse Dictionary Learning, agendas (e.g. this one) imagine decomposing networks into sets of sparsely connected components (SAEs, Transcoders, etc.). A core difficulty to "putting it all together" is that the interactions between different components often route through LayerNorm, whose effect we do not understand.

Motivation: It would be pretty neat to have an LLM that still works (speaks English etc.) with fewer or no LN layers. One option would be to train a model without LN from scratch (done for tiny models, e.g. TinyModel), but this is very hard or impossible for larger models (hearsay is that you need a low learning rate and to be very careful). Taking an existing model and removing the LN layers, however, seems doable if LN isn't implementing some important computation.[3] That is, LN "does its thing" and the model has learned to "deal with it", but it's not irreplaceable. A reason to be optimistic is that the spread of standard deviations across different samples isn't that large, so maybe replacing the LN-computed standard deviation with a fixed number might kinda work.

Method: I take GPT2-small, fine-tune it on OpenWebText, and remove LNs one-by-one while fine-tuning. The only non-linear operation in an LN layer is the division by the standard deviation (std) of the embedding vectors; the remaining operations can be absorbed into later weight matrices (see the fold_ln option in TransformerLens; also discussed in this appendix). Thus I mainly focus on the std part here. My general strategy is to "remove" an LN layer (this makes the loss go up), and then to train the model for some time (on the original training data) until the loss is back near the baseline. For this "remove" step I do the following:
  • Calculate the average std on the dataset (I used a quite small sample, 16 prompts), separately for position 0 and position > 0
  • Replace the std calculatio...

The Nonlinear Library
AF - The 'strong' feature hypothesis could be wrong by lewis smith

Aug 2, 2024 | 31:14


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lewis smith on August 2, 2024 on The AI Alignment Forum.

NB. I am on the Google Deepmind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.

"It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" - Elhage et al., Toy Models of Superposition

Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. It's been conventionally understood that there are two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in its representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis).

While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. Here are two importantly different ways to parse this:
1. (Weak LRH) some or many features used by neural networks are represented as atoms in representation space
2. (Strong LRH) all (or the vast majority of) features used by neural networks are represented by atoms.
The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature and the proportion of the model we expect to yield to analysis, but I think that the distinction between just a weak and strong form is clear enough to work with.

I think that in addition to the acknowledged assumptions of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage, these assumptions have different implications. For instance, if one thinks of 'feat...

The Nonlinear Library
LW - The 'strong' feature hypothesis could be wrong by lsgos

Aug 2, 2024 | 30:53


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: The 'strong' feature hypothesis could be wrong, published by lsgos on August 2, 2024 on LessWrong.

NB. I am on the Google Deepmind language model interpretability team. But the arguments/views in this post are my own, and shouldn't be read as a team position.

"It would be very convenient if the individual neurons of artificial neural networks corresponded to cleanly interpretable features of the input. For example, in an "ideal" ImageNet classifier, each neuron would fire only in the presence of a specific visual feature, such as the color red, a left-facing curve, or a dog snout" - Elhage et al., Toy Models of Superposition

Recently, much attention in the field of mechanistic interpretability, which tries to explain the behavior of neural networks in terms of interactions between lower level components, has been focussed on extracting features from the representation space of a model. The predominant methodology for this has used variations on the sparse autoencoder, in a series of papers inspired by Elhage et al.'s model of superposition. Conventionally, there are understood to be two key theories underlying this agenda. The first is the 'linear representation hypothesis' (LRH), the hypothesis that neural networks represent many intermediates or variables of the computation (such as the 'features of the input' in the opening quote) as linear directions in its representation space, or atoms[1]. And second, the theory that the network is capable of representing more of these 'atoms' than it has dimensions in its representation space, via superposition (the superposition hypothesis).

While superposition is a relatively uncomplicated hypothesis, I think the LRH is worth examining in more detail. It is frequently stated quite vaguely, and I think there are several possible formulations of this hypothesis, with varying degrees of plausibility, that it is worth carefully distinguishing between. For example, the linear representation hypothesis is often stated as 'networks represent features of the input as directions in representation space'. There are a few possible formulations of this:
1. (Weak LRH) some features used by neural networks are represented as atoms in representation space
2. (Strong LRH) all features used by neural networks are represented by atoms.
The weak LRH I would say is now well supported by considerable empirical evidence. The strong form is much more speculative: confirming the existence of many linear representations does not necessarily provide strong evidence for the strong hypothesis. Both the weak and the strong forms of the hypothesis can still have considerable variation, depending on what we understand by a feature.

I think that in addition to the acknowledged assumption of the LRH and superposition hypotheses, much work on SAEs in practice makes the assumption that each atom in the network will represent a "simple feature" or a "feature of the input". These features that the atoms are representations of are assumed to be 'monosemantic': they will all stand for features which are human interpretable in isolation. I will call this the monosemanticity assumption. This is difficult to state precisely, but we might formulate it as the theory that every represented variable will have a single meaning in a good description of a model. This is not a straightforward assumption due to how imprecise the notion of a single meaning is. While various more or less reasonable definitions for features are discussed in the pioneering work of Elhage, these assumptions have different implications. For instance, if one thinks of 'features' as computational intermediates in a broad sense, then superposition and the LRH imply a certain picture of the format of a model's internal representation: that what the network is doing is manipulating atoms in superposition (if y...

The Nonlinear Library
LW - Understanding Positional Features in Layer 0 SAEs by bilalchughtai

Jul 30, 2024 | 9:29


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding Positional Features in Layer 0 SAEs, published by bilalchughtai on July 30, 2024 on LessWrong.

This is an informal research note. It is the result of a few-day exploration into positional SAE features conducted as part of Neel Nanda's training phase of the ML Alignment & Theory Scholars Program - Summer 2024 cohort. Thanks to Andy Arditi, Arthur Conmy and Stefan Heimersheim for helpful feedback. Thanks to Joseph Bloom for training this SAE.

Summary: We investigate positional SAE features learned by layer 0 residual stream SAEs trained on gpt2-small. In particular, we study the activation blocks.0.hook_resid_pre, which is the sum of the token embeddings and positional embeddings. Importantly, gpt2-small uses absolute learned positional embeddings - that is, the positional embeddings are a trainable parameter (learned) and are injected into the residual stream (absolute). We find that this SAE learns a set of positional features. We investigate some of the properties of these features, finding:
  • Positional and semantic features are entirely disjoint at layer 0. Note that we do not expect this to continue holding in later layers, as attention mixes semantic and positional information. In layer 0, we should expect the SAE to disentangle positional and semantic features, as there is a natural notion of ground truth positional and semantic features that interact purely additively.
  • Generically, each positional feature spans a range of positions, except for the first few positions, which each get dedicated (and sometimes, several) features.
  • We can attribute degradation of SAE performance beyond the SAE training context length to (lack of) these positional features, and to the absolute nature of positional embeddings used by this model.

Set Up: We study pretrained gpt2-small SAEs trained on blocks.0.hook_resid_pre. This is particularly clean, as we can generate the entire input distribution to the SAE by summing each of the d_vocab token embeddings with each of the n_ctx positional embeddings, obtaining a tensor all_resid_pres: Float[Tensor, "d_vocab n_ctx d_model"]. By passing this tensor through the SAE, we can grab all of the pre/post activation function feature activations all_feature_acts: Float[Tensor, "d_vocab n_ctx d_sae"]. In this post, d_model = 768 and d_sae = 24576. Importantly, the SAE we study in this post has context_size=128. The SAE context size is the maximal length of input sequence used to generate activations for training of the SAE.

Finding features: The activation space of study can be thought of as the direct sum of the token embedding space and the positional embedding space. As such, we hypothesize that semantic and positional features learned by the SAE should be distinct. That is, we hypothesize that the feature activations for some feature i can be written as a sum g_i(tok) + h_i(pos) of a token-dependent and a position-dependent term, where x (the SAE input) is a d_model dimensional vector and, for each i, either g_i=0 or h_i=0 identically for all inputs in their domain. To investigate this we hold tok or pos fixed in all_feature_acts and vary the other input. We first restrict to pos < sae.cfg.context_size.

Positional features: We first replicate Figure 1f of Gurnee et al. (2024), which finds instances of sinusoidal positional neurons in MLP layers. To do so, we assign each feature a positional score. We first compute the mean activation of each feature at each position by averaging over all possible input tokens. The position score is the max value of this over all positions; here f_i(tok, pos) denotes the feature activation for feature i for the given input. We find positional scores drop off rapidly. There seem to only be ~50 positional features (of 24k total features) in this SAE. Inspecting the features, we find: 1. Many positional features, each with small standard deviation over input tokens (shown in lower opacit...
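The positional-score computation just described (average each feature's activation over tokens at every position, then take the max over positions) can be sketched as below; the tensor sizes are shrunk placeholders for the real [d_vocab, n_ctx, d_sae] activations.

```python
import torch

def positional_scores(all_feature_acts: torch.Tensor) -> torch.Tensor:
    """all_feature_acts: [d_vocab, n_ctx, d_sae].
    Mean over tokens at each position, then max over positions, per feature."""
    mean_over_tokens = all_feature_acts.mean(dim=0)      # [n_ctx, d_sae]
    return mean_over_tokens.max(dim=0).values            # [d_sae]

# Placeholder tensor; the real one would be roughly [50257, 128, 24576].
all_feature_acts = torch.rand(256, 128, 2048)
scores = positional_scores(all_feature_acts)
top_scores, top_features = scores.topk(50)               # candidate positional features
```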

The Nonlinear Library
EA - Non-Western EAs' perception of cross cultural interactions they had with Western EAs by Yi-Yang

Jul 24, 2024 | 33:11


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Non-Western EAs' perception of cross cultural interactions they had with Western EAs, published by Yi-Yang on July 24, 2024 on The Effective Altruism Forum.

Summary: I investigated non-Western EAs' perception of cross cultural interactions (CCIs) they had with Westerners, specifically:
1. Whether or not non-Westerners experienced CCI issues, and how often;
2. How their CCIs compare between EA and non-EA settings;
3. What kinds of subtle acts of exclusion (SAEs) they had experienced.

I interviewed 21 non-Western EAs (selected from an EA conference's Swapcard and a few from my own personal network) and discovered that an overwhelming number of interviewees (19 out of 21) thought their cross-cultural interactions in EA settings were almost all neutral or positive. However, among the same 19 interviewees who found their CCIs to be mostly neutral or positive, they've also reported the following:
  • 43% (9 out of 19) reported at least one general negative CCI
  • 48% (10 out of 19) reported at least one SAE caused by Western EAs
  • 19% (4 out of 19) reported at least one SAE caused by other non-Western EAs (or themselves)
  • 81% (17 out of 19) reported at least one general negative CCI, or at least one SAE caused by Western EAs, or at least one SAE caused by other non-Western EAs (or themselves), or a mix or all of the above.

When asked to compare CCIs between EA settings and non-EA settings, 7 out of 14 thought CCIs in EA settings are about the same when compared to non-EA settings, 5 out of 14 thought CCIs in EA settings are better for them, and 2 out of 14 thought CCIs in EA settings are worse for them.

Here are the most reported experiences:
General negative CCIs:
  • Non-Western EAs found the act of connecting with Western EAs challenging. (4x)
  • Non-Western EAs felt suspicious about the lack of representation. (3x)
  • Non-Western EAs found the English language barrier challenging to overcome. (3x)
SAEs caused by Western EAs:
  • Western EAs treating non-Western EAs in a way that's demeaning. (4x)
  • Western EAs coming across as paternalistic towards non-Western EAs. (2x)
SAEs caused by non-Western EAs:
  • Non-Western EAs changing their accent or communication style to be more Western. (2x)

For a better understanding of Western and non-Western CCIs, I highly recommend reading the highlighted negative vignettes and highlighted positive vignettes.

Methodology: I thought a more hands-on qualitative approach, like doing interviews, would be a better choice compared to a survey, because it offered me: (1) more flexibility to pivot the type of questions I ask or the things I want to say; (2) more information about a person's emotional state; (3) a way to potentially express empathy to those who might need it. I've also received feedback that interviewing people seems like the next best option too. Hence, I decided to interview people online who would identify themselves as EA or EA adjacent, and are predominantly non-Western. In these interviews, I asked:
1. How much cross cultural interactions in EA have you had?
2. How are the cross cultural interactions in EA settings that you've experienced?
3. Have you encountered any kinds of subtle acts of exclusion from others in EA settings?
4. Have you encountered acts of exclusion that are done by oppressed groups or minorities onto themselves in EA settings?
5. How do your cross-cultural experiences compare between EA and non-EA settings?
6. Are there other experiences you'd like to share? Or questions you'd like me to ask but I didn't?

I did two things with the qualitative data I got from the interviews: (1) I collected their experiences, paraphrased them, and compiled them under the appendix below. For those I found to be resonant in some hard-to-describe way, I included them in the "highlighted negative/positive vignettes" sections. (2) I did some basic qualitative re...

The Nonlinear Library
EA - Evidence of Poor Cross-Cultural Interactions in the EA community by Yi-Yang

Jul 24, 2024 | 20:15


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Evidence of Poor Cross-Cultural Interactions in the EA community, published by Yi-Yang on July 24, 2024 on The Effective Altruism Forum.

Summary: In this project, I investigated non-Western EAs' perception of CCIs they had with Westerners, specifically:
1. How often non-Westerners experienced CCI issues;
2. What kinds of subtle acts of exclusion (SAEs) they had experienced;
3. How their CCIs compare between EA and non-EA settings.
To do that, I collected an array of evidence from seven sources (e.g., anecdotes from interviews and a focus group, and some statistics from three surveys not done by me). Based on the evidence on CCIs I have collected so far, I believe that poor CCIs are likely to be a common but minor problem for most non-Westerners in the EA community. At the organisational or community level, I would not flag CCI issues as something to be heavily prioritised (moderate confidence), but I would recommend EA-aligned organisations and organisers to start or maintain interventions that are sensible or whose trade-offs are acceptable, like some of the ones listed here by AmAristizabal. At the individual level, I recommend:
1. Checking out some of the vignettes shared by non-Western EAs here and here
2. Reading more examples of SAEs here
3. Reading some of my low-confidence takes on what non-Western and Western folks could do to improve CCIs

Background: I noticed that I was feeling annoyed in some of my cross-cultural interactions (CCIs) in the EA community, but I couldn't tell for sure whether these interactions had exclusionary elements in them. These are more subtle, and are not the overt racist behaviours that I'm more familiar with. Hence, I started this investigation out of a desire to sanity check myself ("Am I misinterpreting things? Or has anyone else experienced the same thing?"). I would also be happy if this project is useful to others too, perhaps by making non-Western folks feel less perplexed or less alone. In this project, I investigated non-Western EAs' perception of CCIs they had with Westerners, specifically: (1) how often non-Westerners experienced CCI issues; (2) what kinds of subtle acts of exclusion (SAEs) they had experienced; (3) how their CCIs compare between EA and non-EA settings. This investigation was done pretty informally and in a non-strategic way (e.g. I wasn't really explicitly thinking about this in a Bayesian probability way), but it does consist of an array of evidence from seven sources that I think, when combined, are pretty informative.

Evidence compiled

Evidence that might indicate less negative CCIs:
1. EA Survey 2022. According to the Rethink Priorities team who led the EA Survey 2022 project, survey respondents who identified as more non-Western scored slightly better than survey respondents who identified as more Western in terms of:
  • Satisfaction (mean): 7.55 (N=219) versus 7.17 (N=2251) out of 10.00 points
  • Retention (mean): 5.51 (N=144) versus 5.42 (N=1736) out of 7.00 points
  • Mental health (mean): 3.49 (N=143) versus 3.27 (N=1528) out of 5.00 points
The above three metrics aren't exactly what I'm looking for, which is belongingness. It might be the case that non-Westerners do experience CCI issues but still get a lot of value from EA or belongingness in their local EA groups.

Evidence that might indicate more negative CCIs:
1. My personal experience. Firstly, I've noticed Western folks "hijacking" (most likely unconsciously or unintentionally) norms in spaces where non-Western folks traditionally belong, are the majority, or a mix of both. I've noticed at least one such behaviour in an EA setting before. Here are a few non-EA-related examples (to preserve anonymity): A discussion group in Malaysia I was a part of has a norm about raising one's hands and letting the moderator pick the next speaker to make speaking time more ...

The Nonlinear Library
LW - Efficient Dictionary Learning with Switch Sparse Autoencoders by Anish Mudide

Jul 22, 2024 | 20:21


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Efficient Dictionary Learning with Switch Sparse Autoencoders, published by Anish Mudide on July 22, 2024 on LessWrong. Produced as part of the ML Alignment & Theory Scholars Program - Summer 2024 Cohort 0. Summary To recover all the relevant features from a superintelligent language model, we will likely need to scale sparse autoencoders (SAEs) to billions of features. Using current architectures, training extremely wide SAEs across multiple layers and sublayers at various sparsity levels is computationally intractable. Conditional computation has been used to scale transformers (Fedus et al.) to trillions of parameters while retaining computational efficiency. We introduce the Switch SAE, a novel architecture that leverages conditional computation to efficiently scale SAEs to many more features. 1. Introduction The internal computations of large language models are inscrutable to humans. We can observe the inputs and the outputs, as well as every intermediate step in between, and yet, we have little to no sense of what the model is actually doing. For example, is the model inserting security vulnerabilities or backdoors into the code that it writes? Is the model lying, deceiving or seeking power? Deploying a superintelligent model into the real world without being aware of when these dangerous capabilities may arise leaves humanity vulnerable. Mechanistic interpretability (Olah et al.) aims to open the black-box of neural networks and rigorously explain the underlying computations. Early attempts to identify the behavior of individual neurons were thwarted by polysemanticity, the phenomenon in which a single neuron is activated by several unrelated features (Olah et al.). Language models must pack an extremely vast amount of information (e.g., the entire internet) within a limited capacity, encouraging the model to rely on superposition to represent many more features than there are dimensions in the model state (Elhage et al.). Sharkey et al. and Cunningham et al. propose to disentangle superimposed model representations into monosemantic, cleanly interpretable features by training unsupervised sparse autoencoders (SAEs) on intermediate language model activations. Recent work (Templeton et al., Gao et al.) has focused on scaling sparse autoencoders to frontier language models such as Claude 3 Sonnet and GPT-4. Despite scaling SAEs to 34 million features, Templeton et al. estimate that they are likely orders of magnitude short of capturing all features. Furthermore, Gao et al. train SAEs on a series of language models and find that larger models require more features to achieve the same reconstruction error. Thus, to capture all relevant features of future large, superintelligent models, we will likely need to scale SAEs to several billions of features. With current methodologies, training SAEs with billions of features at various layers, sublayers and sparsity levels is computationally infeasible. Training a sparse autoencoder generally consists of six major computations: the encoder forward pass, the encoder gradient, the decoder forward pass, the decoder gradient, the latent gradient and the pre-bias gradient. Gao et al. introduce kernels and tricks that leverage the sparsity of the TopK activation function to dramatically optimize all computations excluding the encoder forward pass, which is not (yet) sparse. 
After implementing these optimizations, Gao et al. attribute the majority of the compute to the dense encoder forward pass and the majority of the memory to the latent pre-activations. No work has attempted to accelerate or improve the memory efficiency of the encoder forward pass, which remains the sole dense matrix multiplication. In a standard deep learning model, every parameter is used for every input. An alternative approach is conditional computatio...
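
To make the conditional-computation idea above concrete, here is a minimal PyTorch sketch of a mixture-of-experts-style SAE in which a learned router dispatches each activation to one of several smaller expert encoders, so only a fraction of the encoder weights participates in any forward pass. The class name, dimensions, and the TopK-plus-router-scaling details are illustrative assumptions, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchSAE(nn.Module):
    """Sketch: a router picks one expert encoder/decoder per input, so only a
    fraction of the encoder weights is used on any forward pass."""

    def __init__(self, d_model: int, n_experts: int, d_expert: int, k: int):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.W_enc = nn.Parameter(torch.randn(n_experts, d_model, d_expert) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(n_experts, d_expert, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, d_model]
        probs = F.softmax(self.router(x - self.b_dec), dim=-1)  # [batch, n_experts]
        expert = probs.argmax(dim=-1)                           # hard routing: one expert per input
        recon = torch.zeros_like(x)
        for e in range(self.W_enc.shape[0]):
            mask = expert == e
            if not mask.any():
                continue
            acts = F.relu((x[mask] - self.b_dec) @ self.W_enc[e])   # encode with one small expert
            topk = torch.topk(acts, self.k, dim=-1)                 # keep only the k largest features
            sparse = torch.zeros_like(acts).scatter_(-1, topk.indices, topk.values)
            # scale by the router probability so the router still receives gradients
            recon[mask] = probs[mask, e].unsqueeze(-1) * (sparse @ self.W_dec[e]) + self.b_dec
        return recon

sae = SwitchSAE(d_model=768, n_experts=8, d_expert=4096, k=32)
x_hat = sae(torch.randn(16, 768))
```

The point of the design is that each input only touches one expert's encoder matrix, so the dense encoder forward pass scales with the expert width rather than the total dictionary size.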

The Nonlinear Library
AF - BatchTopK: A Simple Improvement for TopK-SAEs by Bart Bussmann

The Nonlinear Library

Play Episode Listen Later Jul 20, 2024 7:17


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: BatchTopK: A Simple Improvement for TopK-SAEs, published by Bart Bussmann on July 20, 2024 on The AI Alignment Forum. Work done in Neel Nanda's stream of MATS 6.0. Epistemic status: Tried this on a single sweep and it seems to work well, but it might well be a fluke of something particular to our implementation or experimental set-up. As there are also some theoretical reasons to expect this technique to work (adaptive sparsity), it seems probable that for many TopK SAE set-ups it could be a good idea to also try BatchTopK. As we're not planning to investigate this much further and it might be useful to others, we're just sharing what we've found so far. TL;DR: Instead of taking the TopK feature activations per token during training, taking the Top(K*batch_size) for every batch seems to improve SAE performance. During inference, this activation can be replaced with a single global threshold for all features. Introduction Sparse autoencoders (SAEs) have emerged as a promising tool for interpreting the internal representations of large language models. By learning to reconstruct activations using only a small number of features, SAEs can extract monosemantic concepts from the representations inside transformer models. Recently, OpenAI published a paper exploring the use of TopK activation functions in SAEs. This approach directly enforces sparsity by only keeping the K largest activations per sample. While effective, TopK forces every token to use exactly k features, which is likely suboptimal. We came up with a simple modification that solves this and seems to improve its performance. BatchTopK Standard TopK SAEs apply the TopK operation independently to each sample in a batch. For a target sparsity of K, this means exactly K features are activated for every sample. BatchTopK instead applies the TopK operation across the entire flattened batch: 1. Flatten all feature activations across the batch 2. Take the top (K * batch_size) activations 3. Reshape back to the original batch shape This allows more flexibility in how many features activate per sample, while still maintaining an average of K active features across the batch. Experimental Set-Up For both the TopK and the BatchTopK SAEs we train a sweep with the following hyperparameters: Model: gpt2-small Site: layer 8 resid_pre Batch size: 4096 Optimizer: Adam (lr=3e-4, beta1 = 0.9, beta2=0.99) Number of tokens: 1e9 Expansion factor: [4, 8, 16, 32] Target L0 (k): [16, 32, 64] As in the OpenAI paper, the input gets normalized before feeding it into the SAE and calculating the reconstruction loss. We also use the same auxiliary loss function for dead features (features that didn't activate for 5 batches) that calculates the loss on the residual using the top 512 dead features per sample and gets multiplied by a factor 1/32. Results For a fixed number of active features (L0=32) the BatchTopK SAE has a lower normalized MSE than the TopK SAE and less downstream loss degradation across different dictionary sizes. Similarly, for fixed dictionary size (12288) BatchTopK outperforms TopK for different values of k. Our main hypothesis is that the improved performance comes from adaptive sparsity: some samples contain more highly activating features than others. Let's have a look at the distribution of the number of active features per sample for the BatchTopK model.
The BatchTopK model indeed makes use of the flexibility to use different sparsities for different inputs. We suspect that the weird peak on the left side is from the feature activations on BOS-tokens, given that its frequency is very close to 1 in 128, which is the sequence length. This serves as a great example of why BatchTopK might outperform TopK. At the BOS-token, a sequence has very little information yet, but the TopK SAE still activates 32 features. The BatchTopK model "saves" th...
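
A minimal sketch of the batch-level TopK operation described in this episode: keep the top k * batch_size activations over the whole flattened batch rather than exactly k per token. Function and variable names are mine, not from the post.

```python
import torch

def batch_topk(acts: torch.Tensor, k: int) -> torch.Tensor:
    """acts: [batch, n_features] encoder pre-activations. Keeps the top (k * batch_size)
    values across the flattened batch, so sparsity averages k per sample but can vary
    from sample to sample (adaptive sparsity)."""
    batch_size, n_features = acts.shape
    flat = acts.flatten()                                   # 1. flatten across the batch
    top = torch.topk(flat, k * batch_size)                  # 2. take the top (k * batch_size) activations
    sparse = torch.zeros_like(flat).scatter_(0, top.indices, top.values)
    return sparse.reshape(batch_size, n_features)           # 3. reshape back to the original batch shape

acts = torch.relu(torch.randn(4096, 12288))                 # e.g. SAE encoder pre-activations
sparse_acts = batch_topk(acts, k=32)
```

At inference time, as the post notes, this batch-dependent operation can be replaced by a single global activation threshold estimated from training data.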

The Nonlinear Library
AF - JumpReLU SAEs + Early Access to Gemma 2 SAEs by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Jul 19, 2024 2:42


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: JumpReLU SAEs + Early Access to Gemma 2 SAEs, published by Neel Nanda on July 19, 2024 on The AI Alignment Forum. New paper from the Google DeepMind mechanistic interpretability team, led by Sen Rajamanoharan! We introduce JumpReLU SAEs, a new SAE architecture that replaces the standard ReLUs with discontinuous JumpReLU activations, and seems to be (narrowly) state of the art over existing methods like TopK and Gated SAEs for achieving high reconstruction at a given sparsity level, without a hit to interpretability. We train through the discontinuity with straight-through estimators, which also let us directly optimise the L0. To accompany this, we will release the weights of hundreds of JumpReLU SAEs on every layer and sublayer of Gemma 2 2B and 9B in a few weeks. Apply now for early access to the 9B ones! We're keen to get feedback from the community, and to get these into the hands of researchers as fast as possible. There's a lot of great projects that we hope will be much easier with open SAEs on capable models! Gated SAEs already reduced to JumpReLU activations after weight tying, so this can be thought of as Gated SAEs++, but less computationally intensive to train, and better performing. They should be runnable in existing Gated implementations. Abstract: Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully; yet to be interpretable the decomposition must be sparse - two objectives that are in tension. In this paper, we introduce JumpReLU SAEs, which achieve state-of-the-art reconstruction fidelity at a given sparsity level on Gemma 2 9B activations, compared to other recent advances such as Gated and TopK SAEs. We also show that this improvement does not come at the cost of interpretability through manual and automated interpretability studies. JumpReLU SAEs are a simple modification of vanilla (ReLU) SAEs - where we replace the ReLU with a discontinuous JumpReLU activation function - and are similarly efficient to train and run. By utilising straight-through-estimators (STEs) in a principled manner, we show how it is possible to train JumpReLU SAEs effectively despite the discontinuous JumpReLU function introduced in the SAE's forward pass. Similarly, we use STEs to directly train L0 to be sparse, instead of training on proxies such as L1, avoiding problems like shrinkage. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
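
As a rough illustration of the idea, here is a PyTorch sketch of a JumpReLU activation with a surrogate (straight-through-style) gradient for the threshold. The paper derives its own kernel-based pseudo-derivatives, so the exact estimator below (a simple rectangle kernel of width `bandwidth`) is an assumption for illustration only.

```python
import torch

class JumpReLU(torch.autograd.Function):
    """JumpReLU(z) = z * 1[z > theta]. The step function has zero gradient almost
    everywhere, so a surrogate gradient is used to learn the threshold theta."""

    @staticmethod
    def forward(ctx, z, theta, bandwidth):
        ctx.save_for_backward(z, theta)
        ctx.bandwidth = bandwidth
        return z * (z > theta).float()

    @staticmethod
    def backward(ctx, grad_out):
        z, theta = ctx.saved_tensors
        eps = ctx.bandwidth
        active = (z > theta).float()
        grad_z = grad_out * active                       # ordinary gradient on the active units
        # surrogate gradient for theta: only inputs near the threshold move it
        near = ((z - theta).abs() < eps / 2).float()
        grad_theta = (-(theta / eps) * near * grad_out).sum(dim=0)
        return grad_z, grad_theta, None

# theta is a per-feature threshold; bandwidth controls how wide the surrogate kernel is
feature_acts = JumpReLU.apply(torch.randn(8, 1024), torch.full((1024,), 0.1), 1e-3)
```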

The Nonlinear Library
AF - SAEs (usually) Transfer Between Base and Chat Models by Connor Kissane

The Nonlinear Library

Play Episode Listen Later Jul 18, 2024 19:23


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs (usually) Transfer Between Base and Chat Models, published by Connor Kissane on July 18, 2024 on The AI Alignment Forum. This is an interim report sharing preliminary results that we are currently building on. We hope this update will be useful to related research occurring in parallel. Executive Summary We train SAEs on base / chat model pairs and find that SAEs trained on the base model transfer surprisingly well to reconstructing chat activations (and vice versa) on Mistral-7B and Qwen 1.5 0.5B. We also find that they don't transfer on Gemma v1 2B, and are generally bad at reconstructing

The Nonlinear Library
AF - Stitching SAEs of different sizes by Bart Bussmann

The Nonlinear Library

Play Episode Listen Later Jul 13, 2024 21:12


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Stitching SAEs of different sizes, published by Bart Bussmann on July 13, 2024 on The AI Alignment Forum. Work done in Neel Nanda's stream of MATS 6.0, equal contribution by Bart Bussmann and Patrick Leask; Patrick Leask is concurrently a PhD candidate at Durham University. TL;DR: When you scale up an SAE, the features in the larger SAE can be categorized into two groups: 1) "novel features" with new information not in the small SAE and 2) "reconstruction features" that sparsify information that already exists in the small SAE. You can stitch SAEs by adding the novel features to the smaller SAE. Introduction Sparse autoencoders (SAEs) have been shown to recover sparse, monosemantic features from language models. However, there has been limited research into how those features vary with dictionary size, that is, when you take the same activation in the same model and train a wider dictionary on it, what changes? And how do the features learned vary? We show that features in larger SAEs cluster into two kinds of features: those that capture similar information to the smaller SAE (either identical features, or split features; about 65%), and those which capture novel features absent in the smaller model (the remaining 35%). We validate this by showing that inserting the novel features from the larger SAE into the smaller SAE boosts the reconstruction performance, while inserting the similar features makes performance worse. Building on this insight, we show how features from multiple SAEs of different sizes can be combined to create a "Frankenstein" model that outperforms SAEs with an equal number of features, though tends to lead to higher L0, making a fair comparison difficult. Our work provides new understanding of how SAE dictionary size impacts the learned feature space, and how to reason about whether to train a wider SAE. We hope that this method may also lead to a practically useful way of training high-performance SAEs with less feature splitting and a wider range of learned novel features. Larger SAEs learn both similar and entirely novel features Set-up We use sparse autoencoders as in Towards Monosemanticity and Sparse Autoencoders Find Highly Interpretable Directions. In our setup, the feature activations are computed as f(x) = ReLU(W_enc x + b_enc). Based on these feature activations, the input is then reconstructed as x_hat = W_dec f(x) + b_dec. The encoder and decoder matrices and biases are trained with a loss function that combines an L2 penalty on the reconstruction error and an L1 penalty on the feature activations: L = ||x - x_hat||_2^2 + λ||f(x)||_1. In our experiments, we train a range of sparse autoencoders (SAEs) with varying widths across residual streams in GPT-2 and Pythia-410m. The width of an SAE is determined by the number of features (F) in the sparse autoencoder. Our smallest SAE on GPT-2 consists of only 768 features, while the largest one has nearly 100,000 features.
Here is the full list of SAEs used in this research:

| Name | Model | Site | Dictionary size | L0 | MSE | CE Loss Recovered from zero ablation | CE Loss Recovered from mean ablation |
|---|---|---|---|---|---|---|---|
| GPT2-768 | gpt2-small | layer 8 of 12, resid_pre | 768 | 35.2 | 2.72 | 0.915 | 0.876 |
| GPT2-1536 | gpt2-small | layer 8 of 12, resid_pre | 1536 | 39.5 | 2.22 | 0.942 | 0.915 |
| GPT2-3072 | gpt2-small | layer 8 of 12, resid_pre | 3072 | 42.4 | 1.89 | 0.955 | 0.937 |
| GPT2-6144 | gpt2-small | layer 8 of 12, resid_pre | 6144 | 43.8 | 1.631 | 0.965 | 0.949 |
| GPT2-12288 | gpt2-small | layer 8 of 12, resid_pre | 12288 | 43.9 | 1.456 | 0.971 | 0.958 |
| GPT2-24576 | gpt2-small | layer 8 of 12, resid_pre | 24576 | 42.9 | 1.331 | 0.975 | 0.963 |
| GPT2-49152 | gpt2-small | layer 8 of 12, resid_pre | 49152 | 42.4 | 1.210 | 0.978 | 0.967 |
| GPT2-98304 | gpt2-small | layer 8 of 12, resid_pre | 98304 | 43.9 | 1.144 | 0.980 | 0.970 |
| Pythia-8192 | Pythia-410M-deduped | layer 3 of 24, resid_pre | 8192 | 51.0 | 0.030 | 0.977 | 0.972 |
| Pythia-16384 | Pythia-410M-deduped | layer 3 of 24, resid_pre | 16384 | 43.2 | 0.024 | 0.983 | 0.979 |

The base language models used are those included in Transform...
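
Below is a hedged sketch of the stitching idea: take a small SAE's dictionary and append the larger SAE's "novel" features. For simplicity I judge novelty by decoder-direction cosine similarity, which is an assumption of mine; the post itself categorises features by whether inserting them helps or hurts reconstruction.

```python
import torch
import torch.nn.functional as F

def stitch_saes(small_W_enc, small_W_dec, large_W_enc, large_W_dec, sim_threshold=0.7):
    """All weight matrices are assumed to be [n_features, d_model] (one row per feature).
    Returns a stitched dictionary: the small SAE's features plus those large-SAE features
    whose decoder direction is not close to any small-SAE feature."""
    small_dirs = F.normalize(small_W_dec, dim=-1)
    large_dirs = F.normalize(large_W_dec, dim=-1)
    # max cosine similarity of each large-SAE feature to the small SAE's features
    max_sim = (large_dirs @ small_dirs.T).max(dim=-1).values
    novel = max_sim < sim_threshold
    stitched_W_enc = torch.cat([small_W_enc, large_W_enc[novel]], dim=0)
    stitched_W_dec = torch.cat([small_W_dec, large_W_dec[novel]], dim=0)
    return stitched_W_enc, stitched_W_dec
```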

The Nonlinear Library
LW - How ARENA course material gets made by CallumMcDougall

The Nonlinear Library

Play Episode Listen Later Jul 3, 2024 14:10


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How ARENA course material gets made, published by CallumMcDougall on July 3, 2024 on LessWrong. TL;DR In this post, I describe my methodology for building new material for ARENA. I'll mostly be referring to the exercises on IOI, Superposition and Function Vectors as case studies. I expect this to be useful for people who are interested in designing material for ARENA or ARENA-like courses, as well as people who are interested in pedagogy or ML paper replications. The process has 3 steps: 1. Start with something concrete 2. First pass: replicate, and understand 3. Second pass: exercise-ify Summary I'm mostly basing this on the following 3 sets of exercises: Indirect Object Identification - these exercises focus on the IOI paper (from Conmy et al). The goal is to have people understand what exploratory analysis of transformers looks like, and introduce the key ideas of the circuits agenda. Superposition & SAEs - these exercises focus on understanding superposition and the agenda of dictionary learning (specifically sparse autoencoders). Most of the exercises explore Anthropic's Toy Models of Superposition paper, except for the last 2 sections which explore sparse autoencoders (firstly by applying them to the toy model setup, secondly by exploring a sparse autoencoder trained on a language model). Function Vectors - these exercises focus on the Function Vectors paper by David Bau et al, although they also make connections with related work such as Alex Turner's GPT2-XL steering vector work. These exercises were interesting because they also had the secondary goal of being an introduction to the nnsight library, in much the same way that the intro to mech interp exercises were also an introduction to TransformerLens. The steps I go through are listed below. I'm indexing from zero because I'm a software engineer so of course I am. The steps assume you already have an idea of what exercises you want to create; in Appendix (1) you can read some thoughts on what makes for a good exercise set. 1. Start with something concrete When creating material, you don't want to be starting from scratch. It's useful to have source code available to browse - bonus points if that takes the form of a Colab or something which is self-contained and has easily visible output. IOI - this was Neel's "Exploratory Analysis Demo" exercises. The rest of the exercises came from replicating the paper directly. Superposition - this was Anthropic's Colab notebook (although the final version went quite far beyond this). The very last section (SAEs on transformers) was based on Neel Nanda's demo Colab. Function Vectors - I started with the NDIF demo notebook, to show how some basic nnsight syntax worked. As for replicating the actual function vectors paper, unlike the other 2 examples I was mostly just working from the paper directly. It helped that I was collaborating with some of this paper's authors, so I was able to ask them some questions to clarify aspects of the paper. 2. First-pass: replicate, and understand The first thing I did in each of these cases was to go through the material I started with, and make sure I understood what was going on.
Paper replication is a deep enough topic for its own series of blog posts (many already exist), although I'll emphasise that I'm not usually talking about full paper replication here, because ideally you'll be starting from something a bit further along, be that a Colab, a different tutorial, or something else. And even when you are just working directly from a paper, you shouldn't make the replication any harder for yourself than you need to. If there's code you can take from somewhere else, then do. My replication usually takes the form of working through a notebook in VSCode. I'll either start from scratch, or from a downloaded Colab if I'm using one as a ...

The Nonlinear Library
AF - Interpreting Preference Models w/ Sparse Autoencoders by Logan Riggs Smith

The Nonlinear Library

Play Episode Listen Later Jul 1, 2024 15:43


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Interpreting Preference Models w/ Sparse Autoencoders, published by Logan Riggs Smith on July 1, 2024 on The AI Alignment Forum. Preference Models (PMs) are trained to imitate human preferences and are used when training with RLHF (reinforcement learning from human feedback); however, we don't know what features the PM is using when outputting reward. For example, maybe curse words make the reward go down and wedding-related words make it go up. It would be good to verify that the features we wanted to instill in the PM (e.g. helpfulness, harmlessness, honesty) are actually rewarded and those we don't (e.g. deception, sycophancy) aren't. Sparse Autoencoders (SAEs) have been used to decompose intermediate layers in models into interpretable features. Here we train SAEs on a 7B parameter PM, and find the features that are most responsible for the reward going up & down. High level takeaways: 1. We're able to find SAE features that have a large causal effect on reward which can be used to "jail break" prompts. 2. We do not explain 100% of reward differences through SAE features even though we tried for a couple hours. What are PMs? [skip if you're already familiar] When talking to a chatbot, it can output several different responses, and you can choose which one you believe is better. We can then train the LLM on this feedback for every output, but humans are too slow. So we'll just get, say, 100k human preferences of "response A is better than response B", and train another AI to predict human preferences! But to take in text & output a reward, a PM would benefit from understanding language. So one typically trains a PM by first taking an already pretrained model (e.g. GPT-3), and replacing the last component of the LLM of shape [d_model, vocab_size], which converts the residual stream to 50k numbers for the probability of each word in its vocabulary, to [d_model, 1] which converts it to 1 number which represents reward. They then call this pretrained model w/ this new "head" a "Preference Model", and train it to predict the human-preference dataset. Did it give the human preferred response [A] a higher number than [B]? Good. If not, bad! This leads to two important points: 1. Reward is relative - the PM is only trained to say the human preferred response is better than the alternative. So a large negative reward or large positive reward don't have objective meaning. All that matters is the relative reward difference for two completions given the same prompt. 1. (h/t to Ethan Perez's post) 2. Most features are already learned in pretraining - the PM isn't learning new features from scratch. It's taking advantage of the pretrained model's existing concepts. These features might change a bit or compose w/ each other differently though. 1. Note: this is an unsubstantiated hypothesis of mine. Finding High Reward-affecting Features w/ SAEs We trained 6 SAEs on layers 2,8,12,14,16,20 of an open source 7B parameter PM, finding 32k features for each layer. We then find the most important features for the reward going up or down (specifics in Technical Details section). Below is a selection of features found through this process that we thought were interesting enough to try to create prompts w/.
(My list of feature interpretations for each layer can be found here) Negative Features A "negative" feature is a feature that will decrease the reward that the PM predicts. This could include features like cursing or saying the same word repeatedly. Therefore, we should expect that removing a negative feature makes the reward go up. When looking at a feature, I'll look at the top datapoints where removing it affected the reward the most: Removing feature 11612 made the chosen reward go up by 1.2, from 4.79->6.02, and had no effect on the rejected completion because it doesn't a...
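
To make the "swap the unembedding for a scalar head" recipe concrete, here is a minimal sketch of a preference model and the pairwise training objective. The `body` interface and all names are placeholders of mine, not the actual 7B PM used in the post.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PreferenceModel(nn.Module):
    """A pretrained transformer body whose [d_model, vocab_size] unembedding is
    replaced by a [d_model, 1] head that outputs a scalar reward."""

    def __init__(self, pretrained_body: nn.Module, d_model: int):
        super().__init__()
        self.body = pretrained_body                      # assumed to return [batch, seq, d_model]
        self.reward_head = nn.Linear(d_model, 1, bias=False)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        resid = self.body(tokens)
        return self.reward_head(resid[:, -1]).squeeze(-1)   # one scalar reward per sequence

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Only the difference matters (reward is relative): train the chosen completion
    # to score higher than the rejected one for the same prompt.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```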

The Nonlinear Library
AF - SAE feature geometry is outside the superposition hypothesis by Jake Mendel

The Nonlinear Library

Play Episode Listen Later Jun 24, 2024 18:10


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE feature geometry is outside the superposition hypothesis, published by Jake Mendel on June 24, 2024 on The AI Alignment Forum. Written at Apollo Research Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don't currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models. Epistemic status: Decently confident that the ideas here are directionally correct. I've been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before and some of them have written about it in various places too. Some of my views, especially the merit of certain research approaches to tackle the problems I highlight, have been presented here without my best attempt to argue for them. What would it mean if we could fully understand an activation space through the lens of superposition? If you fully understand something, you can explain everything about it that matters to someone else in terms of concepts you (and hopefully they) understand. So we can think about how well I understand an activation space by how well I can communicate to you what the activation space is doing, and we can test if my explanation is good by seeing if you can construct a functionally equivalent activation space (which need not be completely identical of course) solely from the information I have given you. In the case of SAEs, here's what I might say: 1. The activation space contains this list of 100 million features, which I can describe concisely in words because they are monosemantic. 2. The features are embedded as vectors, and the activation vector on any input is a linear combination of the feature vectors that are related to the input. 3. As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that's enough to pin down all the structure that matters - all the other details of the overcomplete basis are random. Every part of this explanation is in terms of things I understand precisely. My features are described in natural language, and I know what a random overcomplete basis is (although I'm on the fence about whether a large correlation matrix counts as something that I understand). The placement of each feature vector in the activation space matters Why might this description be insufficient? First, there is the pesky problem of SAE reconstruction errors, which are parts of activation vectors that are missed when we give this description. 
Second, not all features seem monosemantic, and it is hard to find semantic descriptions of even the most monosemantic features that have both high sensitivity and specificity, let alone descriptions which allow us to predict the quantitative values that activating features take on a particular input. But let's suppose that these issues have been solved: SAE improvements lead to perfect reconstruction and extremely monosemantic features, and new ...

The Nonlinear Library
LW - SAE feature geometry is outside the superposition hypothesis by jake mendel

The Nonlinear Library

Play Episode Listen Later Jun 24, 2024 18:09


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE feature geometry is outside the superposition hypothesis, published by jake mendel on June 24, 2024 on LessWrong. Summary: Superposition-based interpretations of neural network activation spaces are incomplete. The specific locations of feature vectors contain crucial structural information beyond superposition, as seen in circular arrangements of day-of-the-week features and in the rich structures of feature UMAPs. We don't currently have good concepts for talking about this structure in feature geometry, but it is likely very important for model computation. An eventual understanding of feature geometry might look like a hodgepodge of case-specific explanations, or supplementing superposition with additional concepts, or plausibly an entirely new theory that supersedes superposition. To develop this understanding, it may be valuable to study toy models in depth and do theoretical or conceptual work in addition to studying frontier models. Epistemic status: Decently confident that the ideas here are directionally correct. I've been thinking these thoughts for a while, and recently got round to writing them up at a high level. Lots of people (including both SAE stans and SAE skeptics) have thought very similar things before and some of them have written about it in various places too. Some of my views, especially the merit of certain research approaches to tackle the problems I highlight, have been presented here without my best attempt to argue for them. What would it mean if we could fully understand an activation space through the lens of superposition? If you fully understand something, you can explain everything about it that matters to someone else in terms of concepts you (and hopefully they) understand. So we can think about how well I understand an activation space by how well I can communicate to you what the activation space is doing, and we can test if my explanation is good by seeing if you can construct a functionally equivalent activation space (which need not be completely identical of course) solely from the information I have given you. In the case of SAEs, here's what I might say: 1. The activation space contains this list of 100 million features, which I can describe concisely in words because they are monosemantic. 2. The features are embedded as vectors, and the activation vector on any input is a linear combination of the feature vectors that are related to the input. 3. As for where in the activation space each feature vector is placed, oh that doesn't really matter and any nearly orthogonal overcomplete basis will do. Or maybe if I'm being more sophisticated, I can specify the correlations between features and that's enough to pin down all the structure that matters - all the other details of the overcomplete basis are random. Every part of this explanation is in terms of things I understand precisely. My features are described in natural language, and I know what a random overcomplete basis is (although I'm on the fence about whether a large correlation matrix counts as something that I understand). The placement of each feature vector in the activation space matters Why might this description be insufficient? First, there is the pesky problem of SAE reconstruction errors, which are parts of activation vectors that are missed when we give this description. 
Second, not all features seem monosemantic, and it is hard to find semantic descriptions of even the most monosemantic features that have both high sensitivity and specificity, let alone descriptions which allow us to quantitatively predict the quantitative values that activating features take on a particular input. But let's suppose that these issues have been solved: SAE improvements lead to perfect reconstruction and extremely monosemantic features, and new autointerp techniques lea...

The Nonlinear Library
AF - Attention Output SAEs Improve Circuit Analysis by Connor Kissane

The Nonlinear Library

Play Episode Listen Later Jun 21, 2024 32:55


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention Output SAEs Improve Circuit Analysis, published by Connor Kissane on June 21, 2024 on The AI Alignment Forum. This is the final post of our Alignment Forum sequence produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort. Executive Summary In a previous post we trained Attention Output SAEs on every layer of GPT-2 Small. Following that work, we wanted to stress-test that Attention SAEs were genuinely helpful for circuit analysis research. This would both validate SAEs as a useful tool for mechanistic interpretability researchers, and provide evidence that they are identifying the real variables of the model's computation. We believe that we now have evidence that attention SAEs can: Help make novel mechanistic interpretability discoveries that prior methods could not make. Allow for tracing information through the model's forward passes on arbitrary prompts. In this post we discuss the three outputs from this circuit analysis work: 1. We use SAEs to deepen our understanding of the IOI circuit. It was previously thought that the indirect object's name was identified by tracking the names' positions, whereas we find that instead the model tracks whether names are before or after "and". This was not noticed in prior work, but is obvious with the aid of SAEs. 2. We introduce "recursive direct feature attribution" (recursive DFA) and release an Attention Circuit Explorer tool for circuit analysis on GPT-2 Small (Demo 1 and Demo 2). One of the nice aspects of attention is that attention heads are linear when freezing the appropriate attention patterns. As a result, we can identify which source tokens triggered the firing of a feature. We can perform this recursively to track backwards through both attention and residual stream SAE features in models. 1. We also announce a $1,000 bounty for whoever can produce the most interesting example of an attention feature circuit by 07/15/24 as subjectively assessed by the authors. See the section "Even cooler examples" for more details on the bounty. 3. We open source HookedSAETransformer to SAELens, which makes it easy to splice in SAEs during a forward pass and cache + intervene on SAE features. Get started with this demo notebook. Introduction With continued investment into dictionary learning research, there still remains a concerning lack of evidence that SAEs are useful interpretability tools in practice. Further, while SAEs clearly find interpretable features (Cunningham et al.; Bricken et al.), it's not obvious that these features are true causal variables used by the model. In this post we address these concerns by applying our GPT-2 Small Attention SAEs to improve circuit analysis research. We start by using our SAEs to deepen our understanding of the IOI task. The first step is evaluating if our SAEs are sufficient for the task. We "splice in" our SAEs at each layer, replacing attention layer outputs with their SAE reconstructed activations, and study how this affects the model's ability to perform the task - if crucial information is lost by the SAE, then they will be a poor tool for analysis. At their best, we find that SAEs at the early-middle layers almost fully recover model performance, allowing us to leverage these to answer a long-standing open question and discover novel insights about IOI.
However, we also find that our SAEs at the later layers (and layer 0) damage the model's ability to perform the task, suggesting we'll need more progress in the science and scaling of SAEs before we can analyze a full end-to-end feature circuit. We then move beyond IOI and develop a visualization tool (link) to explore attention feature circuits on arbitrary prompts, introducing a new technique called recursive DFA. This technique exploits the fact that transformers are almost linear i...
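
A sketch of the "splice in the SAE" evaluation described above, assuming a TransformerLens-style HookedTransformer with a `blocks.{layer}.hook_attn_out` hook point and an SAE callable that maps activations to their reconstructions; the names and exact hook point are my assumptions rather than the authors' code (their released HookedSAETransformer wraps this up more conveniently).

```python
def splice_sae_hook(sae):
    """Returns a hook that swaps an activation for its SAE reconstruction."""
    def hook(activation, hook=None):
        # activation: [batch, seq, d_model]; the SAE reconstructs the same shape
        return sae(activation)
    return hook

def loss_with_spliced_sae(model, sae, tokens, layer: int):
    """Loss of the model when the attention output at `layer` is replaced by its SAE
    reconstruction; comparing this to the clean loss measures how much task-relevant
    information the SAE preserves."""
    hook_name = f"blocks.{layer}.hook_attn_out"
    return model.run_with_hooks(
        tokens,
        return_type="loss",
        fwd_hooks=[(hook_name, splice_sae_hook(sae))],
    )
```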

The Nonlinear Library
AF - SAEs Discover Meaningful Features in the IOI Task by Alex Makelov

The Nonlinear Library

Play Episode Listen Later Jun 5, 2024 19:04


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAEs Discover Meaningful Features in the IOI Task, published by Alex Makelov on June 5, 2024 on The AI Alignment Forum. TLDR: recently, we wrote a paper proposing several evaluations of SAEs against "ground-truth" features computed w/ supervision for a given task (in our case, IOI [1]). However, we didn't optimize the SAEs much for performance in our tests. After putting the paper on arXiv, Alex carried out a more exhaustive search for SAEs that do well on our test for controlling (a.k.a. steering) model output with SAE features. The results show that: SAEs trained on IOI data find interpretable features that come close to matching supervised features (computed with knowledge of the IOI circuit) for the task of editing representations to steer the model. Gated SAEs outperform vanilla SAEs across the board for steering. SAE training metrics like sparsity and loss recovered significantly correlate with how good representation edits are. In particular, sparsity is more strongly correlated than loss recovered. Partial Paper Recap: Towards More Objective SAE Evals Motivation: SAE Evals Are Too Indirect We train SAEs with the goal of finding the true features in LLM representations - but currently, "true features" is more of a vague direction than a well-defined concept in mech interp research. SAE evaluations mostly use indirect measures of performance - ones we hope correlate with the features being the "true" ones, such as the ℓ0 (sparsity) loss, the LLM loss recovered when using SAE reconstructions, and how interpretable the features are. This leaves a big gap in our understanding of the usefulness of SAEs and similar unsupervised methods; it also makes it hard to objectively compare different SAE architectures and/or training algorithms. So, we wanted to develop more objective SAE evaluations, by benchmarking SAEs against features that we know to be meaningful through other means, even if in a narrow context. We chose the IOI task, as it's perhaps the most well-studied example of a non-trivial narrow capability in a real-world LLM (GPT2-Small). We set out to compute a "skyline" for SAE performance: an object of the same "type" as an SAE - a "sparse feature dictionary" - which is constructed and validated "by hand" using our very precise knowledge about IOI. Such an object would allow us to evaluate how close a given SAE is to the limit of what's afforded by its representational power. The IOI circuit (copy of Figure 2 from the IOI paper [1]).
Creating Our Own Feature Dictionaries for the IOI Task With Supervision Following the prior work by Wang et al [1] that discovered the IOI circuit, we conjectured that internal LLM activations for an IOI prompt p (e.g., "When Mary and John went to the store, John gave a book to") can be described using the following three attributes: IO(p): the indirect object token (" Mary" in our example) S(p): the subject token (" John" in our example) Pos(p): whether the IO token comes first or second in the sentence (1st in our example; the alternative would be "When John and Mary went...") And indeed, we found that intermediate activations of the model at a given site (e.g., the output of some attention head) for a prompt p can be approximated as[1] activation(p) ≈ E_{p'~IOI}[activation(p')] + v_{IO=IO(p)} + v_{S=S(p)} + v_{Pos=Pos(p)}, where the vectors v_{IO=...}, ... form a "supervised sparse feature dictionary" that we construct using our prior knowledge about the IOI circuit[2]. In fact, these vectors can be chosen in a very simple way as the (centered) conditional mean, e.g. v_{IO=Mary} = E_{p~IOI}[activation(p) | IO(p)=" Mary"] - E_{p~IOI}[activation(p)]. Not just that, but we can use these vectors for editing individual attributes' values in internal model states in a natural way via feature arithmetic, e.g. to change the IO from " Mary" to " Mike", we can use the activation a_edit...
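
A small sketch of the supervised dictionary construction and the feature-arithmetic edit described above (centered conditional means, one vector per attribute value); tensor shapes and function names are mine.

```python
import torch

def conditional_mean_vectors(activations: torch.Tensor, labels):
    """activations: [n_prompts, d_model]; labels: length-n_prompts list of attribute
    values (e.g. the IO name of each prompt).
    Returns v_{attr=a} = E[activation | attr=a] - E[activation]."""
    global_mean = activations.mean(dim=0)
    vectors = {}
    for value in set(labels):
        mask = torch.tensor([lab == value for lab in labels])
        vectors[value] = activations[mask].mean(dim=0) - global_mean
    return vectors

def edit_attribute(activation, vectors, old_value, new_value):
    # feature arithmetic: e.g. change the IO name from " Mary" to " Mike"
    return activation - vectors[old_value] + vectors[new_value]
```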

The Nonlinear Library
AF - Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning by Dan Braun

The Nonlinear Library

Play Episode Listen Later May 17, 2024 9:00


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Identifying Functionally Important Features with End-to-End Sparse Dictionary Learning, published by Dan Braun on May 17, 2024 on The AI Alignment Forum. A short summary of the paper is presented below. This work was produced by Apollo Research in collaboration with Jordan Taylor (MATS + University of Queensland). TL;DR: We propose end-to-end (e2e) sparse dictionary learning, a method for training SAEs that ensures the features learned are functionally important by minimizing the KL divergence between the output distributions of the original model and the model with SAE activations inserted. Compared to standard SAEs, e2e SAEs offer a Pareto improvement: They explain more network performance, require fewer total features, and require fewer simultaneously active features per datapoint, all with no cost to interpretability. We explore geometric and qualitative differences between e2e SAE features and standard SAE features. Introduction Current SAEs focus on the wrong goal: They are trained to minimize mean squared reconstruction error (MSE) of activations (in addition to minimizing their sparsity penalty). The issue is that the importance of a feature as measured by its effect on MSE may not strongly correlate with how important the feature is for explaining the network's performance. This would not be a problem if the network's activations used a small, finite set of ground truth features -- the SAE would simply identify those features, and thus optimizing MSE would have led the SAE to learn the functionally important features. In practice, however, Bricken et al. observed the phenomenon of feature splitting, where increasing dictionary size while increasing sparsity allows SAEs to split a feature into multiple, more specific features, representing smaller and smaller portions of the dataset. In the limit of large dictionary size, it would be possible to represent each individual datapoint as its own dictionary element. Since minimizing MSE does not explicitly prioritize learning features based on how important they are for explaining the network's performance, an SAE may waste much of its fixed capacity on learning less important features. This is perhaps responsible for the observation that, when measuring the causal effects of some features on network performance, a significant amount is mediated by the reconstruction residual errors (i.e. everything not explained by the SAE) and not mediated by SAE features (Marks et al.). Given these issues, it is therefore natural to ask how we can identify the functionally important features used by the network. We say a feature is functionally important if it is important for explaining the network's behavior on the training distribution. If we prioritize learning functionally important features, we should be able to maintain strong performance with fewer features used by the SAE per datapoint as well as fewer overall features. To optimize SAEs for these properties, we introduce a new training method. We still train SAEs using a sparsity penalty on the feature activations (to reduce the number of features used on each datapoint), but we no longer optimize activation reconstruction.
Instead, we replace the original activations with the SAE output and optimize the KL divergence between the original output logits and the output logits when passing the SAE output through the rest of the network, thus training the SAE end-to-end (e2e). One risk with this method is that it may be possible for the outputs of SAE_e2e to take a different computational pathway through subsequent layers of the network (compared with the original activations) while nevertheless producing a similar output distribution. For example, it might learn a new feature that exploits a particular transformation in a downstream layer that is unused by the regular netw...
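
Here is a rough sketch of the end-to-end objective: splice the SAE's reconstruction into the forward pass and match the original output distribution with a KL term, plus a sparsity penalty, instead of minimising activation MSE. It assumes a TransformerLens-style model (with `run_with_cache` / `run_with_hooks`) and an SAE exposing `encode`/`decode`; both interfaces are illustrative assumptions, and in practice the language model's parameters would be frozen.

```python
import torch
import torch.nn.functional as F

def e2e_sae_loss(model, sae, tokens, layer: int, sparsity_coeff: float):
    """KL between the original logits and the logits obtained when the residual stream
    at `layer` is replaced by the SAE's reconstruction, plus an L1 penalty on the SAE
    feature activations (no activation-reconstruction MSE term)."""
    hook_name = f"blocks.{layer}.hook_resid_pre"

    with torch.no_grad():
        clean_logits, cache = model.run_with_cache(tokens)

    acts = cache[hook_name]                    # [batch, seq, d_model]
    feature_acts = sae.encode(acts)            # assumed SAE interface
    recon = sae.decode(feature_acts)

    def splice(activation, hook=None):
        return recon                           # swap in the SAE output

    sae_logits = model.run_with_hooks(tokens, fwd_hooks=[(hook_name, splice)])

    kl = F.kl_div(
        F.log_softmax(sae_logits, dim=-1),     # distribution with the SAE spliced in
        F.log_softmax(clean_logits, dim=-1),   # original distribution
        log_target=True,
        reduction="batchmean",
    )
    sparsity = feature_acts.abs().sum(dim=-1).mean()
    return kl + sparsity_coeff * sparsity
```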

The Nonlinear Library
LW - Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers by hugofry

The Nonlinear Library

Play Episode Listen Later Apr 30, 2024 19:44


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers, published by hugofry on April 30, 2024 on LessWrong. Two Minute Summary In this post I present my results from training a Sparse Autoencoder (SAE) on a CLIP Vision Transformer (ViT) using the ImageNet-1k dataset. I have created an interactive web app, 'SAE Explorer', to allow the public to explore the visual features the SAE has learnt, found here: https://sae-explorer.streamlit.app/ (best viewed on a laptop). My results illustrate that SAEs can identify sparse and highly interpretable directions in the residual stream of vision models, enabling inference time inspections on the model's activations. To demonstrate this, I have included a 'guess the input image' game on the web app that allows users to guess the input image purely from the SAE activations of a single layer and token of the residual stream. I have also uploaded a (slightly outdated) accompanying talk of my results, primarily listing SAE features I found interesting: https://youtu.be/bY4Hw5zSXzQ. The primary purpose of this post is to demonstrate and emphasise that SAEs are effective at identifying interpretable directions in the activation space of vision models. In this post I highlight a small number of my favourite SAE features to demonstrate some of the abstract concepts the SAE has identified within the model's representations. I then analyse a small number of SAE features using feature visualisation to check the validity of the SAE interpretations. Later in the post, I provide some technical analysis of the SAE. I identify a large cluster of features analogous to the 'ultra-low frequency' cluster that Anthropic identified. In line with existing research, I find that this ultra-low frequency cluster represents a single feature. I then analyse the 'neuron-alignment' of SAE features by comparing the SAE encoder matrix to the MLP out matrix. This research was conducted as part of the ML Alignment and Theory Scholars program 2023/2024 winter cohort. Special thanks to Joseph Bloom for providing generous amounts of his time and support (in addition to the SAE Lens code base) as well as LEAP labs for helping to produce the feature visualisations and weekly meetings with Jessica Rumbelow. Example, animals eating other animals feature: (top 16 highest activating images) Example, Italian feature: Note that the photo of the dog has a watermark with a website ending in .it (Italy's domain name). Note also that the bottom left photo is of Italian writing. The number of ambulances present is a byproduct of using ImageNet-1k. Motivation Frontier AI systems are becoming increasingly multimodal, and capabilities may advance significantly as multimodality increases due to transfer learning between different data modalities and tasks. As a heuristic, consider how much intuition humans gain for the world through visual reasoning; even in abstract settings such as in maths and physics, concepts are often understood most intuitively through visual reasoning. Many cutting edge systems today such as DALL-E and Sora use ViTs trained on multimodal data. Almost by definition, AGI is likely to be multimodal. Despite this, very little effort has been made to apply and adapt our current mechanistic interpretability techniques to vision tasks or multimodal models.
I believe it is important to check that mechanistic interpretability generalises to these systems in order to ensure they are future-proof and can be applied to safeguard against AGI. In this post, I restrict the scope of my research to specifically investigating SAEs trained on multimodal models. The particular multimodal system I investigate is CLIP, a model trained on image-text pairs. CLIP consists of two encoders: a language model and a vision model that are trained to e...

The Nonlinear Library
AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky

The Nonlinear Library

Play Episode Listen Later Apr 30, 2024 41:01


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transcoders enable fine-grained interpretable circuit analysis for language models, published by Jacob Dunefsky on April 30, 2024 on The AI Alignment Forum. Summary We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed. We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable. One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using transcoders in circuit analysis in this way. We performed a set of case studies on GPT2-small that demonstrate that transcoders can be used to decompose circuits into monosemantic, interpretable units of computation. We provide code for training/running/evaluating transcoders and performing circuit analysis with transcoders, and code for the aforementioned case studies carried out using these tools. We also provide a suite of 12 trained transcoders, one for each layer of GPT2-small. All of the code can be found at https://github.com/jacobdunefsky/transcoder_circuits, and the transcoders can be found at https://huggingface.co/pchlenski/gpt2-transcoders. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2024) stream and MATS 5.1 extension. Jacob Dunefsky is currently receiving funding from the Long-Term Future Fund for this work. Background and motivation Mechanistic interpretability is fundamentally concerned with reverse-engineering models' computations into human-understandable parts. Much early mechanistic interpretability work (e.g. indirect object identification) has dealt with decomposing model computations into circuits involving small numbers of model components like attention heads or MLP sublayers. But these component-level circuits operate at too coarse a granularity: due to the relatively small number of components in a model, each individual component will inevitably be important to all sorts of computations, oftentimes playing different roles. In other words, components are polysemantic. Therefore, if we want a more faithful and more detailed understanding of the model, we should aim to find fine-grained circuits that decompose the model's computation onto the level of individual feature vectors. As a hypothetical example of the utility that feature-level circuits might provide in the very near-term: if we have a feature vector that seems to induce gender bias in the model, then understanding which circuits this feature vector partakes in (including which earlier-layer features cause it to activate and which later-layer features it activates) would better allow us to understand the side-effects of debiasing methods. 
More ambitiously, we hope that similar reasoning might apply to a feature that would seem to mediate deception in a future unaligned AI: a fuller understanding of feature-level circuits could help us understand whether this deception feature actually is responsible for the entirety of deception in a model, or help us understand the extent to which alignment methods remove the harmful behavior. Some of the earliest work on SAEs aimed to use them to find such feature-level circuits (e.g. Cunn...
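
For intuition, here is a minimal sketch of a transcoder: identical in shape to an SAE, but trained to map an MLP sublayer's input to that sublayer's output, so its sparse features decompose the MLP's computation itself rather than just reconstructing one activation. The plain ReLU/L1 training choices and names below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Transcoder(nn.Module):
    """Maps the MLP sublayer's input to an approximation of the MLP sublayer's output
    through a wide, sparsely activating feature layer."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_hidden))
        self.W_dec = nn.Parameter(torch.randn(d_hidden, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, mlp_in: torch.Tensor):
        feats = F.relu(mlp_in @ self.W_enc + self.b_enc)   # sparse feature activations
        return feats @ self.W_dec + self.b_dec, feats

def transcoder_loss(transcoder, mlp_in, mlp_out, l1_coeff: float):
    pred, feats = transcoder(mlp_in)
    # predict the MLP's output from its input, plus a sparsity penalty on the features
    return F.mse_loss(pred, mlp_out) + l1_coeff * feats.abs().sum(dim=-1).mean()
```

The contrast with an SAE is only in the training target: an SAE reconstructs the same activation it reads, while a transcoder is supervised on the MLP's output, which is what makes its features usable as interpretable units of the MLP's computation in circuit analysis.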

The Nonlinear Library
AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Apr 25, 2024 1:12


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Improving Dictionary Learning with Gated Sparse Autoencoders, published by Neel Nanda on April 25, 2024 on The AI Alignment Forum. Authors: Senthooran Rajamanoharan*, Arthur Conmy*, Lewis Smith, Tom Lieberum, Vikrant Varma, János Kramár, Rohin Shah, Neel Nanda A new paper from the Google DeepMind mech interp team: Improving Dictionary Learning with Gated Sparse Autoencoders! Gated SAEs are a new Sparse Autoencoder architecture that seems to be a significant Pareto-improvement over normal SAEs, verified on models up to Gemma 7B. They are now our team's preferred way to train sparse autoencoders, and we'd love to see them adopted by the community! (Or to be convinced that it would be a bad idea for them to be adopted by the community!) They achieve similar reconstruction with about half as many firing features, while being either comparably or more interpretable (confidence interval for the increase is 0%-13%). See Sen's Twitter summary, my Twitter summary, and the paper! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
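
For readers who want a picture of the architecture, here is a sketch of a Gated SAE encoder as I understand it from the paper: a gating path decides which features fire, a magnitude path decides how strongly, and the two paths share directions through a per-feature rescaling (weight tying). Treat the details (initialisation, exact tying, auxiliary losses) as assumptions rather than a faithful reimplementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSAEEncoder(nn.Module):
    """Gate path: which features are active. Magnitude path: how strongly.
    The magnitude weights reuse the gate directions scaled per feature."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.W_gate = nn.Parameter(torch.randn(d_model, d_hidden) * 0.01)
        self.r_mag = nn.Parameter(torch.zeros(d_hidden))    # W_mag = exp(r_mag) * W_gate, per feature
        self.b_gate = nn.Parameter(torch.zeros(d_hidden))
        self.b_mag = nn.Parameter(torch.zeros(d_hidden))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_cent = x - self.b_dec
        pi_gate = x_cent @ self.W_gate + self.b_gate                          # gating pre-activations
        mag = F.relu(x_cent @ (self.W_gate * self.r_mag.exp()) + self.b_mag)  # feature magnitudes
        return (pi_gate > 0).float() * mag
```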

The Nonlinear Library
AF - ProLU: A Pareto Improvement for Sparse Autoencoders by Glen M. Taggart

The Nonlinear Library

Play Episode Listen Later Apr 23, 2024 8:59


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: ProLU: A Pareto Improvement for Sparse Autoencoders, published by Glen M. Taggart on April 23, 2024 on The AI Alignment Forum. Abstract This paper presents ProLU, an alternative to ReLU for the activation function in sparse autoencoders that produces a Pareto improvement over the standard sparse autoencoder architectures and sparse autoencoders trained with Sqrt(L1) penalty. Introduction SAE Context and Terminology Learnable parameters of a sparse autoencoder: W_enc: encoder weights, W_dec: decoder weights, b_enc: encoder bias, b_dec: decoder bias. Training Notation: Encoder/Decoder Let encode(x) = ReLU((x - b_dec) W_enc + b_enc) and decode(a) = a W_dec + b_dec, so that the full computation done by an SAE can be expressed as SAE(x) = decode(encode(x)). An SAE is trained with gradient descent on L(x) = ||x - SAE(x)||_2^2 + λ P(encode(x)), where λ is the sparsity penalty coefficient (often "L1 coefficient") and P is the sparsity penalty function, used to encourage sparsity. P is commonly the L1 norm ||a||_1, but recently the Sqrt(L1) penalty ||a||_{1/2}^{1/2} has been shown to produce a Pareto improvement on the L0 and CE metrics. Sqrt(L1) SAEs There has been other work producing Pareto improvements to SAEs by taking P(a) = ||a||_{1/2}^{1/2} as the penalty function. We will use this as a further baseline to compare against when assessing our models. Motivation: Inconsistent Scaling in Sparse Autoencoders Due to the affine translation, sparse autoencoder features with nonzero encoder biases only perfectly reconstruct feature magnitudes at a single point. This poses difficulties if activation magnitudes for a fixed feature tend to vary over a wide range. This potential problem motivates the concept of scale consistency (illustrated in the post by a scale-consistent response curve): the bias maintains its role in noise suppression, but no longer translates activation magnitudes when the feature is active. The lack of gradients for the encoder bias term poses a challenge for learning with gradient descent. This paper will formalize an activation function which gives SAEs this scale-consistent response curve, motivate and propose two plausible synthetic gradients, and compare scale-consistent models trained with the two synthetic gradients to standard SAEs and SAEs trained with Sqrt(L1) penalty. Scale Consistency Desiderata Notation: Centered Submodule The use of the decoder bias can be viewed as performing centering on the inputs to a centered SAE, then reversing the centering on the outputs: SAE(x) = SAE_cent(x - b_dec) + b_dec, where SAE_cent(x) = ReLU(x W_enc + b_enc) W_dec. Notation: Specified Feature Let W^i denote the weights and b^i_enc the encoder bias for the i-th feature. Then, let SAE^i(x) = SAE^i_cent(x - b_dec) + b_dec, where SAE^i_cent(x) = ReLU(x W^i_enc + b^i_enc) W^i_dec. The desiderata are Conditional Linearity and a Noise Suppression Threshold. Methods Proportional ReLU (ProLU) We define the Proportional ReLU (ProLU) as: Backprop with ProLU: To use ProLU in SGD-optimized models, we first address the lack of gradients wrt. the b term. ReLU gradients: For comparison and later use, we will first consider ReLU: partial derivatives are well defined for ReLU at all points other than x_i = 0. Gradients of ProLU: Partials of ProLU wrt. m are similarly well defined. However, they are not well defined wrt. b, so we must synthesize these. Notation: Synthetic Gradients Let a hatted partial derivative denote the synthetic partial derivative of f wrt. x, and a hatted gradient the synthetic gradient of f, used for backpropagation as a stand-in for the true gradient.
Different synthetic gradient types We train two classes of ProLU with different synthetic gradients. These are distinguished by their subscript: ProLU_ReLU and ProLU_STE. They are identical in output, but have different synthetic gradients. ReLU-Like Gradients: ProLU_ReLU The first synthetic gradient is very similar to the gradient for ReLU. We retain the gradient wrt. m, and define the synthetic gradient wrt. b as follows: Thresh STE Derived Gradients: ProLU_STE The second class of Pro...
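
Since the formulas did not survive the text extraction above, here is a hedged reading of the idea in code: a "proportional" activation where the bias only gates (it does not shift active magnitudes), with a simple straight-through surrogate gradient for the bias in the spirit of ProLU_STE. The exact definition and both synthetic gradients in the paper may differ from this sketch.

```python
import torch

class ProLU(torch.autograd.Function):
    """Proportional-ReLU-style activation: output equals the pre-activation m wherever
    the gate m + b > 0 (and m > 0) is open, and 0 otherwise, so b never shifts the
    magnitude of active features."""

    @staticmethod
    def forward(ctx, m, b):
        gate = ((m + b) > 0) & (m > 0)
        ctx.save_for_backward(gate)
        return m * gate.float()

    @staticmethod
    def backward(ctx, grad_out):
        (gate,) = ctx.saved_tensors
        gate = gate.float()
        grad_m = grad_out * gate                 # same as ReLU on the magnitude path
        # b has zero true gradient; pass the upstream gradient straight through on
        # active units (assumes m is [batch, n_features] and b is [n_features])
        grad_b = (grad_out * gate).sum(dim=0)
        return grad_m, grad_b

acts = ProLU.apply(torch.randn(8, 1024), torch.full((1024,), -0.1))
```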

The Nonlinear Library
LW - [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Apr 19, 2024 79:14


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: [Full Post] Progress Update #1 from the GDM Mech Interp Team, published by Neel Nanda on April 19, 2024 on LessWrong. This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order. Activation Steering with SAEs Arthur Conmy, Neel Nanda TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce a better steering vector by removing SAE features which are irrelevant ( Section 4). This is one of the first examples of SAEs having any success for enabling better control of language models, and we are excited to continue exploring this in future work. 1. Background and Motivation We are uncertain about how useful mechanistic interpretability research, including SAE research, will be for AI safety and alignment. Unlike RLHF and dangerous capability evaluation (for example), mechanistic interpretability is not currently very useful for downstream applications on models. Though there are ambitious goals for mechanistic interpretability research such as finding safety-relevant features in language models using SAEs, these are likely not tractable on the relatively small base models we study in all our snippets. To address these two concerns, we decided to study activation steering[1] (introduced in this blog post and expanded on in a paper). We recommend skimming the blog post for an explanation of the technique and examples of what it can do. Briefly, activation steering takes vector(s) from the residual stream on some prompt(s), and then adds these to the residual stream on a second prompt. This makes outputs from the second forward pass have properties inherited from the first forward pass. There is early evidence that this technique could help with safety-relevant properties of LLMs, such as sycophancy. We have tentative early research results that suggest SAEs are helpful for improving and interpreting steering vectors, albeit with limitations. We find these results particularly exciting as they provide evidence that SAEs can identify causally meaningful intermediate variables in the model, indicating that they aren't just finding clusters in the data or directions in logit space, which seemed much more likely before we did this research. We plan to continue this research to further validate SAEs and to gain more intuition about what features SAEs do and don't learn in practice. 2. Setup We use SAEs trained on the residual stream of GPT-2 XL at various layers, the model used in the initial activation steering blog post, inspired by the success of residual stream SAEs on GPT-2 Small ( Bloom, 2024) and Pythia models ( Cunningham et. al, 2023). The SAEs have 131072 learned features, L0 of around 60[2], and loss recovered around 97.5% (e.g. 
splicing in the SAE from Section 3 increases loss from 2.88 to 3.06, compared to the destructive zero ablation intervention resulting in Loss > 10). We don't think this was a particularly high-quality SAE, as the majority of its learned features were dead, and we found limitations with training residual stream SAEs that we will discuss in an upcoming paper. Even despite this, we think the results in this work are tentative evidence for SAEs being useful. It is likely easiest to simpl...
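
A minimal sketch of the steering operation described above, in PyTorch: add a scaled SAE decoder direction to the residual stream. The tensor sizes and the coefficient are illustrative assumptions (the post's actual SAEs have 131072 features trained on GPT-2 XL's residual stream), and `resid` stands in for an activation that would be captured from a real forward pass with hooks.

```python
import torch
import torch.nn.functional as F

def steer_with_sae_feature(resid: torch.Tensor, W_dec: torch.Tensor,
                           feature_idx: int, coefficient: float) -> torch.Tensor:
    """Add a scaled SAE decoder direction to every sequence position."""
    direction = W_dec[feature_idx]          # [d_model]; decoder rows are unit-norm by convention
    return resid + coefficient * direction  # broadcasts over batch and position

# Toy stand-ins for a real residual-stream activation and a trained SAE decoder.
d_model, n_features = 64, 1024
resid = torch.randn(1, 10, d_model)                            # [batch, seq, d_model]
W_dec = F.normalize(torch.randn(n_features, d_model), dim=-1)  # unit-norm feature directions
steered = steer_with_sae_feature(resid, W_dec, feature_idx=7, coefficient=8.0)
```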

The Nonlinear Library
AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Apr 19, 2024 79:14


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Full Update, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum. This is a series of snippets about the Google DeepMind mechanistic interpretability team's research into Sparse Autoencoders, that didn't meet our bar for a full paper. Please start at the summary post for more context, and a summary of each snippet. They can be read in any order. Activation Steering with SAEs Arthur Conmy, Neel Nanda TL;DR: We use SAEs trained on GPT-2 XL's residual stream to decompose steering vectors into interpretable features. We find a single SAE feature for anger which is a Pareto-improvement over the anger steering vector from existing work (Section 3, 3 minute read). We have more mixed results with wedding steering vectors: we can partially interpret the vectors, but the SAE reconstruction is a slightly worse steering vector, and just taking the obvious features produces a notably worse vector. We can produce a better steering vector by removing SAE features which are irrelevant ( Section 4). This is one of the first examples of SAEs having any success for enabling better control of language models, and we are excited to continue exploring this in future work. 1. Background and Motivation We are uncertain about how useful mechanistic interpretability research, including SAE research, will be for AI safety and alignment. Unlike RLHF and dangerous capability evaluation (for example), mechanistic interpretability is not currently very useful for downstream applications on models. Though there are ambitious goals for mechanistic interpretability research such as finding safety-relevant features in language models using SAEs, these are likely not tractable on the relatively small base models we study in all our snippets. To address these two concerns, we decided to study activation steering[1] (introduced in this blog post and expanded on in a paper). We recommend skimming the blog post for an explanation of the technique and examples of what it can do. Briefly, activation steering takes vector(s) from the residual stream on some prompt(s), and then adds these to the residual stream on a second prompt. This makes outputs from the second forward pass have properties inherited from the first forward pass. There is early evidence that this technique could help with safety-relevant properties of LLMs, such as sycophancy. We have tentative early research results that suggest SAEs are helpful for improving and interpreting steering vectors, albeit with limitations. We find these results particularly exciting as they provide evidence that SAEs can identify causally meaningful intermediate variables in the model, indicating that they aren't just finding clusters in the data or directions in logit space, which seemed much more likely before we did this research. We plan to continue this research to further validate SAEs and to gain more intuition about what features SAEs do and don't learn in practice. 2. Setup We use SAEs trained on the residual stream of GPT-2 XL at various layers, the model used in the initial activation steering blog post, inspired by the success of residual stream SAEs on GPT-2 Small ( Bloom, 2024) and Pythia models ( Cunningham et. al, 2023). The SAEs have 131072 learned features, L0 of around 60[2], and loss recovered around 97.5% (e.g. 
splicing in the SAE from Section 3 increases loss from 2.88 to 3.06, compared to the destructive zero ablation intervention resulting in Loss > 10). We don't think this was a particularly high-quality SAE, as the majority of its learned features were dead, and we found limitations with training residual stream SAEs that we will discuss in an upcoming paper. Even despite this, we think the results in this work are tentative evidence for SAEs being useful. It is likely ea...

The Nonlinear Library
AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Apr 19, 2024 5:41


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Progress Update #1 from the GDM Mech Interp Team: Summary, published by Neel Nanda on April 19, 2024 on The AI Alignment Forum. Introduction This is a progress update from the Google DeepMind mechanistic interpretability team, inspired by the Anthropic team's excellent monthly updates! Our goal was to write-up a series of snippets, covering a range of things that we thought would be interesting to the broader community, but didn't yet meet our bar for a paper. This is a mix of promising initial steps on larger investigations, write-ups of small investigations, replications, and negative results. Our team's two main current goals are to scale sparse autoencoders to larger models, and to do further basic science on SAEs. We expect these snippets to mostly be of interest to other mech interp practitioners, especially those working with SAEs. One exception is our infrastructure snippet, which we think could be useful to mechanistic interpretability researchers more broadly. We present preliminary results in a range of areas to do with SAEs, from improving and interpreting steering vectors, to improving ghost grads, to replacing SAE encoders with an inference-time sparse approximation algorithm. Where possible, we've tried to clearly state our level of confidence in our results, and the evidence that led us to these conclusions so you can evaluate for yourself. We expect to be wrong about at least some of the things in here! Please take this in the spirit of an interesting idea shared by a colleague at a lab meeting, rather than as polished pieces of research we're willing to stake our reputation on. We hope to turn some of the more promising snippets into more fleshed out and rigorous papers at a later date. We also have a forthcoming paper on an updated SAE architecture that seems to be a moderate Pareto-improvement, stay tuned! How to read this post: This is a short summary post, accompanying the much longer post with all the snippets. We recommend reading the summaries of each snippet below, and then zooming in to whichever snippets seem most interesting to you. They can be read in any order. Summaries Activation Steering with SAEs We analyse the steering vectors used in Turner et. al, 2023 using SAEs. We find that they are highly interpretable, and that in some cases we can get better performance by constructing interpretable steering vectors from SAE features, though in other cases we struggle to. We hope to better disentangle what's going on in future works. Replacing SAE Encoders with Inference-Time Optimisation There are two sub-problems in dictionary learning, learning the dictionary of feature vectors (an SAE's decoder, $W_{dec}$ and computing the sparse coefficient vector on a given input (an SAE's encoder). The SAE's encoder is a linear map followed by a ReLU, which is a weak function with a range of issues. We explore disentangling these problems by taking a trained SAE, throwing away the encoder, keeping the decoder, and learning the sparse coefficients at inference-time. This lets us study the question of how well the SAE encoder is working while holding the quality of the dictionary constant, and better evaluate the quality of different dictionaries. One notable finding is that high L0 SAEs have higher quality dictionaries than low L0 SAEs, even if we learn coefficients with low L0 at inference time. 
Improving Ghost Grads In their January update, the Anthropic team introduced a new auxiliary loss, "ghost grads", as a potential improvement on resampling for minimising the number of dead features in a SAE. We replicate their work, and find that it under-performs resampling. We present an improvement, multiplying the ghost grads loss by the proportion of dead features, which makes ghost grads competitive. We don't yet see a compelling reason to move away fro...
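
The inference-time optimisation idea above lends itself to a short sketch: hold a trained SAE's decoder fixed and fit non-negative sparse coefficients for a single activation by minimising reconstruction error plus an L1 penalty. This is a hedged illustration with assumed shapes and hyperparameters, not the team's actual algorithm.

```python
import torch

def fit_coefficients(x: torch.Tensor, W_dec: torch.Tensor,
                     l1_coef: float = 1e-3, steps: int = 200, lr: float = 1e-2) -> torch.Tensor:
    """Fit non-negative sparse codes for one activation with the decoder held fixed."""
    c = torch.zeros(W_dec.shape[0], requires_grad=True)
    opt = torch.optim.Adam([c], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        codes = torch.relu(c)                 # enforce non-negativity, as in a ReLU SAE
        recon = codes @ W_dec                 # [d_model]
        loss = (recon - x).pow(2).mean() + l1_coef * codes.sum()
        loss.backward()
        opt.step()
    return torch.relu(c).detach()

# Toy usage with random stand-ins for an activation and a trained decoder.
d_model, n_features = 64, 1024
x = torch.randn(d_model)
W_dec = torch.nn.functional.normalize(torch.randn(n_features, d_model), dim=-1)
codes = fit_coefficients(x, W_dec)
```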

The Nonlinear Library
AF - A Selection of Randomly Selected SAE Features by CallumMcDougall

The Nonlinear Library

Play Episode Listen Later Apr 1, 2024 6:55


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Selection of Randomly Selected SAE Features, published by CallumMcDougall on April 1, 2024 on The AI Alignment Forum. Epistemic status - self-evident. In this post, we interpret a small sample of Sparse Autoencoder features which reveal meaningful computational structure in the model that is clearly highly researcher-independent and of significant relevance to AI alignment. Motivation Recent excitement about Sparse Autoencoders (SAEs) has been mired by the following question: Do SAE features reflect properties of the model, or just capture correlational structure in the underlying data distribution? While a full answer to this question is important and will take deliberate investigation, we note that researchers who've spent large amounts of time interacting with feature dashboards think it's more likely that SAE features capture highly non-trivial information about the underlying models. Evidently, SAEs are the one true answer to ontology identification and as evidence of this, we show how initially uninterpretable features are often quite interpretable with further investigation / tweaking of dashboards. In each case, we describe how we make the best possible use of feature dashboards to ensure we aren't fooling ourselves or reading tea-leaves. Note - to better understand these results, we highly recommend readers who are unfamiliar with SAE Feature Dashboards briefly refer to the relevant section of Anthropic's publication (whose dashboard structure we emulate below). TLDR - to understand what concepts are encoded by features, we look for patterns in the text which causes them to activate most strongly. Case Studies in SAE Features Scripture Feature We open with a feature that seems to activate strongly on examples of sacred text, specifically from the works of Christianity. Even though interpreting SAEs seems bad, and it can really make you mad, seeing features like this reminds us to always look on the bright side of life. Perseverance Feature We register lower confidence in this feature than others, but the top activating examples all seem to present a consistent theme of perseverance and loyalty in the face of immense struggle (this was confirmed with GPT4[1]). We're very excited at how semantic this feature is rather than merely syntactic, since a huge barrier to future progress in dictionary learning is whether we can find features associated with high-level semantic concepts like these. Teamwork Feature We were very surprised with this one, given that the training data for our models was all dated at 2022 or earlier. We welcome any and all theories here. Deciphering Feature Activations with Quantization can be highly informative Most analyses of SAE features have not directly attempted to understand the significance of feature activation strength, but we've found this can be highly informative. Take this feature for example. Due to the apparently highly quantized pattern of activation, we decided to attempt decoding the sequence of max-activating sequences using the Morse code-based mapping {0.0: '/', 0.2: ' ', 1.0: '.', 2.0: '-'}. When we tried this, we found the following pattern: Which translated into Morse code reads as: We weren't sure exactly what to make of this, but more investigation is definitely advisable. Lesson - visualize activation on full prompts to better understand features! 
One feature which at first appeared uninterpretable is pictured below. Clearly this feature fires in DNA strings, but what is it actually tracking? Showing a larger context after the max activating tokens, we begin to see what might be an interpretable pattern in the max activating examples. We did this one more time, and revealed that this in-fact a feature which fires on DNA sequences from the species Rattus Norvegicus (japanese variants in particular). We leave it...

The Nonlinear Library
LW - A Selection of Randomly Selected SAE Features by CallumMcDougall

The Nonlinear Library

Play Episode Listen Later Apr 1, 2024 6:55


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A Selection of Randomly Selected SAE Features, published by CallumMcDougall on April 1, 2024 on LessWrong. Epistemic status - self-evident. In this post, we interpret a small sample of Sparse Autoencoder features which reveal meaningful computational structure in the model that is clearly highly researcher-independent and of significant relevance to AI alignment. Motivation Recent excitement about Sparse Autoencoders (SAEs) has been mired by the following question: Do SAE features reflect properties of the model, or just capture correlational structure in the underlying data distribution? While a full answer to this question is important and will take deliberate investigation, we note that researchers who've spent large amounts of time interacting with feature dashboards think it's more likely that SAE features capture highly non-trivial information about the underlying models. Evidently, SAEs are the one true answer to ontology identification and as evidence of this, we show how initially uninterpretable features are often quite interpretable with further investigation / tweaking of dashboards. In each case, we describe how we make the best possible use of feature dashboards to ensure we aren't fooling ourselves or reading tea-leaves. Note - to better understand these results, we highly recommend readers who are unfamiliar with SAE Feature Dashboards briefly refer to the relevant section of Anthropic's publication (whose dashboard structure we emulate below). TLDR - to understand what concepts are encoded by features, we look for patterns in the text which causes them to activate most strongly. Case Studies in SAE Features Scripture Feature We open with a feature that seems to activate strongly on examples of sacred text, specifically from the works of Christianity. Even though interpreting SAEs seems bad, and it can really make you mad, seeing features like this reminds us to always look on the bright side of life. Perseverance Feature We register lower confidence in this feature than others, but the top activating examples all seem to present a consistent theme of perseverance and loyalty in the face of immense struggle (this was confirmed with GPT4[1]). We're very excited at how semantic this feature is rather than merely syntactic, since a huge barrier to future progress in dictionary learning is whether we can find features associated with high-level semantic concepts like these. Teamwork Feature We were very surprised with this one, given that the training data for our models was all dated at 2022 or earlier. We welcome any and all theories here. Deciphering Feature Activations with Quantization can be highly informative Most analyses of SAE features have not directly attempted to understand the significance of feature activation strength, but we've found this can be highly informative. Take this feature for example. Due to the apparently highly quantized pattern of activation, we decided to attempt decoding the sequence of max-activating sequences using the Morse code-based mapping {0.0: '/', 0.2: ' ', 1.0: '.', 2.0: '-'}. When we tried this, we found the following pattern: Which translated into Morse code reads as: We weren't sure exactly what to make of this, but more investigation is definitely advisable. Lesson - visualize activation on full prompts to better understand features! 
One feature which at first appeared uninterpretable is pictured below. Clearly this feature fires in DNA strings, but what is it actually tracking? Showing a larger context after the max activating tokens, we begin to see what might be an interpretable pattern in the max activating examples. We did this one more time, and revealed that this in-fact a feature which fires on DNA sequences from the species Rattus Norvegicus (japanese variants in particular). We leave it as an exerci...

The Nonlinear Library
AF - SAE reconstruction errors are (empirically) pathological by Wes Gurnee

The Nonlinear Library

Play Episode Listen Later Mar 29, 2024 15:37


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE reconstruction errors are (empirically) pathological, published by Wes Gurnee on March 29, 2024 on The AI Alignment Forum. Summary Sparse Autoencoder (SAE) errors are empirically pathological: when a reconstructed activation vector is distance ϵ from the original activation vector, substituting a randomly chosen point at the same distance changes the next token prediction probabilities significantly less than substituting the SAE reconstruction[1] (measured by both KL and loss). This is true for all layers of the model (~2x to ~4.5x increase in KL and loss over baseline) and is not caused by feature suppression/shrinkage. Assuming others replicate, these results suggest the proxy reconstruction objective is behaving pathologically. I am not sure why these errors occur but expect understanding this gap will give us deeper insight into SAEs while also providing an additional metric to guide methodological progress. Introduction As the interpretability community allocates more resources and increases reliance on SAEs, it is important to understand the limitation and potential flaws of this method. SAEs are designed to find a sparse overcomplete feature basis for a model's latent space. This is done by minimizing the joint reconstruction error of the input data and the L1 norm of the intermediate activations (to promote sparsity): However, the true goal is to find a faithful feature decomposition that accurately captures the true causal variables in the model, and reconstruction error and sparsity are only easy-to-optimize proxy objectives. This begs the questions: how good of a proxy objective is this? Do the reconstructed representations faithfully preserve other model behavior? How much are we proxy gaming? Naively, this training objective defines faithfulness as L2. But, another natural property of a "faithful" reconstruction is that substituting the original activation with the reconstruction should approximately preserve the next-token prediction probabilities. More formally, for a set of tokens T and a model M, let P=M(T) be the model's true next token probabilities. Then let QSAE=M(T|do(xSAE(x))) be the next token probabilities after intervening on the model by replacing a particular activation x (e.g. a residual stream state or a layer of MLP activations) with the SAE reconstruction of x. The more faithful the reconstruction, the lower the KL divergence between P and Q (denoted as DKL(P||QSAE)) should be. In this post, I study how DKL(P||QSAE) compares to several natural baselines based on random perturbations of the activation vectors x which preserve some error property of the SAE construction (e.g., having the same l2 reconstruction error or cosine similarity). I find that the KL divergence is significantly higher (2.2x - 4.5x) for the residual stream SAE reconstruction compared to the random perturbations and moderately higher (0.9x-1.7x) for attention out SAEs. This suggests that the SAE reconstruction is not faithful by our definition, as it does not preserve the next token prediction probabilities. This observation is important because it suggests that SAEs make systematic, rather than random, errors and that continuing to drive down reconstruction error may not actually increase SAE faithfulness. 
This potentially indicates that current SAEs are missing out on important parts of the learned representations of the model. The good news is that this KL-gap presents a clear target for methodological improvement and a new metric for evaluating SAEs. I intend to explore this in future work. Intuition: how big a deal is this (KL) difference? For some intuition, here are several real examples of the top-25 output token probabilities at the end of a prompt when patching in SAE and ϵ-random reconstructions compared to the original model's next-token distributio...
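
In cleaner notation, the comparison described here is between $D_{KL}(P \| Q_{SAE})$, where $Q_{SAE}$ comes from substituting an activation $x$ with $SAE(x)$ (i.e. $\mathrm{do}(x \leftarrow SAE(x))$), and the same KL for a random point at the same $\ell_2$ distance from $x$. Below is a minimal sketch of the baseline construction and the KL computation; `run_from_patched_activation` is a hypothetical helper standing in for re-running the model from a patched activation (e.g. via hooks), not a real API.

```python
import torch
import torch.nn.functional as F

def random_point_at_same_distance(x: torch.Tensor, x_hat: torch.Tensor) -> torch.Tensor:
    """Random vector whose distance from x equals the SAE reconstruction error ||x - x_hat||."""
    eps = (x - x_hat).norm()
    direction = torch.randn_like(x)
    return x + eps * direction / direction.norm()

def kl_divergence(logits_clean: torch.Tensor, logits_patched: torch.Tensor) -> torch.Tensor:
    """D_KL(P || Q) between next-token distributions from the clean and patched runs."""
    log_p = F.log_softmax(logits_clean, dim=-1)
    log_q = F.log_softmax(logits_patched, dim=-1)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

# Hypothetical usage, assuming a hook-based patching helper:
# kl_sae  = kl_divergence(logits_clean, run_from_patched_activation(sae(x)))
# kl_rand = kl_divergence(logits_clean, run_from_patched_activation(random_point_at_same_distance(x, sae(x))))
```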

The Nonlinear Library
LW - SAE reconstruction errors are (empirically) pathological by wesg

The Nonlinear Library

Play Episode Listen Later Mar 29, 2024 15:36


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: SAE reconstruction errors are (empirically) pathological, published by wesg on March 29, 2024 on LessWrong. Summary Sparse Autoencoder (SAE) errors are empirically pathological: when a reconstructed activation vector is distance ϵ from the original activation vector, substituting a randomly chosen point at the same distance changes the next token prediction probabilities significantly less than substituting the SAE reconstruction[1] (measured by both KL and loss). This is true for all layers of the model (~2x to ~4.5x increase in KL and loss over baseline) and is not caused by feature suppression/shrinkage. Assuming others replicate, these results suggest the proxy reconstruction objective is behaving pathologically. I am not sure why these errors occur but expect understanding this gap will give us deeper insight into SAEs while also providing an additional metric to guide methodological progress. Introduction As the interpretability community allocates more resources and increases reliance on SAEs, it is important to understand the limitation and potential flaws of this method. SAEs are designed to find a sparse overcomplete feature basis for a model's latent space. This is done by minimizing the joint reconstruction error of the input data and the L1 norm of the intermediate activations (to promote sparsity): However, the true goal is to find a faithful feature decomposition that accurately captures the true causal variables in the model, and reconstruction error and sparsity are only easy-to-optimize proxy objectives. This begs the questions: how good of a proxy objective is this? Do the reconstructed representations faithfully preserve other model behavior? How much are we proxy gaming? Naively, this training objective defines faithfulness as L2. But, another natural property of a "faithful" reconstruction is that substituting the original activation with the reconstruction should approximately preserve the next-token prediction probabilities. More formally, for a set of tokens T and a model M, let P=M(T) be the model's true next token probabilities. Then let QSAE=M(T|do(xSAE(x))) be the next token probabilities after intervening on the model by replacing a particular activation x (e.g. a residual stream state or a layer of MLP activations) with the SAE reconstruction of x. The more faithful the reconstruction, the lower the KL divergence between P and Q (denoted as DKL(P||QSAE)) should be. In this post, I study how DKL(P||QSAE) compares to several natural baselines based on random perturbations of the activation vectors x which preserve some error property of the SAE construction (e.g., having the same l2 reconstruction error or cosine similarity). I find that the KL divergence is significantly higher (2.2x - 4.5x) for the residual stream SAE reconstruction compared to the random perturbations and moderately higher (0.9x-1.7x) for attention out SAEs. This suggests that the SAE reconstruction is not faithful by our definition, as it does not preserve the next token prediction probabilities. This observation is important because it suggests that SAEs make systematic, rather than random, errors and that continuing to drive down reconstruction error may not actually increase SAE faithfulness. This potentially indicates that current SAEs are missing out on important parts of the learned representations of the model. 
The good news is that this KL-gap presents a clear target for methodological improvement and a new metric for evaluating SAEs. I intend to explore this in future work. Intuition: how big a deal is this (KL) difference? For some intuition, here are several real examples of the top-25 output token probabilities at the end of a prompt when patching in SAE and ϵ-random reconstructions compared to the original model's next-token distribution (note the use of ...

The Nonlinear Library
AF - Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders by Johnny Lin

The Nonlinear Library

Play Episode Listen Later Mar 25, 2024 12:49


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Announcing Neuronpedia: Platform for accelerating research into Sparse Autoencoders, published by Johnny Lin on March 25, 2024 on The AI Alignment Forum. This posts assumes basic familiarity with Sparse Autoencoders. For those unfamiliar with this technique, we highly recommend the introductory sections of these papers. TL;DR Neuronpedia is a platform for mechanistic interpretability research. It was previously focused on crowdsourcing explanations of neurons, but we've pivoted to accelerating researchers for Sparse Autoencoders (SAEs) by hosting models, feature dashboards, data visualizations, tooling, and more. Important Links Explore: The SAE research focused Neuronpedia. Current SAEs for GPT2-Small: RES-JB: Residuals - Joseph Bloom (294k feats) ATT-KK: Attention Out - Connor Kissane + Robert Kryzanowski (344k feats) Upload: Get your SAEs hosted by Neuronpedia: fill out this

The Nonlinear Library
AF - Understanding SAE Features with the Logit Lens by Joseph Isaac Bloom

The Nonlinear Library

Play Episode Listen Later Mar 11, 2024 26:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Understanding SAE Features with the Logit Lens, published by Joseph Isaac Bloom on March 11, 2024 on The AI Alignment Forum. This work was produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort, with support from Neel Nanda and Arthur Conmy. Joseph Bloom is funded by the LTFF, Manifund Regranting Program, donors and LightSpeed Grants. This post makes extensive use of Neuronpedia, a platform for interpretability focusing on accelerating interpretability researchers working with SAEs. Links: SAEs on HuggingFace, Analysis Code Executive Summary This is an informal post sharing statistical methods which can be used to quickly / cheaply better understand Sparse Autoencoder (SAE) features. Firstly, we use statistics (standard deviation, skewness and kurtosis) of the logit weight distributions of features (WuWdec[feature]) to characterize classes of features, showing that many features can be understood as promoting / suppressing interpretable classes of tokens. We propose 3 different kinds of features, analogous to previously characterized " universal neurons": Partition Features, which (somewhat) promote half the tokens and suppress the other half according to capitalization and spaces (example pictured below) Suppression Features, which act like partition features but are more asymmetric. Prediction Features which promote tokens in classes of varying sizes, ranging from promoting tokens that have a close bracket to promoting all present tense verbs. Secondly, we propose a statistical test for whether a feature's output direction is trying to distinguish tokens in some set (eg: "all caps tokens") from the rest. We borrowed this technique from systems biology where it is used at scale frequently. The key limitation here is that we need to know in advance which sets of tokens are promoted / inhibited. Lastly, we demonstrate the utility of the set-based technique by using it to locate features which enrich token categories of interest (defined by regex formulas, NLTK toolkit parts of speech tagger and common baby names for boys/girls). Feature 4467. Above: Feature Dashboard Screenshot from Neuronpedia. It is not immediately obvious from the dashboard what this feature does. Below: Logit Weight distribution classified by whether the token starts with a space, clearly indicating that this feature promotes tokens which lack an initial space character. Introduction In previous work, we trained and open-sourced a set of sparse autoencoders (SAEs) on the residual stream of GPT2 small. In collaboration with Neuronpedia, we've produced feature dashboards, auto-interpretability explanations and interfaces for browsing for ~300k+ features. The analysis in this post is performed on features from the layer 8 residual stream of GPT2 small (for no particular reason). SAEs might enable us to decompose model internals into interpretable components. Currently, we don't have a good way to measure interpretability at scale, but we can generate feature dashboards which show things like how often the feature fires, its direct effect on tokens being sampled (the logit weight distribution) and when it fires (see examples of feature dashboards below). Interpreting the logit weight distribution in feature dashboards for multi-layer models is implicitly using Logit Lens, a very popular technique in mechanistic interpretability. 
Applying the logit lens to features means that we compute the product of a feature direction and the unembed (WuWdec[feature]), referred to as the "logit weight distribution". Since SAEs haven't been around for very long, we don't yet know what the logit weight distributions typically look like for SAE features. Moreover, we find that the form of logit weight distribution can vary greatly. In most cases we see a vaguely normal distribution and s...
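
The "logit weight distribution" here is the feature's decoder direction pushed through the unembedding, $W_{dec}[\text{feature}]\,W_U$, giving one weight per vocabulary token. A rough sketch of computing it and the summary statistics mentioned above (standard deviation, skewness, kurtosis) follows; the weight tensors are random stand-ins with GPT-2-small-like shapes rather than the real model or SAE.

```python
import torch
from scipy.stats import skew, kurtosis

d_model, d_vocab, n_features = 768, 50257, 24576
W_U = torch.randn(d_model, d_vocab)        # unembedding matrix (random stand-in)
W_dec = torch.randn(n_features, d_model)   # SAE decoder (random stand-in)

feature_idx = 4467                          # the feature discussed in the post
logit_weights = W_dec[feature_idx] @ W_U    # [d_vocab]: direct effect on every token's logit

stats = {
    "std": logit_weights.std().item(),
    "skewness": float(skew(logit_weights.numpy())),
    "kurtosis": float(kurtosis(logit_weights.numpy())),
}
print(stats)
```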

The Nonlinear Library
AF - We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To by robertzk

The Nonlinear Library

Play Episode Listen Later Mar 6, 2024 24:45


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We Inspected Every Head In GPT-2 Small using SAEs So You Don't Have To, published by robertzk on March 6, 2024 on The AI Alignment Forum. This is an interim report that we are currently building on. We hope this update will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort Executive Summary In a previous post we trained attention SAEs on every layer of GPT-2 Small and we found that a majority of features are interpretable in all layers. We've since leveraged our SAEs as a tool to explore individual attention heads through the lens of SAE features. Using our SAEs, we inspect the roles of every attention head in GPT-2 small, discovering a wide range of previously unidentified behaviors. We manually examined every one of the 144 attention heads and provide brief descriptions in this spreadsheet. We note that this is a rough heuristic to get a sense of the most salient effects of a head and likely does not capture their role completely. We observe that features become more abstract up to layer 9 and then less so after that. We performed this by interpreting and conceptually grouping the top 10 features attributed to all 144 heads. Working from bottom to top layers, 39 of the 144 heads expressed surprising feature groupings not seen before in a previous head. We provide feature dashboards for each attention head. To validate that our technique captures legitimate phenomena rather than spurious behaviors, we verified that our interpretations are consistent with previously studied heads in GPT-2 small. These include induction heads, previous token heads, successor heads and duplicate token heads. We note that our annotator mostly did not know a priori which heads had previously been studied. To demonstrate that our SAEs can enable novel interpretability insights, we leverage our SAEs to develop a deeper understanding of why there are two induction heads in Layer 5. We show that one does standard induction and the other does "long prefix" induction. We use our technique to investigate the prevalence of attention head polysemanticity. We think that the vast majority of heads (>90%) are performing multiple tasks, but also narrow down a set of 14 candidate heads that are plausibly monosemantic. Introduction In previous work, we trained and open sourced a set of attention SAEs on all 12 layers of GPT-2 Small. We found that random SAE features in each layer were highly interpretable, and highlighted a set of interesting features families. We've since leveraged our SAEs as a tool to interpret the roles of attention heads. The key idea of the technique relies on our SAEs being trained to reconstruct the entire layer, but that contributions to specific heads can be inferred. This allows us to find the top 10 features most salient to a given head, and note whenever there is a pattern that it may suggest a role of that head. We then used this to manually inspect the role of every head in GPT-2 small, and spend the rest of this post exploring various implications of our findings and the technique. In the spirit of An Overview of Early Vision in InceptionV1, we start with a high-level, guided tour of the different behaviors implemented by heads across every layer, building better intuitions for what attention heads learn in a real language model. 
To validate that the technique is teaching something real about the roles of these heads, we confirm that our interpretations match previously studied heads. We note that our annotator mostly did not know a priori which heads had previously been studied. We find: Induction heads ( 5.1, 5.5, 6.9, 7.2, 7.10) Previous token heads ( 4.11) Copy suppression head ( 10.7) Duplicate token heads ( 3.0) Successor head ( 9.1) In addition to building intuition about wh...
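
One way to make "the top features attributed to a given head" concrete: if the SAE reads the concatenation of all heads' outputs, each feature's encoder weights can be sliced per head, and the relative norms of those slices give a rough per-head attribution. The sketch below is an illustrative approximation under that assumption, with random stand-in weights; it is not necessarily the authors' exact procedure.

```python
import torch

n_heads, d_head, n_features = 12, 64, 24576
d_in = n_heads * d_head
W_enc = torch.randn(d_in, n_features)               # stand-in SAE encoder weights

def head_attribution(W_enc: torch.Tensor, feature_idx: int,
                     n_heads: int, d_head: int) -> torch.Tensor:
    """Fraction of a feature's encoder weight norm attributable to each head."""
    w = W_enc[:, feature_idx].reshape(n_heads, d_head)
    norms = w.norm(dim=-1)
    return norms / norms.sum()

print(head_attribution(W_enc, feature_idx=0, n_heads=n_heads, d_head=d_head))
```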

The Nonlinear Library
LW - Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders by Evan Anders

The Nonlinear Library

Play Episode Listen Later Feb 27, 2024 30:14


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Examining Language Model Performance with Reconstructed Activations using Sparse Autoencoders, published by Evan Anders on February 27, 2024 on LessWrong. Note: The second figure in this post originally contained a bug pointed out by @LawrenceC, which has since been fixed. Summary Sparse Autoencoders (SAEs) reveal interpretable features in the activation spaces of language models, but SAEs don't reconstruct activations perfectly. We lack good metrics for evaluating which parts of model activations SAEs fail to reconstruct, which makes it hard to evaluate SAEs themselves. In this post, we argue that SAE reconstructions should be tested using well-established benchmarks to help determine what kinds of tasks they degrade model performance on. We stress-test a recently released set of SAEs for each layer of the gpt2-small residual stream using randomly sampled tokens from Open WebText and the Lambada benchmark where the model must predict a specific next token. The SAEs perform well on prompts with context sizes up to the training context size, but their performance degrades on longer prompts. In contexts shorter than or equal to the training context, the SAEs that we study generally perform well. We find that the performance of our late-layer SAEs is worse than early-layer SAEs, but since the SAEs all have the same width, this may just be because there are more features to resolve in later layers and our SAEs don't resolve them. In contexts longer than the training context, SAE performance is poor in general, but it is poorest in earlier layers and best in later layers. Introduction Last year, Anthropic and EleutherAI/Lee Sharkey's MATS stream showed that sparse autoencoders (SAEs) can decompose language model activations into human-interpretable features. This has led to a significant uptick in the number of people training SAEs and analyzing models with them. However, SAEs are not perfect autoencoders and we still lack a thorough understanding of where and how they miss information. But how do we know if an SAE is "good" other than the fact that it has features we can understand? SAEs try to reconstruct activations in language models - but they don't do this perfectly. Imperfect activation reconstruction can lead to substantial downstream cross-entropy (CE) loss increases. Generally "good" SAEs retrieve 80-99% of the CE loss (compared to a generous baseline of zero ablation), but only retrieving 80% of the CE loss is enough to substantially degrade the performance of a model to that of a much smaller model (per scaling laws). The second basic metric often used in SAE evaluation is the average per-token ℓ0 norm of the hidden layer of the autoencoder. Generally this is something in the range of ~10-60 in a "good" autoencoder, which means that the encoder is sparse. Since we don't know how many features are active per token in natural language, it's useful to at least ask how changes in ℓ0 relate to changes in SAE loss values. If high-loss data have drastically different ℓ0 from the SAE's average performance during training, that can be evidence of either off-distribution data (compared to the training data) or some kind of data with more complex information. The imperfect performance of SAEs on these metrics could be explained in a couple of ways: The fundamental assumptions of SAEs are mostly right, but we're bad at training SAEs. 
Perhaps if we learn to train better SAEs, these problems will become less bad. Perhaps we need to accept higher ℓ0 norms (more features active per token). This would not be ideal for interpretability, though. Perhaps there's part of the signal which is dense or hard for an SAE to learn and so we are systematically missing some kind of information. Maybe a more sophisticated sparsity enforcement could help with this. The fundamental assumption...
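
The "% CE loss recovered" figure quoted here is usually computed against two reference points: the clean model's loss and the loss when the activation is zero-ablated. A minimal sketch of that metric, with illustrative numbers rather than measurements from the post:

```python
def ce_loss_recovered(ce_clean: float, ce_patched: float, ce_zero: float) -> float:
    """1.0 means the SAE reconstruction matches the original activation's loss;
    0.0 means it is no better than zero-ablating the activation entirely."""
    return 1.0 - (ce_patched - ce_clean) / (ce_zero - ce_clean)

# Example with made-up numbers, not measurements from the post:
print(ce_loss_recovered(ce_clean=3.3, ce_patched=3.5, ce_zero=10.0))  # ~0.97
```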

The Nonlinear Library
LW - Do sparse autoencoders find "true features"? by Demian Till

The Nonlinear Library

Play Episode Listen Later Feb 22, 2024 17:08


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Do sparse autoencoders find "true features"?, published by Demian Till on February 22, 2024 on LessWrong. In this post I'll discuss an apparent limitation of sparse autoencoders (SAEs) in their current formulation as they are applied to discovering the latent features within AI models such as transformer-based LLMs. In brief, I'll cover the following: I'll argue that the L1 regularisation used to promote sparsity when training SAEs may cause neurons in the sparse layer to learn to represent common combinations of features rather than the individual features that we want them to discover As well as making it more difficult to understand what the actual latent features are, I'll also argue that this limitation may result in some less common latent features not being discovered at all, not even within combinations I'll then explain why I think that the phenomenon of feature splitting observed in Anthropic's SAE paper appears to demonstrate that this limitation does indeed have a large impact on the features discovered by SAEs Finally I'll propose an approach for overcoming this limitation and discuss how we can test whether it really brings us closer to finding the real latent features Rough definition of "true features" We intend for SAEs to discover the "true features" (a term I'm borrowing from Anthropic's SAE paper) used by the target model e.g. a transformer-based LLM. There isn't a universally accepted definition of what "true features" are, but for now I'll use the term somewhat loosely to refer to something like: linear directions in an activation space at a hidden layer within a target model which encode some reasonably monosemantic quantity such as the model's "confidence" in some concept being in play they should play a causal role in the functioning of the target model. So for example if we were to activate or deactivate the feature while the target model is processing a given input sequence then we should expect the outputs to change accordingly in some reasonably understandable way they should be in their most atomic form, so that e.g an arbitrary linear combination of two "true feature" directions is not necessarily itself a "true feature" direction even though it may satisfy the previous criteria There may be other ways of thinking about features but this should give us enough to work with for our current purposes. Why SAEs are incentivised to discover combinations of features rather than individual features Consider a toy setup where one of the hidden layers in the target model has 3 "true features" represented by the following directions in its activation space: Additionally, suppose that feature 1 and feature 2 occur far more frequently than feature 3, and that all features can potentially co-occur in a given activation vector. For the sake of simplicity let's also suppose for now that when features 1 & 2 occur together they tend to both activate with some roughly fixed proportions. For example, an activation vector in which both features 1 and 2 are present (but not feature 3) might look like the following: Now suppose we train an SAE with 3 neurons in the sparse layer on activation vectors from this hidden layer such as the one above. The desirable outcome is that each of the 3 neurons in the sparse layer learns one of the 3 "true features". 
If this happens then the directions learnt by SAE would mirror the directions of the "true features" in the target model, looking something like: However depending on the respective frequencies of feature 3 vs features 1 & 2, as well as the value of the L1 regularisation weight, I will argue shortly that what may happen is that two of the neurons learn to detect when each of features 1 & 2 respectively occur by themselves, while the third neuron learns to detect when they both occur together. In this case the di...
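
For concreteness, the toy setup described here (three ground-truth directions, with features 1 and 2 active far more often than feature 3 and free to co-occur) is easy to simulate; the dimension, directions, and probabilities below are illustrative stand-ins, and training a small SAE on such samples is one way to check whether a sparse-layer neuron learns the "features 1 and 2 together" combination.

```python
import torch
import torch.nn.functional as F

d = 16
true_features = F.normalize(torch.randn(3, d), dim=-1)   # three ground-truth directions
p_active = torch.tensor([0.4, 0.4, 0.02])                # feature 3 is much rarer

def sample_activations(n: int) -> torch.Tensor:
    """Each sample is the sum of the directions of whichever features are active."""
    mask = (torch.rand(n, 3) < p_active).float()
    return mask @ true_features

batch = sample_activations(1024)   # train a 3-neuron SAE on this to probe the failure mode
```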

The Nonlinear Library
LW - Fixing Feature Suppression in SAEs by Benjamin Wright

The Nonlinear Library

Play Episode Listen Later Feb 16, 2024 16:31


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Fixing Feature Suppression in SAEs, published by Benjamin Wright on February 16, 2024 on LessWrong. Produced as part of the ML Alignment Theory Scholars Program - Winter 2023-24 Cohort as part of Lee Sharkey's stream. Sparse autoencoders are a method of resolving superposition by recovering linearly encoded "features" inside activations. Unfortunately, despite the great recent success of SAEs at extracting human interpretable features, they fail to perfectly reconstruct the activations. For instance, Cunningham et al. (2023) note that replacing the residual stream of layer 2 of Pythia-70m with the reconstructed output of an SAE increased the perplexity of the model on the Pile from 25 to 40. It is important for interpretability that the features we extract accurately represent what the model is doing. In this post, I show how and why SAEs have a reconstruction gap due to 'feature suppression'. Then, I look at a few ways to fix this while maintaining SAE interpretability. By modifying and fine-tuning a pre-trained SAE, we achieve a 9% decrease in mean square error and a 24% reduction in the perplexity increase upon patching activations into the LLM. Finally, I connect a theoretical example to the observed amounts of feature suppression in Pythia 70m, confirming that features are suppressed primarily based on the strength of their activations, not on their frequency of activation. Feature Suppression The architecture of an SAE is: $f(x) = \mathrm{ReLU}(W_e x + b_e)$, $y = W_d f(x) + b_d$. The loss function usually combines an MSE reconstruction loss with a sparsity term, like $L(x, f(x), y) = \|y - x\|^2 / d + c\,|f(x)|$, where $d$ is the dimension of $x$. When training the SAE on this loss, the decoder's weight matrix is fixed to have unit norm for each feature (column). The reason for feature suppression is simple: The training loss has two terms, only one of which is reconstruction. Therefore, reconstruction isn't perfect. In particular, the loss function pushes for smaller $f(x)$ values, leading to suppressed features and worse reconstruction. An illustrative example of feature suppression As an example, consider the trivial case where there is only one binary feature in one dimension. That is, $x = 1$ with probability $p$ and $x = 0$ otherwise. Then, ideally the optimal SAE would extract feature activations of $f(x) \in \{0, 1\}$ and have a decoder with $W_d = 1$. However, if we were to train an SAE optimizing the loss function $L(x, f(x), y) = \|y - x\|^2 + c\,|f(x)|$, we get a different result. If we ignore bias terms for simplicity of argument, and say that the encoder outputs feature activation $a$ if $x = 1$ and $0$ otherwise, then the optimization problem becomes: $a = \arg\min_a \, p\,L(1, a, a) + (1-p)\,L(0, 0, 0) = \arg\min_a \, (a-1)^2 + |a|\,c = \arg\min_a \, a^2 + (c-2)\,a + 1$, giving $a = 1 - \frac{c}{2}$. Therefore the feature is scaled by a factor of $1 - c/2$ compared to optimal. This is an example of feature suppression. If we allow the ground truth feature to have an activation strength $g$ upon activation and dimension $d$, this factor becomes: $a = 1 - \frac{cd}{2g}$. In other words, instead of having the ground truth activation $g$, the SAE learns an activation of $g - \frac{cd}{2}$, a constant amount less. Features with activation strengths below $\frac{cd}{2}$ would be completely killed off by the SAE.
Feature suppression is a significant problem in current SAEs To experimentally verify that feature suppression affects SAEs, we first trained SAEs on the residual stream output of each layer of Pythia-70m with an L1 sparsity penalty (coefficient 2e-3) on 6 epochs of 100 million tokens of OpenWebText, with batch size 64 and learning rate 1e-3, resulting in roughly 13-80 feature activations per token. The residual stream of Pythia-70m had a dimension size of 512 and we used a dictionary size of 2048, for a four times scale up. If feature suppression had a noticeable effect, we'd see that the SAE reconstructions had noticeably smaller L2 norm...
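
As a sanity check on the closed-form result above (a suppressed optimum of $a = g - \frac{cd}{2}$ for a feature of strength $g$ in dimension $d$), the same value can be recovered numerically; the values below are illustrative, not taken from the post's Pythia-70m experiments.

```python
import torch

g, d, c = 2.0, 512, 2e-3                       # illustrative strength, dimension, L1 coefficient
a = torch.tensor(0.5, requires_grad=True)
opt = torch.optim.Adam([a], lr=1e-2)
for _ in range(5000):
    opt.zero_grad()
    loss = (a - g).pow(2) / d + c * a.abs()    # per-example toy loss: MSE/d plus L1 on the code
    loss.backward()
    opt.step()

print(a.item(), g - c * d / 2)                 # numerical optimum vs analytic g - c*d/2 = 1.488
```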

The Nonlinear Library
AF - Attention SAEs Scale to GPT-2 Small by Connor Kissane

The Nonlinear Library

Play Episode Listen Later Feb 3, 2024 13:13


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Attention SAEs Scale to GPT-2 Small, published by Connor Kissane on February 3, 2024 on The AI Alignment Forum. This is an interim report that we are currently building on. We hope this update + open sourcing our SAEs will be useful to related research occurring in parallel. Produced as part of the ML Alignment & Theory Scholars Program - Winter 2023-24 Cohort Executive Summary In a previous post, we showed that sparse autoencoders (SAEs) work on the attention layer outputs of a two layer transformer. We scale our attention SAEs to GPT-2 Small, and continue to find sparse interpretable features in every layer. This makes us optimistic about our ongoing efforts scaling further, especially since we didn't have to do much iterating We open source our SAEs. Load them from Hugging Face or this colab notebook The SAEs seem good, often recovering more than 80% of the loss relative to zero ablation, and are sparse with less than 20 features firing on average. The majority of the live features are interpretable We continue to find the same three feature families that we found in the two layer model: induction features, local context features, and high level context features. This suggests that some of our lessons interpreting features in smaller models may generalize We also find new, interesting feature families that we didn't find in the two layer model, providing hints about fundamentally different capabilities in GPT-2 Small See our feature interface to browse the first 30 features for each layer Introduction In Sparse Autoencoders Work on Attention Layer Outputs we showed that we can apply SAEs to extract sparse interpretable features from the last attention layer of a two layer transformer. We have since applied the same technique to a 12-layer model, GPT-2 Small, and continue to find sparse, interpretable features in every layer. Our SAEs often recover more than 80% of the loss[1], and are sparse with less than 20 features firing on average. We perform shallow investigations of the first 30 features from each layer, and we find that the majority (often 80%+) of non-dead SAE features are interpretable (see the interactive visualizations for each layer). We open source our SAEs in hope that they will be useful to other researchers currently working on dictionary learning. We are particularly excited about using these SAEs to better understand attention circuits at the feature level. See the SAEs on Hugging Face or load them using this colab notebook. Below we provide the key metrics for each SAE:
Layer  L0 norm  Loss recovered  Dead features  Alive features interpretable
L0     3        99%             13%            97%
L1     20       78%             49%            87%
L2     16       90%             20%            95%
L3     15       84%             8%             75%
L4     15       88%             5%             100%
L5     20       85%             40%            82%
L6     19       82%             28%            75%
L7     19       83%             58%            70%
L8     20       76%             37%            64%
L9     21       83%             48%            85%
L10    16       85%             41%            81%
L11    8        89%             84%            66%
It's worth noting that we didn't do much differently to train these,[2] leaving us optimistic about the tractability of scaling attention SAEs to even bigger models. Excitingly, we also continue to identify feature families. We find features from all three of the families that we identified in the two layer model: induction features, local context features, and high level context features. This provides us hope that some of our lessons from interpreting features in smaller models will continue to generalize.
We also find new, interesting feature families in GPT-2 Small, suggesting that attention SAEs can provide valuable hints about new[3] capabilities that larger models have learned. Some new features include: Successor features, which activate when predicting the next item in a sequence such as "15, 16" -> "17" (which are partly coming from Successor Heads in the model), and boost the logits of the next item. Name mover features, which predict a name in the context, such as in the IOI task Duplicate token f...
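
Since the SAEs are open-sourced, loading one is mostly a matter of downloading a checkpoint and applying the standard ReLU encoder. The snippet below is a hedged sketch only: the repository name, filename, and state-dict keys are placeholders (the post's Hugging Face and Colab links give the real ones), not the actual release layout.

```python
import torch
from huggingface_hub import hf_hub_download

# Placeholder repo and filename -- substitute the real locations from the post's links.
path = hf_hub_download(repo_id="<author>/attn-saes-gpt2-small", filename="layer_5.pt")
state = torch.load(path, map_location="cpu")

# Assumed state-dict keys for a standard ReLU SAE; the real checkpoint may differ.
W_enc, b_enc = state["W_enc"], state["b_enc"]
W_dec, b_dec = state["W_dec"], state["b_dec"]

def encode(x: torch.Tensor) -> torch.Tensor:
    """Standard ReLU SAE encoder: sparse feature activations for an attention-output vector."""
    return torch.relu(x @ W_enc + b_enc)
```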

História em Meia Hora
Monarquia Brasileira

História em Meia Hora

Play Episode Listen Later Jan 24, 2024 35:40


Besides Adriano "the Emperor", who made his name in Italy, do you know who Brazil's other emperors were? For more than 60 years, Dom Pedro I and Dom Pedro II were responsible for governing our nation, in a period defined by important milestones in the history of Brazil. Set aside thirty minutes of your day and learn from Professor Vítor Soares (@profvitorsoares) about the Brazilian Monarchy. - If you want access to exclusive episodes and want to help História em Meia Hora stay on its feet, click the link: www.apoia.se/historiaemmeiahora - Buy the book "História em Meia Hora - Grandes Civilizações"! https://www.loja.literatour.com.br/produto/pre-venda-livro-historia-em-meia-hora-grandes-civilizacoesversao-capa-dura/ - Buy our shirts, hoodies and many other History-themed items at Lolja! www.lolja.com.br/creators/historia-em-meia-hora/ - PIX and contact: historiaemmeiahora@gmail.com   Presented by: Prof. Vítor Soares. Script: Prof. Vítor Soares and Prof. Victor Alexandre (@profvictoralexandre) - REFERENCES USED: - CARVALHO, José Murilo de. Cidadania no Brasil: o longo caminho. 27ª ed. Rio de Janeiro: Civilização Brasileira, 2021. - CARVALHO, José Murilo de. D. Pedro II. 1ª ed. São Paulo: Companhia das Letras, 2007. - COSTA, Emilia Viotti da. Da Monarquia à República. 9ª ed. São Paulo: Editora Unesp, 2010. - GUIMARÃES, Lucia Maria Paschoal. Ação, reação e transação: a pena de aluguel e a historiografia. IN: CARVALHO, José Murilo de. Nação e cidadania no Império: novos horizontes. Rio de Janeiro: Civilização Brasileira, 2007. - MAGALHÃES JÚNIOR, Raimundo. Três panfletários do segundo reinado. Academia Brasileira de Letras, 2009. - RIBEIRO, Filipe Nicoletti. Império das incertezas: política e partidos nas décadas finais da monarquia brasileira (1868-1889). 2015. Dissertação (Mestrado em História Social) - Faculdade de Filosofia, Letras e Ciências Humanas, University of São Paulo, São Paulo, 2015. - SABA, Roberto N.P.F. As "eleições do cacete" e o problema da manipulação eleitoral no Brasil monárquico. Almanack. Guarulhos, n.02, p.126-145, 2º semestre de 2011. - SAES, Décio. Monarquia e Capitalismo. Revista de Sociologia e Política. Nº1, 1993. - SECRETO, Maria Verônica. (Des)medidos: a revolta dos quebra-quilos (1874-1876). Rio de Janeiro: FAPERJ, 2011. - VASCONCELLOS, Zacarias de Góes e. Da natureza e limites do poder moderador. 2ª ed. Rio Grande do Sul: Clube Rebouças, 2022.