Podcast appearances and mentions of Neel Nanda

70 PODCASTS
270 EPISODES
35m AVG DURATION
1 MONTHLY NEW EPISODE
Sep 15, 2025 LATEST

POPULARITY (chart by year, 2017 to 2024)

Best podcasts about Neel Nanda

The Nonlinear Library (130 episodes)
CHEATIES with Lace Larrabee and Katherine Blanford (3 episodes)
Put Your Hands Together with Cam and Rhea (5 episodes)
Man School 202 (2 episodes)
The Tasteless Gentlemen (2 episodes)
Mango Bae (2 episodes)
American Desis Podcast (4 episodes)
The Nonlinear Library: LessWrong (39 episodes)
Machine Learning Street Talk (2 episodes)

Latest podcast episodes about Neel Nanda

Neel Nanda on leading a Google DeepMind team at 26 – and advice if you want to work at an AI company (part 2)

80,000 Hours Podcast with Rob Wiblin

Sep 15, 2025 · 106:49

At 26, Neel Nanda leads an AI safety team at Google DeepMind, has published dozens of influential papers, and mentored 50 junior researchers, seven of whom now work at major AI companies. His secret? “It's mostly luck,” he says, but “another part is what I think of as maximising my luck surface area.”

Video, full transcript, and links to learn more: https://80k.info/nn2

This means creating as many opportunities as possible for surprisingly good things to happen:

* Write publicly.
* Reach out to researchers whose work you admire.
* Say yes to unusual projects that seem a little scary.

Nanda's own path illustrates this perfectly. He started a challenge to write one blog post per day for a month to overcome perfectionist paralysis. Those posts helped seed the field of mechanistic interpretability and, incidentally, led to meeting his partner of four years.

His YouTube channel features unedited three-hour videos of him reading through famous papers and sharing thoughts. One has 30,000 views. “People were into it,” he shrugs.

Most remarkably, he ended up running DeepMind's mechanistic interpretability team. He'd joined expecting to be an individual contributor, but when the team lead stepped down, he stepped up despite having no management experience. “I did not know if I was going to be good at this. I think it's gone reasonably well.”

His core lesson: “You can just do things.” This sounds trite but is a useful reminder all the same. Doing things is a skill that improves with practice. Most people overestimate the risks and underestimate their ability to recover from failures. And as Neel explains, junior researchers today have a superpower previous generations lacked: large language models that can dramatically accelerate learning and research.

In this extended conversation, Neel and host Rob Wiblin discuss all that and some other hot takes from Neel's four years at Google DeepMind. (And be sure to check out part one of Rob and Neel's conversation!)

What did you think of the episode? https://forms.gle/6binZivKmjjiHU6dA

Chapters:
Cold open (00:00:00)
Who's Neel Nanda? (00:01:12)
Luck surface area and making the right opportunities (00:01:46)
Writing cold emails that aren't insta-deleted (00:03:50)
How Neel uses LLMs to get much more done (00:09:08)
“If your safety work doesn't advance capabilities, it's probably bad safety work” (00:23:22)
Why Neel refuses to share his p(doom) (00:27:22)
How Neel went from the couch to an alignment rocketship (00:31:24)
Navigating towards impact at a frontier AI company (00:39:24)
How does impact differ inside and outside frontier companies? (00:49:56)
Is a special skill set needed to guide large companies? (00:56:06)
The benefit of risk frameworks: early preparation (01:00:05)
Should people work at the safest or most reckless company? (01:05:21)
Advice for getting hired by a frontier AI company (01:08:40)
What makes for a good ML researcher? (01:12:57)
Three stages of the research process (01:19:40)
How do supervisors actually add value? (01:31:53)
An AI PhD – with these timelines?! (01:34:11)
Is career advice generalisable, or does everyone get the advice they don't need? (01:40:52)
Remember: You can just do things (01:43:51)

This episode was recorded on July 21.

Video editing: Simon Monsour and Luke Monsour
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

#222 – Neel Nanda on the race to read AI minds

80,000 Hours Podcast with Rob Wiblin

Sep 8, 2025 · 181:11

We don't know how AIs think or why they do what they do. Or at least, we don't know much. That fact is only becoming more troubling as AIs grow more capable and appear on track to wield enormous cultural influence, directly advise on major government decisions, and even operate military equipment autonomously. We simply can't tell what models, if any, should be trusted with such authority.

Neel Nanda of Google DeepMind is one of the founding figures of the field of machine learning trying to fix this situation: mechanistic interpretability (or “mech interp”). The project has generated enormous hype, exploding from a handful of researchers five years ago to hundreds today, all working to make sense of the jumble of tens of thousands of numbers that frontier AIs use to process information and decide what to say or do.

Full transcript, video, and links to learn more: https://80k.info/nn1

Neel now has a warning for us: the most ambitious vision of mech interp he once dreamed of is probably dead. He doesn't see a path to deeply and reliably understanding what AIs are thinking. The technical and practical barriers are simply too great to get us there in time, before competitive pressures push us to deploy human-level or superhuman AIs. Indeed, Neel argues no one approach will guarantee alignment, and our only choice is the “Swiss cheese” model of accident prevention, layering multiple safeguards on top of one another.

But while mech interp won't be a silver bullet for AI safety, it has nevertheless had some major successes and will be one of the best tools in our arsenal.

For instance: by inspecting the neural activations in the middle of an AI's thoughts, we can pick up many of the concepts the model is thinking about, from the Golden Gate Bridge, to refusing to answer a question, to the option of deceiving the user. While we can't know all the thoughts a model is having all the time, picking up 90% of the concepts it is using 90% of the time should help us muddle through, so long as mech interp is paired with other techniques to fill in the gaps.

This episode was recorded on July 17 and 21, 2025.

Interested in mech interp? Apply by September 12 to be a MATS scholar with Neel as your mentor! http://tinyurl.com/neel-mats-app

What did you think? https://forms.gle/xKyUrGyYpYenp8N4A

Chapters:
Cold open (00:00)
Who's Neel Nanda? (01:02)
How would mechanistic interpretability help with AGI (01:59)
What's mech interp? (05:09)
How Neel changed his take on mech interp (09:47)
Top successes in interpretability (15:53)
Probes can cheaply detect harmful intentions in AIs (20:06)
In some ways we understand AIs better than human minds (26:49)
Mech interp won't solve all our AI alignment problems (29:21)
Why mech interp is the 'biology' of neural networks (38:07)
Interpretability can't reliably find deceptive AI – nothing can (40:28)
'Black box' interpretability: reading the chain of thought (49:39)
'Self-preservation' isn't always what it seems (53:06)
For how long can we trust the chain of thought (01:02:09)
We could accidentally destroy chain of thought's usefulness (01:11:39)
Models can tell when they're being tested and act differently (01:16:56)
Top complaints about mech interp (01:23:50)
Why everyone's excited about sparse autoencoders (SAEs) (01:37:52)
Limitations of SAEs (01:47:16)
SAEs performance on real-world tasks (01:54:49)
Best arguments in favour of mech interp (02:08:10)
Lessons from the hype around mech interp (02:12:03)
Where mech interp will shine in coming years (02:17:50)
Why focus on understanding over control (02:21:02)
If AI models are conscious, will mech interp help us figure it out (02:24:09)
Neel's new research philosophy (02:26:19)
Who should join the mech interp field (02:38:31)
Advice for getting started in mech interp (02:46:55)
Keeping up to date with mech interp results (02:54:41)
Who's hiring and where to work? (02:57:43)

Host: Rob Wiblin
Video editing: Simon Monsour, Luke Monsour, Dominic Armstrong, and Milo McGuire
Audio engineering: Ben Cordell, Milo McGuire, Simon Monsour, and Dominic Armstrong
Music: Ben Cordell
Camera operator: Jeremy Chevillotte
Coordination, transcriptions, and web: Katy Moore

“Narrow Misalignment is Hard, Emergent Misalignment is Easy” by Edward Turner, Anna Soligo, Senthooran Rajamanoharan, Neel Nanda

LessWrong Curated Podcast

Jul 18, 2025 · 11:13

Anna and Ed are co-first authors for this work. We're presenting these results as a research update for a continuing body of work, which we hope will be interesting and useful for others working on related topics.

TL;DR: We investigate why models become misaligned in diverse contexts when fine-tuned on narrow harmful datasets (emergent misalignment), rather than learning the specific narrow task. We successfully train narrowly misaligned models using KL regularization to preserve behavior in other domains. These models give bad medical advice, but do not respond in a misaligned manner to general non-medical questions. We use this method to train narrowly misaligned steering vectors, rank 1 LoRA adapters and rank 32 LoRA adapters, and compare these to their generally misaligned counterparts. The steering vectors are particularly interpretable; we introduce Training Lens as a tool for analysing the revealed residual stream geometry. The general misalignment solution is consistently more [...]

Outline:
(00:27) TL;DR
(02:03) Introduction
(04:03) Training a Narrowly Misaligned Model
(07:13) Measuring Stability and Efficiency
(10:00) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

First published: July 14th, 2025
Source: https://www.lesswrong.com/posts/gLDSqQm8pwNiq7qst/narrow-misalignment-is-hard-emergent-misalignment-is-easy
Narrated by TYPE III AUDIO.

“Interpretability Will Not Reliably Find Deceptive AI” by Neel Nanda

LessWrong Curated Podcast

May 5, 2025 · 13:15

(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.)

TL;DR: I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise. Interpretability seems a valuable tool here and remains worth investing in, as it will hopefully increase the reliability we can achieve. However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy. It is not the one thing that will save us, and it still won't be enough for high reliability.

Introduction: There's a common, often implicit, argument made in AI safety discussions: interpretability is presented as the only reliable path forward for detecting deception in advanced AI - among many other sources it was argued for in [...]

Outline:
(00:55) Introduction
(02:57) High Reliability Seems Unattainable
(05:12) Why Won't Interpretability be Reliable?
(07:47) The Potential of Black-Box Methods
(08:48) The Role of Interpretability
(12:02) Conclusion

The original text contained 5 footnotes which were omitted from this narration.

First published: May 4th, 2025
Source: https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai
Narrated by TYPE III AUDIO.

“Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study” by Adam Karvonen

LessWrong Curated Podcast

Apr 16, 2025 · 21:00

Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.

Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think that there will be an interim period where a significant [...]

Outline:
(01:28) The Evaluation
(02:29) Visual Errors
(04:03) Physical Reasoning Errors
(06:09) Why do LLM's struggle with physical tasks?
(07:37) Improving on physical tasks may be difficult
(10:14) Potential Implications of Uneven Automation
(11:48) Conclusion
(12:24) Appendix
(12:44) Visual Errors
(14:36) Physical Reasoning Errors

First published: April 14th, 2025
Source: https://www.lesswrong.com/posts/r3NeiHAEWyToers4F/frontier-ai-models-still-fail-at-basic-physical-tasks-a
Narrated by TYPE III AUDIO.

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

LessWrong Curated Podcast

Apr 12, 2025 · 57:32

Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn't consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.

TL;DR: To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts [...]

Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix

The original text contained 7 footnotes which were omitted from this narration.

First published: March 26th, 2025
Source: https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
Narrated by TYPE III AUDIO.

“Good Research Takes are Not Sufficient for Good Strategic Takes” by Neel Nanda

LessWrong Curated Podcast

Mar 23, 2025 · 6:58

TL;DR: Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically.

Introduction: I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big picture: "What's the theory of change for interpretability?", "Is this really going to help with alignment?", "Does any of this matter if we can't ensure all labs take alignment seriously?". And I think people take my answers to these way too seriously. These are great questions, and I'm happy to try answering them. But I've noticed a bit of a pathology: people seem to assume that because I'm (hopefully!) good at the research, I'm automatically well-qualified [...]

Outline:
(00:32) Introduction
(02:45) Factors of Good Strategic Takes
(05:41) Conclusion

First published: March 22nd, 2025
Source: https://www.lesswrong.com/posts/P5zWiPF5cPJZSkiAK/good-research-takes-are-not-sufficient-for-good-strategic
Narrated by TYPE III AUDIO.

Do AI As Engineering Instead

Fluidity

Dec 15, 2024 · 15:47

Current AI practice is not engineering, even when it aims for practical applications, because it is not based on scientific understanding. Enforcing engineering norms on the field could lead to considerably safer systems. https://betterwithout.ai/AI-as-engineering

This episode has a lot of links! Here they are.

Michael Nielsen's “The role of ‘explanation' in AI”. https://michaelnotebook.com/ongoing/sporadica.html#role_of_explanation_in_AI
Subbarao Kambhampati's “Changing the Nature of AI Research”. https://dl.acm.org/doi/pdf/10.1145/3546954
Chris Olah and his collaborators: “Thread: Circuits”. distill.pub/2020/circuits/
“An Overview of Early Vision in InceptionV1”. distill.pub/2020/circuits/early-vision/
Dai et al., “Knowledge Neurons in Pretrained Transformers”. https://arxiv.org/pdf/2104.08696.pdf
Meng et al.: “Locating and Editing Factual Associations in GPT.” rome.baulab.info
“Mass-Editing Memory in a Transformer,” https://arxiv.org/pdf/2210.07229.pdf
François Chollet on image generators putting the wrong number of legs on horses: twitter.com/fchollet/status/1573879858203340800
Neel Nanda's “Longlist of Theories of Impact for Interpretability”, https://www.lesswrong.com/posts/uK6sQCNMw8WKzJeCQ/a-longlist-of-theories-of-impact-for-interpretability
Zachary C. Lipton's “The Mythos of Model Interpretability”. https://arxiv.org/abs/1606.03490
Meng et al., “Locating and Editing Factual Associations in GPT”. https://arxiv.org/pdf/2202.05262.pdf
Belrose et al., “Eliciting Latent Predictions from Transformers with the Tuned Lens”. https://arxiv.org/abs/2303.08112
“Progress measures for grokking via mechanistic interpretability”. https://arxiv.org/abs/2301.05217
Conmy et al., “Towards Automated Circuit Discovery for Mechanistic Interpretability”. https://arxiv.org/abs/2304.14997
Elhage et al., “Softmax Linear Units,” transformer-circuits.pub/2022/solu/index.html
Filan et al., “Clusterability in Neural Networks,” https://arxiv.org/pdf/2103.03386.pdf
Cammarata et al., “Curve circuits,” distill.pub/2020/circuits/curve-circuits/

You can support the podcast and get episodes a week early, by supporting the Patreon: https://www.patreon.com/m/fluidityaudiobooks

If you like the show, consider buying me a coffee: https://www.buymeacoffee.com/mattarnold

Original music by Kevin MacLeod.

This podcast is under a Creative Commons Attribution Non-Commercial International 4.0 License.

Neel Nanda - Mechanistic Interpretability (Sparse Autoencoders)

Machine Learning Street Talk

Dec 7, 2024 · 222:36

Neel Nanda, a senior research scientist at Google DeepMind, leads their mechanistic interpretability team. In this extensive interview, he discusses his work trying to understand how neural networks function internally. At just 25 years old, Nanda has quickly become a prominent voice in AI research after completing his pure mathematics degree at Cambridge in 2020.

Nanda reckons that machine learning is unique because we create neural networks that can perform impressive tasks (like complex reasoning and software engineering) without understanding how they work internally. He compares this to having computer programs that can do things no human programmer knows how to write. His work focuses on "mechanistic interpretability": attempting to uncover and understand the internal structures and algorithms that emerge within these networks.

SPONSOR MESSAGES:
***
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. https://centml.ai/pricing/

Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier, focussed on ARC and AGI. They just acquired MindsAI, the current winners of the ARC challenge. Are you interested in working on ARC, or getting involved in their events? Go to https://tufalabs.ai/
***

SHOWNOTES, TRANSCRIPT, ALL REFERENCES (DON'T MISS!): https://www.dropbox.com/scl/fi/36dvtfl3v3p56hbi30im7/NeelShow.pdf?rlkey=pq8t7lyv2z60knlifyy17jdtx&st=kiutudhc&dl=0

We riff on:
* How neural networks develop meaningful internal representations beyond simple pattern matching
* The effectiveness of chain-of-thought prompting and why it improves model performance
* The importance of hands-on coding over extensive paper reading for new researchers
* His journey from Cambridge to working with Chris Olah at Anthropic and eventually Google DeepMind
* The role of mechanistic interpretability in AI safety

NEEL NANDA:
https://www.neelnanda.io/
https://scholar.google.com/citations?user=GLnX3MkAAAAJ&hl=en
https://x.com/NeelNanda5

Interviewer - Tim Scarfe

TOC:
1. Part 1: Introduction
[00:00:00] 1.1 Introduction and Core Concepts Overview
2. Part 2: Outside Interview
[00:06:45] 2.1 Mechanistic Interpretability Foundations
3. Part 3: Main Interview
[00:32:52] 3.1 Mechanistic Interpretability
4. Neural Architecture and Circuits
[01:00:31] 4.1 Biological Evolution Parallels
[01:04:03] 4.2 Universal Circuit Patterns and Induction Heads
[01:11:07] 4.3 Entity Detection and Knowledge Boundaries
[01:14:26] 4.4 Mechanistic Interpretability and Activation Patching
5. Model Behavior Analysis
[01:30:00] 5.1 Golden Gate Claude Experiment and Feature Amplification
[01:33:27] 5.2 Model Personas and RLHF Behavior Modification
[01:36:28] 5.3 Steering Vectors and Linear Representations
[01:40:00] 5.4 Hallucinations and Model Uncertainty
6. Sparse Autoencoder Architecture
[01:44:54] 6.1 Architecture and Mathematical Foundations
[02:22:03] 6.2 Core Challenges and Solutions
[02:32:04] 6.3 Advanced Activation Functions and Top-k Implementations
[02:34:41] 6.4 Research Applications in Transformer Circuit Analysis
7. Feature Learning and Scaling
[02:48:02] 7.1 Autoencoder Feature Learning and Width Parameters
[03:02:46] 7.2 Scaling Laws and Training Stability
[03:11:00] 7.3 Feature Identification and Bias Correction
[03:19:52] 7.4 Training Dynamics Analysis Methods
8. Engineering Implementation
[03:23:48] 8.1 Scale and Infrastructure Requirements
[03:25:20] 8.2 Computational Requirements and Storage
[03:35:22] 8.3 Chain-of-Thought Reasoning Implementation
[03:37:15] 8.4 Latent Structure Inference in Language Models

AF - Showing SAE Latents Are Not Atomic Using Meta-SAEs by Bart Bussmann

AF - Extracting SAE task features for ICL by Dmitrii Kharlapenko

AF - Self-explaining SAE features by Dmitrii Kharlapenko

LW - Understanding Positional Features in Layer 0 SAEs by bilalchughtai

AF - BatchTopK: A Simple Improvement for TopK-SAEs by Bart Bussmann

AF - JumpReLU SAEs + Early Access to Gemma 2 SAEs by Neel Nanda

AF - Stitching SAEs of different sizes by Bart Bussmann

LW - AI Alignment Research Engineer Accelerator (ARENA): Call for applicants v4.0 by James Fox

AF - An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2 by Neel Nanda

LW - How ARENA course material gets made by CallumMcDougall

AF - OthelloGPT learned a bag of heuristics by jylin04

LW - So you want to work on technical AI safety by gw

LW - Building intuition with spaced repetition systems by Jacob G-W

AF - Mechanistic Interpretability Workshop Happening at ICML 2024! by Neel Nanda

AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky

AF - Refusal in LLMs is mediated by a single direction by Andy Arditi

AF - Superposition is not "just" neuron polysemanticity by Lawrence Chan

AF - Improving Dictionary Learning with Gated Sparse Autoencoders by Neel Nanda

AF - Progress Update #1 from the GDM Mech Interp Team: Full Update by Neel Nanda

LW - [Full Post] Progress Update #1 from the GDM Mech Interp Team by Neel Nanda

AF - Progress Update #1 from the GDM Mech Interp Team: Summary by Neel Nanda

LW - The Best Tacit Knowledge Videos on Every Subject by Parker Conley

AF - AtP*: An efficient and scalable method for localizing LLM behaviour to components by Neel Nanda

AF - Laying the Foundations for Vision and Multimodal Mechanistic Interpretability & Open Problems by Sonia Joseph

AF - Understanding SAE Features with the Logit Lens by Joseph Isaac Bloom

LW - A Chess-GPT Linear Emergent World Representation by karvonenadam

LW - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Bloom

AF - Open Source Sparse Autoencoders for all Residual Stream Layers of GPT2-Small by Joseph Isaac Bloom

Nikki Haley is The Worst

AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane

AF - Fact Finding: Do Early Layers Specialise in Local Processing? (Post 5) by Neel Nanda

AF - Intro to Superposition & Sparse Autoencoders (Colab exercises) by CallumMcDougall

LW - Polysemantic Attention Head in a 4-Layer Transformer by Jett

EA - AI Alignment Research Engineer Accelerator (ARENA): call for applicants by TheMcDouglas

EA - EAGxVirtual: Speaker announcements, timings, and other updates by Sasha Berezhnoi

LW - Neel Nanda on the Mechanistic Interpretability Researcher Mindset by Michaël Trazzi

AF - Paper Walkthrough: Automated Circuit Discovery with Arthur Conmy by Neel Nanda

Neel Nanda - Mechanistic Interpretability

E26: [Bonus Episode] Connor Leahy on AGI, GPT-4, and Cognitive Emulation w/ FLI Podcast

Episode 220 w Neel Nanda & Friends

Neel Nanda