Podcasts about MLPS

  • 113PODCASTS
  • 216EPISODES
  • 32mAVG DURATION
  • 1EPISODE EVERY OTHER WEEK
  • May 13, 2025LATEST

POPULARITY

20172018201920202021202220232024


Best podcasts about MLPS

Latest podcast episodes about MLPS

Lead-Lag Live
Pipeline Powerhouses: Mastering MLP Investments

Lead-Lag Live

Play Episode Listen Later May 13, 2025 29:24 Transcription Available


Dive into the often misunderstood world of Master Limited Partnerships (MLPs) with Jay Hatfield of Infrastructure Capital as he clarifies exactly what makes these unique investment vehicles tick. Far from simply being "pipeline stocks," MLPs represent a sophisticated investment opportunity combining advantageous tax structures with stable cash flows and attractive yields.Jay breaks down the fundamental economics driving pipeline companies, explaining why they remain remarkably resilient even during periods of energy price volatility. Unlike direct energy producers, these infrastructure businesses operate primarily through long-term contracts and acreage dedications, creating predictable revenue streams regardless of short-term commodity fluctuations. Currently yielding around 7% with 5% annual distribution growth, today's MLPs target double-digit total returns while maintaining conservative financial policies.The conversation highlights how natural gas infrastructure stands at the intersection of several major global trends. As electricity demand surges from AI development, electric vehicles, and broader electrification, natural gas remains essential for grid stability—something even renewable-heavy regions like Spain and Portugal have learned through experience. Meanwhile, policy shifts under the Trump administration supporting LNG exports create substantial growth runways for companies transporting America's abundant natural gas resources to global markets hungry for cleaner energy alternatives.Perhaps most compelling for investors is the portfolio diversification MLPs offer, showing only 60-70% correlation to broader markets while providing meaningful income. The industry's evolution over recent years has created stronger, more resilient companies with national operations, investment-grade balance sheets, and sustainable distribution policies. For retirement-focused investors especially, these characteristics make MLPs worth serious consideration as part of a balanced portfolio strategy.Ready to explore how MLPs might fit into your investment approach? Visit infracapfunds.com to learn more about AMZA and other specialized ETFs designed to capture opportunities in this dynamic sector. Sign up to The Lead-Lag Report on Substack and get 30% off the annual subscription today by visiting http://theleadlag.report/leadlaglive. Support the show

This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Exploring the Biology of LLMs with Circuit Tracing with Emmanuel Ameisen - #727

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

Play Episode Listen Later Apr 14, 2025 94:06


In this episode, Emmanuel Ameisen, a research engineer at Anthropic, returns to discuss two recent papers: "Circuit Tracing: Revealing Language Model Computational Graphs" and "On the Biology of a Large Language Model." Emmanuel explains how his team developed mechanistic interpretability methods to understand the internal workings of Claude by replacing dense neural network components with sparse, interpretable alternatives. The conversation explores several fascinating discoveries about large language models, including how they plan ahead when writing poetry (selecting the rhyming word "rabbit" before crafting the sentence leading to it), perform mathematical calculations using unique algorithms, and process concepts across multiple languages using shared neural representations. Emmanuel details how the team can intervene in model behavior by manipulating specific neural pathways, revealing how concepts are distributed throughout the network's MLPs and attention mechanisms. The discussion highlights both capabilities and limitations of LLMs, showing how hallucinations occur through separate recognition and recall circuits, and demonstrates why chain-of-thought explanations aren't always faithful representations of the model's actual reasoning. This research ultimately supports Anthropic's safety strategy by providing a deeper understanding of how these AI systems actually work. The complete show notes for this episode can be found at https://twimlai.com/go/727.

AXRP - the AI X-risk Research Podcast
40 - Jason Gross on Compact Proofs and Interpretability

AXRP - the AI X-risk Research Podcast

Play Episode Listen Later Mar 28, 2025 156:05


How do we figure out whether interpretability is doing its job? One way is to see if it helps us prove things about models that we care about knowing. In this episode, I speak with Jason Gross about his agenda to benchmark interpretability in this way, and his exploration of the intersection of proofs and modern machine learning. Patreon: https://www.patreon.com/axrpodcast Ko-fi: https://ko-fi.com/axrpodcast Transcript: https://axrp.net/episode/2025/03/28/episode-40-jason-gross-compact-proofs-interpretability.html   Topics we discuss, and timestamps: 0:00:40 - Why compact proofs 0:07:25 - Compact Proofs of Model Performance via Mechanistic Interpretability 0:14:19 - What compact proofs look like 0:32:43 - Structureless noise, and why proofs 0:48:23 - What we've learned about compact proofs in general 0:59:02 - Generalizing 'symmetry' 1:11:24 - Grading mechanistic interpretability 1:43:34 - What helps compact proofs 1:51:08 - The limits of compact proofs 2:07:33 - Guaranteed safe AI, and AI for guaranteed safety 2:27:44 - Jason and Rajashree's start-up 2:34:19 - Following Jason's work   Links to Jason: Github: https://github.com/jasongross Website: https://jasongross.github.io Alignment Forum: https://www.alignmentforum.org/users/jason-gross   Links to work we discuss: Compact Proofs of Model Performance via Mechanistic Interpretability: https://arxiv.org/abs/2406.11779 Unifying and Verifying Mechanistic Interpretability: A Case Study with Group Operations: https://arxiv.org/abs/2410.07476 Modular addition without black-boxes: Compressing explanations of MLPs that compute numerical integration: https://arxiv.org/abs/2412.03773 Stage-Wise Model Diffing: https://transformer-circuits.pub/2024/model-diffing/index.html Causal Scrubbing: a method for rigorously testing interpretability hypotheses: https://www.lesswrong.com/posts/JvZhhzycHu2Yd57RN/causal-scrubbing-a-method-for-rigorously-testing Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition (aka the Apollo paper on APD): https://arxiv.org/abs/2501.14926 Towards Guaranteed Safe AI: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-45.pdf     Episode art by Hamish Doodles: hamishdoodles.com

Lead-Lag Live
Navigating Volatile Markets Through Income with Jay Hatfield

Lead-Lag Live

Play Episode Listen Later Mar 20, 2025 52:04 Transcription Available


Amid rising market turbulence, finding stable income sources has become increasingly crucial for investors seeking portfolio resilience. In this compelling discussion, Jay Hatfield draws on his 35 years of Wall Street experience to illuminate the path forward for income-focused investing strategies that can weather economic uncertainty.Hatfield challenges conventional wisdom with his razor-sharp macroeconomic analysis, demonstrating why tariffs are actually deflationary rather than inflationary and how this misunderstanding creates opportunities for well-positioned investors. His forecast that the 10-year Treasury will drop to 3.75% as the Federal Reserve finally acknowledges economic slowdown provides a framework for strategic positioning across asset classes.The discussion reveals why traditional S&P 500 portfolios yielding just 1.3% simply can't generate meaningful income in today's environment. Instead, Hatfield outlines a comprehensive approach using preferred stocks (PFFA yielding ~9%), high-yield bonds (BNDS yielding ~8%), dividend-paying stocks, and reformed MLPs to create substantial income streams while managing risk. His insights on small caps are particularly compelling – currently trading at significant discounts to large caps, they offer both attractive income and growth potential as rates decline and M&A activity accelerates.What sets this conversation apart is Hatfield's practical approach to portfolio construction. Most investors unknowingly carry excessive technology exposure through their index funds and individual holdings, leaving them vulnerable to tech sector volatility. By strategically incorporating income-producing assets, investors can create more balanced portfolios that generate consistent returns regardless of market conditions. As Hatfield notes, "staying out of trouble is about 90% of the battle" when it comes to long-term investment success.DISCLAIMER – PLEASE READ: This is a sponsored episode for which Lead-Lag Publishing, LLC has been paid a fee. Lead-Lag Publishing, LLC does not guarantee the accuracy or completeness of the information provided in the episode or make any representation as to its quality. All statements and expressions provided in this episode are the sole opinion of Infrastructure Capital and Lead-Lag Publishing, LLC expressly disclaims any responsibility for action taken in connection with the information provided in the discussion. The content in this program is for informational purposes only. You should not construe any information or other material as investment, financial, tax, or other advice. The views expressed by the participants are solely their own. A participant may have taken or recommended any investment position discussed, but may close such position or alter its recommendation at any time without notice. Nothing contained in this program constitutes a solicitation, recommendation, endorsement, or offer to buy or sell any securities or other financial instruments in any jurisdiction. Please consult your own investment or financial advisor for advice related to all investment decisions. Sign up to The Lead-Lag Report on Substack and get 30% off the annual subscription today by visiting http://theleadlag.report/leadlaglive. Foodies unite…with HowUdish!It's social media with a secret sauce: FOOD! The world's first network for food enthusiasts. HowUdish connects foodies across the world!Share kitchen tips and recipe hacks. Discover hidden gem food joints and street food. Find foodies like you, connect, chat and organize meet-ups!HowUdish makes it simple to connect through food anywhere in the world.So, how do YOU dish? Download HowUdish on the Apple App Store today:

Lead-Lag Live
Will Rhind on High Income Pass-Through Securities, Yield Boost Strategies, and Strategic Portfolio Diversification

Lead-Lag Live

Play Episode Listen Later Feb 20, 2025 54:59 Transcription Available


Prepare to transform your income strategy as we explore the world of High Income Pass-Through Securities (HIPS) with our expert guest, Will. Discover how these unique investment vehicles can serve as robust alternatives during high inflation by tapping into the power of REITs, MLPs, closed-end funds, and business development companies. These securities not only offer the potential for higher income levels than traditional fixed-income sources but also come with significant tax advantages. Join us as we uncover the strategic design behind HIPS and their resilience during tumultuous times, including the 2020 COVID-19 pandemic, all while emphasizing the critical role of diversification.We spotlight the remarkable stability of the HIPS income portfolio during market upheavals, highlighting its ability to maintain consistent income distributions when the going gets tough. You'll gain insights into the tax efficiency of HIPS and explore the nature of return on capital, debunking common misconceptions about pass-through securities' expense ratios. Together, we'll navigate the intricate relationship between inflation, wage stagnation, and the rising demand for yield-generating investment products, offering a comprehensive understanding of why HIPS stand out as a compelling option in uncertain economic climates.Additionally, venture into the innovative Yield Boost strategy, an options-selling approach that combines high yield generation with downside protection through selling out-of-the-money put options. We'll use Tesla as a case study to demonstrate how Yield Boost achieves impressive returns while minimizing NAV erosion. By comparing this strategy with HIPS, we reveal how both methods can cater to those seeking consistent income and total return in their investment portfolios. Tune in for a deep dive into crafting a well-rounded portfolio that meets fixed liabilities while offering income certainty and robust total returns.DISCLAIMER – PLEASE READ: This is a sponsored episode for which Lead-Lag Publishing, LLC has been paid a fee. Lead-Lag Publishing, LLC does not guarantee the accuracy or completeness of the information provided in the episode or make any representation as to its quality. All statements and expressions provided in this episode are the sole opinion of GraniteShares and Lead-Lag Publishing, LLC expressly disclaims any responsibility for action taken in connection with the information provided in the discussion. The content in this program is for informational purposes only. You should not construe any information or other material as investment, financial, tax, or other advice. The views expressed by the participants are solely their own. A participant may have taken or recommended any investment position discussed, but may close such position or alter its recommendation at any time without notice. Nothing contained in this program constitutes a solicitation, recommendation, endorsement, or offer to buy or sell any securities or other financial instruments in any jurisdiction. Please consult your own investment or financial advisor for advice related to all investment decisions. Sign up to The Lead-Lag Report on Substack and get 30% off the annual subscription today by visiting http://theleadlag.report/leadlaglive. Foodies unite…with HowUdish!It's social media with a secret sauce: FOOD! The world's first network for food enthusiasts. HowUdish connects foodies across the world!Share kitchen tips and recipe hacks. Discover hidden gem food joints and street food. Find foodies like you, connect, chat and organize meet-ups!HowUdish makes it simple to connect through food anywhere in the world.So, how do YOU dish? Download HowUdish on the Apple App Store today:

Machine Learning Guide
MLG 033 Transformers

Machine Learning Guide

Play Episode Listen Later Feb 9, 2025 42:14


Try a walking desk while studying ML or working on your projects! 3Blue1Brown videos Background & Motivation: RNN Limitations: Sequential processing prevents full parallelization—even with attention tweaks—making them inefficient on modern hardware. Breakthrough: “Attention Is All You Need” replaced recurrence with self-attention, unlocking massive parallelism and scalability. Core Architecture: Layer Stack: Consists of alternating self-attention and feed-forward (MLP) layers, each wrapped in residual connections and layer normalization. Positional Encodings: Since self-attention is permutation invariant, add sinusoidal or learned positional embeddings to inject sequence order. Self-Attention Mechanism: Q, K, V Explained: Query (Q): The representation of the token seeking contextual info. Key (K): The representation of tokens being compared against. Value (V): The information to be aggregated based on the attention scores. Multi-Head Attention: Splits Q, K, V into multiple “heads” to capture diverse relationships and nuances across different subspaces. Dot-Product & Scaling: Computes similarity between Q and K (scaled to avoid large gradients), then applies softmax to weigh V accordingly. Masking: Causal Masking: In autoregressive models, prevents a token from “seeing” future tokens, ensuring proper generation. Padding Masks: Ignore padded (non-informative) parts of sequences to maintain meaningful attention distributions. Feed-Forward Networks (MLPs): Transformation & Storage: Post-attention MLPs apply non-linear transformations; many argue they're where the “facts” or learned knowledge really get stored. Depth & Expressivity: Their layered nature deepens the model's capacity to represent complex patterns. Residual Connections & Normalization: Residual Links: Crucial for gradient flow in deep architectures, preventing vanishing/exploding gradients. Layer Normalization: Stabilizes training by normalizing across features, enhancing convergence. Scalability & Efficiency Considerations: Parallelization Advantage: Entire architecture is designed to exploit modern parallel hardware, a huge win over RNNs. Complexity Trade-offs: Self-attention's quadratic complexity with sequence length remains a challenge; spurred innovations like sparse or linearized attention. Training Paradigms & Emergent Properties: Pretraining & Fine-Tuning: Massive self-supervised pretraining on diverse data, followed by task-specific fine-tuning, is the norm. Emergent Behavior: With scale comes abilities like in-context learning and few-shot adaptation, aspects that are still being unpacked. Interpretability & Knowledge Distribution: Distributed Representation: “Facts” aren't stored in a single layer but are embedded throughout both attention heads and MLP layers. Debate on Attention: While some see attention weights as interpretable, a growing view is that real “knowledge” is diffused across the network's parameters.

SL Advisors Talks Energy
Energy Policies Are Moving Right

SL Advisors Talks Energy

Play Episode Listen Later Jan 8, 2025 5:40


Data center demand for natural gas was the big energy story last year.  Wells Fargo referred to “a momentous year for midstream” with this as the biggest driver. It's been consistently cited by JPMorgan and Morgan Stanley. Wells Fargo calculated that C-corps outperformed MLPs by 23%. This was a substantial difference and means that the […]

Adam and Jordana
Council Member Vetaw rips MLPS budget proposal

Adam and Jordana

Play Episode Listen Later Dec 13, 2024 14:22


Minneapolis Council Member LaTrisha Vetaw joins Adam.

The Business of Healthcare Podcast
The Business of Healthcare Podcast, Episode 122: Medical-Legal Partnerships

The Business of Healthcare Podcast

Play Episode Listen Later Oct 28, 2024 19:01


In episode 122 of The Business of Healthcare Podcast, Dr. Quinton Nottingham with Virginia Tech University joins host Dan Karnuta for a discussion about medical-legal partnerships. Their discussion explores what MLPs are, their operational frameworks and the ways these patient-centered partnerships can address social determinants of health to improve health outcomes. Nottingham is head of the Business Information Technology Department within the Pamplin College of Business at Virginia Tech. Karnuta is director of the Professional Program in Healthcare Management at The University of Texas at Dallas Naveen Jindal School of Management.

Yet Another MBA G.O.A.T. Podcast
The Business of Healthcare Podcast, Episode 122: Medical-Legal Partnerships

Yet Another MBA G.O.A.T. Podcast

Play Episode Listen Later Oct 28, 2024 19:01


In episode 122 of The Business of Healthcare Podcast, Dr. Quinton Nottingham with Virginia Tech University joins host Dan Karnuta for a discussion about medical-legal partnerships. Their discussion explores what MLPs are, their operational frameworks and the ways these patient-centered partnerships can address social determinants of health to improve health outcomes. Nottingham is head of the Business Information Technology Department within the Pamplin College of Business at Virginia Tech. Karnuta is director of the Professional Program in Healthcare Management at The University of Texas at Dallas Naveen Jindal School of Management.

ETF Edge
A second look at “preferred” stocks and MLPs 9/30/24

ETF Edge

Play Episode Listen Later Sep 30, 2024 14:35


As active management and income-seeking come together more tightly, “preferred” stocks and energy MLPs are sitting at the nexus.          

Greater Possibilities
Fueling the future

Greater Possibilities

Play Episode Listen Later Aug 16, 2024 35:54


In the four years since the pandemic started, master limited partnerships (MLPs) have outperformed the broad market. Why? And what's going to drive performance going forward? Brian Watson joins the podcast to discuss how MLPs went from “boring entities” to key players in conversations around artificial intelligence, bitcoin mining, and electric vehicles. (Invesco Distributors, Inc.)

AMA Journal of Ethics
Ethics Talk: What Do MLPs Offer Undocumented Patients?

AMA Journal of Ethics

Play Episode Listen Later Aug 1, 2024 28:52


Lynette Martins joins Ethics Talk to discuss how medical-legal partnerships can help undocumented patient-clients.  Recorded April 11, 2024.   Read the full August issue on Standards in Medical-Legal Partnerships for free at JournalOfEthics.org

Lori & Julia
7/10 Wednesday Hr 1: Kim K replacing Vanna?!

Lori & Julia

Play Episode Listen Later Jul 10, 2024 38:15


We debate on whether or not Kim Kardashian will be a replacement for Vanna White. Is Ellen done after her MLPS show?WTF is with these commercials?? Kate Winslet opens up about some iconic Titanic scenes and Laugh Tracks are terrible. Learn more about your ad choices. Visit podcastchoices.com/adchoicesSee Privacy Policy at https://art19.com/privacy and California Privacy Notice at https://art19.com/privacy#do-not-sell-my-info.

Lori & Julia
7/10 Wednesday Hr 1: Kim K replacing Vanna?!

Lori & Julia

Play Episode Listen Later Jul 10, 2024 41:15


We debate on whether or not Kim Kardashian will be a replacement for Vanna White. Is Ellen done after her MLPS show? WTF is with these commercials?? Kate Winslet opens up about some iconic Titanic scenes and Laugh Tracks are terrible. Learn more about your ad choices. Visit megaphone.fm/adchoices

Excess Returns
Energy Infrastructure Investing with Greg Reid

Excess Returns

Play Episode Listen Later May 30, 2024 57:49


In this episode of Excess Returns, we speak with Greg Reid, President of Real Assets at Westwood Group Holdings and lead portfolio manager for their $2 billion energy investment team. We cover the outlook for energy and the major areas of the energy market. We discuss the future of clean energy and why traditional fuels may play a bigger role in it than renewables. We also dig deep into energy infrastructure investing and MLPs and Greg's process for selecting investments in the space and their new ETF that pairs energy investing with covered call writing to boost income. SEE LATEST EPISODES ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://excessreturnspod.com FIND OUT MORE ABOUT VALIDEA ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://www.validea.com⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ FIND OUT MORE ABOUT VALIDEA CAPITAL ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://www.valideacapital.com⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ FOLLOW JACK Twitter: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://twitter.com/practicalquant⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ LinkedIn: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://www.linkedin.com/in/jack-forehand-8015094⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ FOLLOW JUSTIN Twitter: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://twitter.com/jjcarbonneau⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠ LinkedIn: ⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠⁠https://www.linkedin.com/in/jcarbonneau⁠⁠⁠⁠⁠

Papers Read on AI
KAN: Kolmogorov-Arnold Networks

Papers Read on AI

Play Episode Listen Later May 6, 2024 93:54


Inspired by the Kolmogorov-Arnold representation theorem, we propose Kolmogorov-Arnold Networks (KANs) as promising alternatives to Multi-Layer Perceptrons (MLPs). While MLPs have fixed activation functions on nodes ("neurons"), KANs have learnable activation functions on edges ("weights"). KANs have no linear weights at all -- every weight parameter is replaced by a univariate function parametrized as a spline. We show that this seemingly simple change makes KANs outperform MLPs in terms of accuracy and interpretability. For accuracy, much smaller KANs can achieve comparable or better accuracy than much larger MLPs in data fitting and PDE solving. Theoretically and empirically, KANs possess faster neural scaling laws than MLPs. For interpretability, KANs can be intuitively visualized and can easily interact with human users. Through two examples in mathematics and physics, KANs are shown to be useful collaborators helping scientists (re)discover mathematical and physical laws. In summary, KANs are promising alternatives for MLPs, opening opportunities for further improving today's deep learning models which rely heavily on MLPs. 2024: Ziming Liu, Yixuan Wang, Sachin Vaidya, Fabian Ruehle, James Halverson, Marin Soljavci'c, Thomas Y. Hou, Max Tegmark https://arxiv.org/pdf/2404.19756v2

GPT Reviews
Cohere on Amazon

GPT Reviews

Play Episode Listen Later May 1, 2024 14:53


Cohere Command R & R+ now available on Amazon for enterprise-grade workloads and multilingual support. Big tech companies dominating AI lobbying efforts in Washington, potentially leading to weak regulations. Multi-token prediction proposed as a new way of training large language models, resulting in higher sample efficiency and faster inference. KANs, a new type of neural network with learnable activation functions on edges or weights, outperform MLPs in accuracy and interpretability, and can help scientists discover mathematical and physical laws. Contact:  sergi@earkind.com Timestamps: 00:34 Introduction 01:54 Cohere Command R & R+ now available on Amazon 03:25 There's an AI Lobbying Frenzy in Washington. Big Tech Is Dominating 05:22 THE 150X PGVECTOR SPEEDUP: A YEAR-IN-REVIEW 06:31 Fake sponsor 08:04 Better & Faster Large Language Models via Multi-token Prediction 09:55 KAN: Kolmogorov-Arnold Networks 11:51 Iterative Reasoning Preference Optimization 13:43 Outro

The Nonlinear Library
AF - Transcoders enable fine-grained interpretable circuit analysis for language models by Jacob Dunefsky

The Nonlinear Library

Play Episode Listen Later Apr 30, 2024 41:01


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Transcoders enable fine-grained interpretable circuit analysis for language models, published by Jacob Dunefsky on April 30, 2024 on The AI Alignment Forum. Summary We present a method for performing circuit analysis on language models using "transcoders," an occasionally-discussed variant of SAEs that provide an interpretable approximation to MLP sublayers' computations. Transcoders are exciting because they allow us not only to interpret the output of MLP sublayers but also to decompose the MLPs themselves into interpretable computations. In contrast, SAEs only allow us to interpret the output of MLP sublayers and not how they were computed. We demonstrate that transcoders achieve similar performance to SAEs (when measured via fidelity/sparsity metrics) and that the features learned by transcoders are interpretable. One of the strong points of transcoders is that they decompose the function of an MLP layer into sparse, independently-varying, and meaningful units (like neurons were originally intended to be before superposition was discovered). This significantly simplifies circuit analysis, and so for the first time, we present a method for using transcoders in circuit analysis in this way. We performed a set of case studies on GPT2-small that demonstrate that transcoders can be used to decompose circuits into monosemantic, interpretable units of computation. We provide code for training/running/evaluating transcoders and performing circuit analysis with transcoders, and code for the aforementioned case studies carried out using these tools. We also provide a suite of 12 trained transcoders, one for each layer of GPT2-small. All of the code can be found at https://github.com/jacobdunefsky/transcoder_circuits, and the transcoders can be found at https://huggingface.co/pchlenski/gpt2-transcoders. Work performed as a part of Neel Nanda's MATS 5.0 (Winter 2024) stream and MATS 5.1 extension. Jacob Dunefsky is currently receiving funding from the Long-Term Future Fund for this work. Background and motivation Mechanistic interpretability is fundamentally concerned with reverse-engineering models' computations into human-understandable parts. Much early mechanistic interpretability work (e.g. indirect object identification) has dealt with decomposing model computations into circuits involving small numbers of model components like attention heads or MLP sublayers. But these component-level circuits operate at too coarse a granularity: due to the relatively small number of components in a model, each individual component will inevitably be important to all sorts of computations, oftentimes playing different roles. In other words, components are polysemantic. Therefore, if we want a more faithful and more detailed understanding of the model, we should aim to find fine-grained circuits that decompose the model's computation onto the level of individual feature vectors. As a hypothetical example of the utility that feature-level circuits might provide in the very near-term: if we have a feature vector that seems to induce gender bias in the model, then understanding which circuits this feature vector partakes in (including which earlier-layer features cause it to activate and which later-layer features it activates) would better allow us to understand the side-effects of debiasing methods. More ambitiously, we hope that similar reasoning might apply to a feature that would seem to mediate deception in a future unaligned AI: a fuller understanding of feature-level circuits could help us understand whether this deception feature actually is responsible for the entirety of deception in a model, or help us understand the extent to which alignment methods remove the harmful behavior. Some of the earliest work on SAEs aimed to use them to find such feature-level circuits (e.g. Cunn...

The Nonlinear Library
LW - Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition by cmathw

The Nonlinear Library

Play Episode Listen Later Apr 9, 2024 31:22


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Gated Attention Blocks: Preliminary Progress toward Removing Attention Head Superposition, published by cmathw on April 9, 2024 on LessWrong. This work represents progress on removing attention head superposition. We are excited by this approach but acknowledge there are currently various limitations. In the short term, we will be working on adjacent problems are excited to collaborate with anyone thinking about similar things! Produced as part of the ML Alignment & Theory Scholars Program - Summer 2023 Cohort Summary: In transformer language models, attention head superposition makes it difficult to study the function of individual attention heads in isolation. We study a particular kind of attention head superposition that involves constructive and destructive interference between the outputs of different attention heads. We propose a novel architecture - a 'gated attention block' - which resolves this kind of attention head superposition in toy models. In future, we hope this architecture may be useful for studying more natural forms of attention head superposition in large language models. Our code can be found here. Background Mechanistic interpretability aims to reverse-engineer what neural networks have learned by decomposing a network's functions into human-interpretable algorithms. This involves isolating the individual components within the network that implement particular behaviours. This has proven difficult, however, because networks make use of polysemanticity and superposition to represent information. Polysemanticity in a transformer's multi-layer perceptron (MLPs) layers is when neurons appear to represent many unrelated concepts (Gurnee et al., 2023). We also see this phenomena within the transformer's attention mechanism, when a given attention head performs qualitatively different functions based on its destination token and context (Janiak et al., 2023). Superposition occurs when a layer in a network (an 'activation space') represents more features than it has dimensions. This means that features are assigned to an overcomplete set of directions as opposed to being aligned with e.g. the neuron basis. The presence of polysemanticity means that the function of a single neuron or attention head cannot be defined by the features or behaviours it expresses on a subset of its training distribution because it may serve different purposes on different subsets of the training distribution. Relatedly, superposition makes it misleading to study the function of individual neurons or attention heads in isolation from other neurons or heads. Both of these phenomena promote caution around assigning specific behaviours to individual network components (neurons or attention heads), due to there both being a diversity in behaviours across a training distribution and in their interaction with other components in the network. Although polysemanticity and superposition make the isolated components of a network less immediately interpretable, understanding of the correct functional units of analysis has improved. Progress has been made on both understanding features as directions within an activation space (Elhage et al., 2023) and resolving feature superposition by applying sparse autoencoders to identify highly-interpretable features (Sharkey et al., 2022; Cunningham et al., 2023; Bricken et al., 2023). Attention head superposition for OV-Incoherent Skip Trigrams Superposition in the context of attention heads is less understood. It is however conceivable that an attention block could make use of a similar compression scheme to implement more behaviours than the number of attention heads in the block. Prior work introduced a task to study attention head superposition in the form of OV-Incoherent Skip Trigrams (Jermyn et al., 2023; Conerly et al., 2023). These are s...

Punk & Oi! Worldwide
Punk & Oi! Worldwide Top Releases of 2023 part 1

Punk & Oi! Worldwide

Play Episode Listen Later Jan 27, 2024 73:13


POWW Top Releases of 2023 part 1: This episode features my list of best LPs and MLPs of 2023. I cover several great releases of 2023 and run down my list of the best. I also play ten tracks from the list as I go through the countdown.

Super-Spiked Podcast
Super-Spiked Videopods (EP35): Growth, Returns, and Sub-Sector Themes

Super-Spiked Podcast

Play Episode Listen Later Jan 20, 2024 23:19


WATCH the video on YouTube by clicking the RED button above.LISTEN to audio only via the Substack player by clicking the BLUEbutton above.STREAM audio only on Apple Podcasts, Spotify, or your favorite podcast player app.DOWNLOAD a pdf of the slide deck by clicking the blue Download button below.Our first two written posts of 2024 focused on the Big Themes and Tactical Questions we see for the traditional energy sector. In this video we bring those together with an expanded discussion on a number of sub-sectors including the international oil companies (IOCs) & Canadian “Big-4” oils, US “Big-3” downstream, US/Canada midstream (includes pipelines and MLPs), and gassy E&Ps. Traditional energy exhibits a massive and diverse set of opportunities and one of our 2024 aims is to provide our perspectives on where differentiated opportunities exist. The world has now recovered from the deep COVID trough. The recovery trade in traditional energy ended in 2022. Balance sheets are fixed and profitability structurally improved (versus last decade). The challenge now for individual companies across the various sub-sectors is to articulate and demonstrate a differentiated approach to meeting the world's massive unmet energy needs through a strategy that is both profitable and durable, or, recognizes a lack of durability by liquidating, selling, or otherwise distributing essentially all cash back to investors. If this week's video is not enough for you, Arjun also appeared on Lykeion's (Geopolitics of Commodities) podcast hosted by Scott Smitson. The 55-minute discussion (link) covered global energy, Europe's energy polices, under appreciated aspects of the energy transition, the role of government in energy policy, near-term geopolitical risk and spare capacity, and more.Arjun also joined Tom Loughery and Reed Barrett of FLOW on a 54-minute webinar ( link, password S4Pz0+4v). Key topic items included our SuperVol framework, Tom's view on the “second-half” of shale, the role of early versus late stage private equity, exploration, Super Major/large-cap E&P vs SMID-cap E&P strategies, and what our “phasing-in profitable growth” theme really means. As always, we appreciate and look forward to your comments, critiques, and, if you wish, praise.

The Nonlinear Library
AF - Sparse Autoencoders Work on Attention Layer Outputs by Connor Kissane

The Nonlinear Library

Play Episode Listen Later Jan 16, 2024 32:06


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders Work on Attention Layer Outputs, published by Connor Kissane on January 16, 2024 on The AI Alignment Forum. This post is the result of a 2 week research sprint project during the training phase of Neel Nanda's MATS stream. Executive Summary We replicate Anthropic's MLP Sparse Autoencoder (SAE) paper on attention outputs and it works well: the SAEs learn sparse, interpretable features, which gives us insight into what attention layers learn. We study the second attention layer of a two layer language model (with MLPs). Specifically, rather than training our SAE on attn_output, we train our SAE on "hook_z" concatenated over all attention heads (aka the mixed values aka the attention outputs before a linear map - see notation here). This is valuable as we can see how much of each feature's weights come from each head, which we believe is a promising direction to investigate attention head superposition, although we only briefly explore that in this work. We open source our SAE, you can use it via this Colab notebook . Shallow Dives: We do a shallow investigation to interpret each of the first 50 features. We estimate 82% of non-dead features in our SAE are interpretable (24% of the SAE features are dead). See this feature interface to browse the first 50 features. Deep dives: To verify our SAEs have learned something real, we zoom in on individual features for much more detailed investigations: the "'board' is next by induction" feature, the local context feature of "in questions starting with 'Which'", and the more global context feature of "in texts about pets". We go beyond the techniques from the Anthropic paper, and investigate the circuits used to compute the features from earlier components, including analysing composition with an MLP0 SAE. We also investigate how the features are used downstream, and whether it's via MLP1 or the direct connection to the logits. Automation: We automatically detect and quantify a large "{token} is next by induction" feature family. This represents ~5% of the living features in the SAE. Though the specific automation technique won't generalize to other feature families, this is notable, as if there are many "one feature per vocab token" families like this, we may need impractically wide SAEs for larger models. Introduction In Anthropic's SAE paper, they find that training sparse autoencoders (SAEs) on a one layer model's MLP activations finds interpretable features, providing a path to breakdown these high dimensional activations into units that we can understand. In this post, we demonstrate that the same technique works on attention layer outputs and learns sparse, interpretable features! To see how interpretable our SAE is we perform shallow investigations of the first 50 features of our SAE (i.e. randomly chosen features). We found that 76% are not dead (i.e. activate on at least some inputs), and within the alive features we think 82% are interpretable. To get a feel for the features we find see our interactive visualizations of the first 50. Here's one example:[1] Shallow investigations are limited and may be misleading or illusory, so we then do some deep dives to more deeply understand multiple individual features including: "'board' is next, by induction" - one of many "{token} is next by induction" features "In questions starting with 'Which'" - a local context feature, which interestingly is computed by multiple heads "In pet context" - one of many high level context features Similar to the Anthropic paper's "Detailed Investigations", we understand when these features activate and how they affect downstream computation. However, we also go beyond Anthropic's techniques, and look into the upstream circuits by which these features are computed from earlier components. An attention layer (with frozen att...

The Nonlinear Library
AF - Mech Interp Challenge: January - Deciphering the Caesar Cipher Model by CallumMcDougall

The Nonlinear Library

Play Episode Listen Later Jan 1, 2024 4:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Challenge: January - Deciphering the Caesar Cipher Model, published by CallumMcDougall on January 1, 2024 on The AI Alignment Forum. I'm writing this post to discuss solutions to the November challenge, and present the challenge for this January. If you've not read the first post in this sequence, I'd recommend starting there - it outlines the purpose behind these challenges, and recommended prerequisite material. January Problem The problem for this month is interpreting a model which has been trained to classify a sequence according to the Caeser cipher shift value which was used to encode it. The sequences have been generated by taking English sentences containing only lowercase letters & punctuation, and choosing a random value X between 0 and 25 to rotate the letters (e.g. if the value was 3, then a becomes d, b becomes e, and so on, finishing with z becoming c). The model was trained using cross entropy loss to predict the shift value X for the text it's been fed, at every sequence position (so for a single sequence, the correct value will be the same at every sequence position, but since the model has bidirectional attention, it will find it easier to predict the value of X at later sequence positions). There are 3 different modes to the problem, to give you some more options! Each mode corresponds to a different dataset, but the same task & same model architecture. Easy mode In easy mode, the data was generated by: Choosing the 100 most frequent 3-letter words in the English Language (as approximated from a text file containing the book "Hitchhiker's Guide To The Galaxy") Choosing words from this len-100 list, with probabilities proportional to their frequency in the book Separating these words with spaces The model uses single-character tokenization. The vocabulary size is 27: each lowercase letter, plus whitespace. Medium mode This is identical to easy, the only difference is that the words are drawn from this len-100 list uniformly, rather than according to their true frequencies. Hard mode In hard mode, the data was generated from random slices of OpenWebText (i.e. natural language text from the internet). It was processed by converting all uppercase characters to lowercase, then removing all characters except for the 26 lowercase letters plus the ten characters "n .,:;?!'" (i.e. newline, space, and 8 common punctuation characters). In all 3 modes, the model's architecture is the same, and it was trained the same way. The model is attention only. It has 2 attention layers, with 2 heads per layer. It was trained with weight decay, and an Adam optimizer with linearly decaying learning rate. I don't expect this problem to be as difficult as some of the others in this sequence, however the presence of MLPs does provide a different kind of challenge. You can find more details on the Streamlit page, or this Colab notebook. Feel free to reach out if you have any questions! November Problem - Solutions The single attention head implements uniform attention to all previous tokens in the sequence. The OV matrix is essentially one-dimensional: it projects each token with value s onto su, where u is some vector in the residual stream learned by the model. The component of the residual stream in this direction then represents the cumulative mean (note, the cumulative mean rather than the cumulative sum, because attention is finite - for example, we expect the component to be the same after the sequences (1, 1, 2) and (1, 1, 2, 1, 1, 2) because net attention to each different token value will be the same). The model's "positive cumsum prediction direction" aligns closely with u, and vice-versa for the "negative cumsum prediction direction" - this allows the model to already get >50% accuracy before the MLP even comes into play. But without the MLP, the mod...

On Investing
Where's the Economy Headed: Soft Landing or Recession?

On Investing

Play Episode Listen Later Dec 22, 2023 38:05


In this episode, Kathy Jones and Liz Ann Sonders discuss their outlooks for 2024 in light of recent events and share what investors should be watching for in the next few weeks. Liz Ann also interviews Keith McCullough, founder and CEO of Hedgeye Risk Management. Keith explains the evolution of hedge funds and how he views the changing investment landscape. He emphasizes the importance of transparency and accountability in investing and explains his unique approach to financial markets, which he calls "quantimental." McCullough also shares his insights on market cycles and the current economic outlook, including the possibility of a recession. He discusses the Federal Reserve's response to recessions and the risks associated with market concentration. Finally, he highlights opportunities in the market and shares his favorite investment areas.On Investing is an original podcast from Charles Schwab. For more on the show, visit Schwab.com/OnInvesting.If you enjoy the show, please leave a rating or review on Apple Podcasts. Important DisclosuresThe information provided here is for general informational purposes only and should not be considered an individualized recommendation or personalized investment advice. The investment strategies mentioned here may not be suitable for everyone. Each investor needs to review an investment strategy for his or her own particular situation before making any investment decision. All expressions of opinion are subject to change without notice in reaction to shifting market conditions. Data contained herein from third-party providers is obtained from what are considered reliable sources. However, its accuracy, completeness, or reliability cannot be guaranteed. Examples provided are for illustrative purposes only and not intended to be reflective of results you can expect to achieve.Past performance is no guarantee of future results and the opinions presented cannot be viewed as an indicator of future performance.All corporate names and market data shown above are for illustrative purposes only and are not a recommendation, offer to sell, or a solicitation of an offer to buy any security. Supporting documentation for any claims or statistical information is available upon request. The comments, views, and opinions expressed in the presentation are those of the speakers and do not necessarily represent the views of Charles Schwab.Investing involves risk, including loss of principal.The policy analysis provided by the Charles Schwab & Co., Inc., does not constitute and should not be interpreted as an endorsement of any political party.International investments involve additional risks, which include differences in financial accounting standards, currency fluctuations, geopolitical risk, foreign taxes and regulations, and the potential for illiquid markets. Investing in emerging markets may accentuate these risks.Fixed income securities are subject to increased loss of principal during periods of rising interest rates. Fixed income investments are subject to various other risks including changes in credit quality, market valuations, liquidity, prepayments, early redemption, corporate events, tax ramifications, and other factors. Lower rated securities are subject to greater credit risk, default risk, and liquidity risk.Schwab does not recommend the use of technical analysis as a sole means of investment research.Small cap investments are subject to greater volatility than those in other asset categories. Investments in securities of MLPs involve risks that differ from an investment in common stock. MLPs are controlled by their general partners, which generally have conflicts of interest and limited fiduciary duties to the MLP, which may permit the general partner to favor its own interests over the MLPs.Digital currencies [such as bitcoin] are highly volatile and not backed by any central bank or government. Digital currencies lack many of the regulations and consumer protections that legal-tender currencies and regulated securities have. Due to the high level of risk, investors should view digital currencies as a purely speculative instrument.Diversification strategies do not ensure a profit and do not protect against losses in declining markets.Commodity-related products, including futures, carry a high level of risk and are not suitable for all investors. Commodity-related products may be extremely volatile, illiquid and can be significantly affected by underlying commodity prices, world events, import controls, worldwide competition, government regulations, and economic conditions, regardless of the length of time shares are held. Investments in commodity-related products may subject the fund to significantly greater volatility than investments in traditional securities and involve substantial risks, including risk of loss of a significant portion of their principal value. Commodity-related products are also subject to unique tax implications such as additional tax forms and potentially higher tax rates on certain ETFs.The information and content provided herein is general in nature and is for informational purposes only. It is not intended, and should not be construed, as a specific recommendation, individualized tax, legal, or investment advice. Tax laws are subject to change, either prospectively or retroactively. Where specific advice is necessary or appropriate, individuals should contact their own professional tax and investment advisors or other professionals (CPA, Financial Planner, Investment Manager) to help answer questions about specific situations or needs prior to taking any action based upon this information.Forecasts contained herein are for illustrative purposes only, may be based upon proprietary research and are developed through analysis of historical public data.(1223-30CE)

The College Investor Audio Show
The Basic Tax Guide For MLPs

The College Investor Audio Show

Play Episode Listen Later Nov 10, 2023 10:21


MLPs (or master limited partnerships) can be lucrative investments, but they can complicate your taxes and tax return. The post The Basic Tax Guide For MLPs appeared first on The College Investor.

The College Investor Audio Show
The Basic Tax Guide For MLPs

The College Investor Audio Show

Play Episode Listen Later Nov 10, 2023 10:21


MLPs (or master limited partnerships) can be lucrative investments, but they can complicate your taxes and tax return.

The Nonlinear Library
AF - Mech Interp Challenge: November - Deciphering the Cumulative Sum Model by TheMcDouglas

The Nonlinear Library

Play Episode Listen Later Nov 2, 2023 3:51


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mech Interp Challenge: November - Deciphering the Cumulative Sum Model, published by TheMcDouglas on November 2, 2023 on The AI Alignment Forum. I'm writing this post to discuss solutions to the October challenge, and present the challenge for this November. If you've not read the first post in this sequence , I'd recommend starting there - it outlines the purpose behind these challenges, and recommended prerequisite material. November Problem The problem for this month is interpreting a model which has been trained to classify the cumulative sum of a sequence. The model is fed sequences of integers, and is trained to classify the cumulative sum at a given sequence position. There are 3 possible classifications: 0 (if the cumsum is negative), 1 (if the cumsum is zero), 2 (if the cumsum is positive). For example, if the sequence is: Then the classifications would be: The model is not attention only . It has one attention layer with a single head, and one MLP layer. It does not have layernorm at the end of the model. It was trained with weight decay, and an Adam optimizer with linearly decaying learning rate. I don't expect this problem to be as difficult as some of the others in this sequence, however the presence of MLPs does provide a different kind of challenge. You can find more details on the Streamlit page . Feel free to reach out if you have any questions! October Problem - Solutions In the second half of the sequence, the attention heads perform the algorithm "attend back to (and copy) the first token which is larger than me". For example, in a sequence like: we would have the second 3 token attending back to the first 5 token (because it's the first one that's larger than itself), the second 5 attending back to 7, etc. The SEP token just attends to the smallest token. Some more refinements to this basic idea: The two attending heads split responsibilities across the vocabulary. Head 0.0 is the less important head; it deals with values in the range 28-37 (roughly). Head 0.1 deals with most other values. In subsequences x < y < z where the three numbers are close together, x will often attend to z rather than to y . So why isn't this an adversarial example, i.e. why does the model still correctly predict y follows x ? Answer - the OV circuit shows that when we attend to source token s , we also boost things slightly less thn s , and suppress things slightly more than s . So in the case of x < y < z , we have: Attention to y will boost y a lot, and suppress z a bit. Attention to z will boost z a lot, and boost y a bit. So even if z gets slightly more attention than y , it might still be the case that y gets predicted with higher probability. Sequences with large jumps are adversarial examples (because they're rare in the training data, which was randomly generated from choosing subsets without replacement). Best Submissions We received more submissions for this month's problem than any other in the history of the series, so thanks to everyone who attempted! The best solution to this problem was by Vlad K , who correctly identified the model's tendency to produce unexpected attention patterns when 3 numbers are close together, and figured out how the model manages to produce correct classifications anyway. Best of luck for this and future challenges! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library
AF - Sparse Autoencoders Find Highly Interpretable Directions in Language Models by Logan Riggs Smith

The Nonlinear Library

Play Episode Listen Later Sep 21, 2023 8:04


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Sparse Autoencoders Find Highly Interpretable Directions in Language Models, published by Logan Riggs Smith on September 21, 2023 on The AI Alignment Forum. This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability. Paper Overview Sparse Autoencoders & Superposition To reverse engineer a neural network, we'd like to first break it down into smaller units (features) that can be analysed in isolation. Using individual neurons as these units can be useful but neurons are often polysemantic, activating for several unrelated types of feature so just looking at neurons is insufficient. Also, for some types of network activations, like the residual stream of a transformer, there is little reason to expect features to align with the neuron basis so we don't even have a good place to start. Toy Models of Superposition investigates why polysemanticity might arise and hypothesise that it may result from models learning more distinct features than there are dimensions in the layer, taking advantage of the fact that features are sparse, each one only being active a small proportion of the time. This suggests that we may be able to recover the network's features by finding a set of directions in activation space such that each activation vector can be reconstructed from a sparse linear combinations of these directions. We attempt to reconstruct these hypothesised network features by training linear autoencoders on model activation vectors. We use a sparsity penalty on the embedding, and tied weights between the encoder and decoder, training the models on 10M to 50M activation vectors each. For more detail on the methods used, see the paper. Automatic Interpretation We use the same automatic interpretation technique that OpenAI used to interpret the neurons in GPT2 to analyse our features, as well as alternative methods of decomposition. This was demonstrated in a previous post but we now extend these results across the all 6 layers in Pythia-70M, showing a clear improvement over all baselines in all but the final layers. Case studies later in the paper suggest that the features are still meaningful in these later layers but that automatic interpretation struggles to perform well. IOI Feature Identification We are able to use less-than-rank one ablations to precisely edit activations to restore uncorrupted behaviour on the IOI task. With normal activation patching, patches occur at a module-wide level, while here we perform interventions of the form x'=¯x+∑i∈F(ci-¯ci)fi where ¯x is the embedding of the corrupted datapoint, F is the set of patched features, and ci and ¯ci are the activations of feature fi on the clean and corrupted datapoint respectively. We show that our features are able to better able to precisely reconstruct the data than other activation decomposition methods (like PCA), and moreover that the finegrainedness of our edits increases with dictionary sparsity. Unfortunately, as our autoencoders are not able to perfectly reconstruct the data, they have a positive minumum KL-divergence from the base model, while PCA does not. Dictionary Features are Highly Monosemantic & Causal (Left) Histogram of activations for a specific dictionary feature. The majority of activations are for apostrophe (in blue), where the y-axis the is number of datapoints that activate in that bin. (Right) Histogram of the drop in logits (ie how much the LLM predicts a spe...

SL Advisors Talks Energy
Pipelines Returning More Cash

SL Advisors Talks Energy

Play Episode Listen Later Sep 13, 2023 4:59


The old MLP model rarely saw stock buybacks. Traditionally the General Partner (GP) would sell assets to the MLP it controlled in a non-arms-length transaction. The MLP would issue equity and debt to pay for them. MLPs were sellers of their own units, not buyers. Today's midstream energy infrastructure sector has left that model behind. […]

SL Advisors Talks Energy
Fewer MLPs And American Exceptionalism

SL Advisors Talks Energy

Play Episode Listen Later Aug 23, 2023 5:39


The diminishing number of MLPs has started to draw attention from sell-side analysts. Morgan Stanley's Robert Kad wrote in his Midstream Weekly that consolidation was likely to, “impact active manager mandates that have been dedicated to the sector.” The shrinking pool of MLPs and its impact on MLP-dedicated funds has been a developing problem for […]

SL Advisors Talks Energy
An Uncontroversial MLP Merger

SL Advisors Talks Energy

Play Episode Listen Later Aug 20, 2023 5:20


Energy Transfer's (ET) acquisition of Crestwood (CEQP) highlights the shortcomings of the proposed merger of Oneok (OKE) with Magellan Midstream (MMP). Because ET and CEQP are both MLPs, combining the two entities doesn't constitute a taxable event for unitholders. This contrasts with OKE/MMP where MMP unitholders will face the recapture of deferred income tax on […]

The Nonlinear Library
LW - Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Jul 20, 2023 3:42


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla, published by Neel Nanda on July 20, 2023 on LessWrong. Cross-posting a paper from the Google DeepMind mech interp team, by: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik Informal TLDR We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla (70B) for converting the knowledge of a multiple choice question's answer into outputting the correct letter. These techniques basically continued to work, and nothing fundamentally broke at scale (though it was a massive infra pain!). We then tried to dig further into the semantics of the circuit - going beyond "these specific heads and layers matter and most don't" and trying to understand the learned algorithm, and which features were implemented This kind of tracked the feature "this is the nth item in the list" but was pretty messy. However, my personal guess is that this stuff is just pretty messy at all scales, and we can productively study how clean/messy this stuff is at smaller and more tractable scales. I now feel mildly more optimistic that focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops. It also seems super nice to get better at automatically finding these circuits, since this was a many month manual slog! See Tom's and my Twitter summaries for more. Note that I (Neel) am cross-posting this on behalf of the team, and neither a main research contributor nor main advisor for the project. Key Figures An overview of the weird kinds of heads found, like the "attend to B if it is correct" head! The losses under different mutations of the letters - experiments to track down exactly which features were used. Eg replacing the labels with random letters or numbers preserves the "nth item in the list" feature while shuffling ABCD lets us track the "line labelled B" feature The queries and keys of a crucial correct letter head - it's so linearly separable! We can near loss-lessly compress it to just 3 dimensions and interpret just those three dimensions. See an interactive 3D plot here Abstract Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of output nodes (attention heads and MLPs). We further study the correct letter category of attention heads aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an Nth item in an enumeration feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of correct letter heads on multiple choice question answering. ...

The Nonlinear Library
AF - Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla by Neel Nanda

The Nonlinear Library

Play Episode Listen Later Jul 20, 2023 3:42


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla, published by Neel Nanda on July 20, 2023 on The AI Alignment Forum. Cross-posting a paper from the Google DeepMind mech interp team, by: Tom Lieberum, Matthew Rahtz, János Kramár, Neel Nanda, Geoffrey Irving, Rohin Shah, Vladimir Mikulik Informal TLDR We tried standard mech interp techniques (direct logit attribution, activation patching, and staring at attention patterns) on an algorithmic circuit in Chinchilla (70B) for converting the knowledge of a multiple choice question's answer into outputting the correct letter. These techniques basically continued to work, and nothing fundamentally broke at scale (though it was a massive infra pain!). We then tried to dig further into the semantics of the circuit - going beyond "these specific heads and layers matter and most don't" and trying to understand the learned algorithm, and which features were implemented This kind of tracked the feature "this is the nth item in the list" but was pretty messy. However, my personal guess is that this stuff is just pretty messy at all scales, and we can productively study how clean/messy this stuff is at smaller and more tractable scales. I now feel mildly more optimistic that focusing on mech interp work on small models is just fine, and extremely worth it for the much faster feedback loops. It also seems super nice to get better at automatically finding these circuits, since this was a many month manual slog! See Tom's and my Twitter summaries for more. Note that I (Neel) am cross-posting this on behalf of the team, and neither a main research contributor nor main advisor for the project. Key Figures An overview of the weird kinds of heads found, like the "attend to B if it is correct" head! The losses under different mutations of the letters - experiments to track down exactly which features were used. Eg replacing the labels with random letters or numbers preserves the "nth item in the list" feature while shuffling ABCD lets us track the "line labelled B" feature The queries and keys of a crucial correct letter head - it's so linearly separable! We can near loss-lessly compress it to just 3 dimensions and interpret just those three dimensions. See an interactive 3D plot here Abstract Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis in the 70B Chinchilla model, aiming to test the scalability of circuit analysis. In particular, we study multiple-choice question answering, and investigate Chinchilla's capability to identify the correct answer label given knowledge of the correct answer text. We find that the existing techniques of logit attribution, attention pattern visualization, and activation patching naturally scale to Chinchilla, allowing us to identify and categorize a small set of 'output nodes' (attention heads and MLPs). We further study the correct letter' category of attention heads aiming to understand the semantics of their features, with mixed results. For normal multiple-choice question answers, we significantly compress the query, key and value subspaces of the head without loss of performance when operating on the answer labels for multiple-choice questions, and we show that the query and key subspaces represent an Nth item in an enumeration' feature to at least some extent. However, when we attempt to use this explanation to understand the heads' behaviour on a more general distribution including randomized answer labels, we find that it is only a partial explanation, suggesting there is more to learn about the operation of 'correct letter' heads on multiple choice q...

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0
MPT-7B and The Beginning of Context=Infinity — with Jonathan Frankle and Abhinav Venigalla of MosaicML

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Play Episode Listen Later May 20, 2023 66:43


We are excited to be the first podcast in the world to release an in-depth interview on the new SOTA in commercially licensed open source models - MosiacML MPT-7B!The Latent Space crew will be at the NYC Lux AI Summit next week, and have two meetups in June. As usual, all events are on the Community page! We are also inviting beta testers for the upcoming AI for Engineers course. See you soon!One of GPT3's biggest limitations is context length - you can only send it up to 4000 tokens (3k words, 6 pages) before it throws a hard error, requiring you to bring in LangChain and other retrieval techniques to process long documents and prompts. But MosaicML recently open sourced MPT-7B, the newest addition to their Foundation Series, with context length going up to 84,000 tokens (63k words, 126 pages):This transformer model, trained from scratch on 1 trillion tokens of text and code (compared to 300B for Pythia and OpenLLaMA, and 800B for StableLM), matches the quality of LLaMA-7B. It was trained on the MosaicML platform in 9.5 days on 440 GPUs with no human intervention, costing approximately $200,000. Unlike many open models, MPT-7B is licensed for commercial use and it's optimized for fast training and inference through FlashAttention and FasterTransformer.They also released 3 finetuned models starting from the base MPT-7B: * MPT-7B-Instruct: finetuned on dolly_hhrlhf, a dataset built on top of dolly-5k (see our Dolly episode for more details). * MPT-7B-Chat: finetuned on the ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets.* MPT-7B-StoryWriter-65k+: it was finetuned with a context length of 65k tokens on a filtered fiction subset of the books3 dataset. While 65k is the advertised size, the team has gotten up to 84k tokens in response when running on a single node A100-80GB GPUs. ALiBi is the dark magic that makes this possible. Turns out The Great Gatsby is only about 68k tokens, so the team used the model to create new epilogues for it!On top of the model checkpoints, the team also open-sourced the entire codebase for pretraining, finetuning, and evaluating MPT via their new MosaicML LLM Foundry. The table we showed above was created using LLM Foundry in-context-learning eval framework itself!In this episode, we chatted with the leads of MPT-7B at Mosaic: Jonathan Frankle, Chief Scientist, and Abhinav Venigalla, Research Scientist who spearheaded the MPT-7B training run. We talked about some of the innovations they've brought into the training process to remove the need for 2am on-call PagerDutys, why the LLM dataset mix is such an important yet dark art, and why some of the traditional multiple-choice benchmarks might not be very helpful for the type of technology we are building.Show Notes* Introducing MPT-7B* Cerebras* Lottery Ticket Hypothesis* Hazy Research* ALiBi* Flash Attention* FasterTransformer* List of naughty words for C4 https://twitter.com/code_star/status/1661386844250963972* What is Sparsity?* Hungry Hungry Hippos* BF16 FPp.s. yes, MPT-7B really is codenamed LLongboi!Timestamps* Introductions [00:00:00]* Intro to Mosaic [00:03:20]* Training and Creating the Models [00:05:45]* Data Choices and the Importance of Repetition [00:08:45]* The Central Question: What Mix of Data Sets Should You Use? [00:10:00]* Evaluation Challenges of LLMs [0:13:00]* Flash Attention [00:16:00]* Fine-tuning for Creativity [00:19:50]* Open Source Licenses and Ethical Considerations [00:23:00]* Training Stability Enhancement [00:25:15]* Data Readiness & Training Preparation [00:30:00]* Dynamic Real-time Model Evaluation [00:34:00]* Open Science for Affordable AI Research [00:36:00]* The Open Approach [00:40:15]* The Future of Mosaic [00:44:11]* Speed and Efficiency [00:48:01]* Trends and Transformers [00:54:00]* Lightning Round and Closing [1:00:55]TranscriptAlessio: [00:00:00] Hey everyone. Welcome to the Latent Space podcast. This is Alessio partner and CTO-in-Residence at Decibel Partners. I'm joined by my co-host, Swyx, writer and editor of Latent Space.Swyx: Hey, and today we have Jonathan and Abhi from Mosaic ML. Welcome to our studio.Jonathan: Guys thank you so much for having us. Thanks so much.Swyx: How's it feel?Jonathan: Honestly, I've been doing a lot of podcasts during the pandemic, and it has not been the same.Swyx: No, not the same actually. So you have on your bio that you're primarily based in Boston,Jonathan: New York. New York, yeah. My Twitter bio was a probability distribution over locations.Swyx: Exactly, exactly. So I DMd you because I was obviously very interested in MPT-7B and DMd you, I was like, for the 0.2% of the time that you're in San Francisco, can you come please come to a podcast studio and you're like, I'm there next week.Jonathan: Yeah, it worked out perfectly. Swyx: We're really lucky to have you, I'll read off a few intros that people should know about you and then you can fill in the blanks.So Jonathan, you did your BS and MS at Princeton in programming languages and then found your way into ML for your PhD at MiT where you made a real splash with the lottery ticket hypothesis in 2018, which people can check up on. I think you've done a few podcasts about it over the years, which has been highly influential, and we'll talk about sparse models at Mosaic. You have also had some side [00:01:30] quest. You taught programming for lawyers and you did some law and privacy stuff in, in DC and also did some cryptography stuff. Um, and you've been an assistant professor at Harvard before earning your PhD.Jonathan:  I've yet to start.Swyx: You, you yet to start. Okay. But you just got your PhD.Jonathan:. I technically just got my PhD. I was at Mosaic which delayed my defense by about two years. It was, I was at 99% done for two years. Got the job at Harvard, Mosaic started, and I had better things to do than write my dissertation for two years. Swyx: You know, you know, this is very out of order.Jonathan: Like, oh, completely out of order, completely backwards. Go talk to my advisor about that. He's also an advisor at Mosaic and has been from the beginning. And, you know, go talk to him about finishing on time.Swyx: Great, great, great. And just to fill it out, Abhi, you did your BS and MS and MIT, you were a researcher at Cerebras, and you're now a research scientist at Mosaic. Just before we go into Mosaic stuff, I'm actually very curious about Cereus and, uh, just that, that space in general. Um, what are they doing that people should know about?Abhinav: Yeah, absolutely. Um, I think the biggest thing about CEREUS is that they're really building, you know, kind of the NextGen computing platform beyond, like GPUs.Um, they're trying to build a system that uses an entire wafer, you know, rather than cutting up a wafer into smaller chips and trying to train a model on that entire system, or actually more recently on many such wafers. Um, so it's, and it's really extraordinary. I think it's like the first time ever that kind of wafer scale computing has ever really worked. And so it's a really exciting time to be there, trying to figure out how we can map ML workloads to work, um, on a much, much bigger chip.Swyx: And do you use like [00:03:00] a different programming language or framework to do that? Or is that like..Abhinav: Yeah, so I mean, things have changed a bit since I was there.I think, um, you can actually run just normal tensor flow and pie torch on there. Um, so they've built a kind of software stack that compiles it down. So it actually just kind of works naturally. But yeah.Jonathan : Compiled versions of Python is a hot topic at the moment with Mojo as well. Swyx: And then Mosaic, you, you spearheaded the MPT-7B effort.INTRO TO MOSAIC [00:03:20]Abhinav: Uh, yeah. Yeah, so it's kind of like, it's been maybe six months, 12 months in the making. We kind of started working on LMs sort of back in the summer of last year. Um, and then we came with this blog post where we kind of profiled a lot of LMs and saw, hey, the cost of training is actually a lot lower than what people might think.Um, and then since then, you know, being inspired by kind of, you know, meta's release, so the LLaMA models and lots of other open source work, we kind of started working towards, well, what if we were to release a really good kind of 7 billion parameter model? And that's what MPT is. Alessio:You know, we mentioned some of the podcasts you had done, Jonathan, I think in one of them you mentioned Mosaic was not planning on building a  model and releasing and obviously you eventually did. So what are some of the things that got you there that maybe obviously LLaMA you mentioned was an inspiration. You now have both the training and like inference products that you offer. Was this more of a research challenge in a way, uh, that you wanted to do?Or how did the idea come to be?Jonathan: I think there were a couple of things. So we still don't have a first class model. We're not an open AI where, you know, our businesses come to use our one great model. Our business is built around customers creating their own models. But at the end of the day, if customers are gonna create their own models, we have to have the tools to help them do that, and to have the tools to help them do that and know that they work we have to create our own models to start. We have to know that we can do something great if customers are gonna do something great. And one too many people may have challenged me on Twitter about the fact that, you know, mosaic claims all these amazing numbers, but, you know, I believe not to, you know, call out Ross Whiteman here, but, you know, I believe he said at some point, you know, show us the pudding.Um, and so Ross, you know, please let me know how the pudding tastes. But in all seriousness, like I think there is something, this is a demo in some sense. This is to say we did this in 9.5 days for a really reasonable cost, straight through 200, an intervention. 200 K. Yep. Um, you can do this too.Swyx: Uh, and just to reference the numbers that you're putting out, this is the, the last year you were making a lot of noise for trading GPT 3 under 450 K, which is your, your initial estimate.Um, and then it went down to a 100 K and stable diffusion 160 k going down to less than 50 K as well.Jonathan: So I will be careful about that 100 K number. That's certainly the challenge I've given Abhi to hit. Oh, I wouldn't make the promise that we've hit yet, but you know, it's certainly a target that we have.And I, you know, Abhi may kill me for saying this. I don't think it's crazy. TRAINING AND CREATING THE MODELS [00:05:45] Swyx: So we definitely want to get into like estimation math, right? Like what, what needs to happen for those big order magnitude changes to in, in infrastructure costs. But, uh, let's kind of stick to the MPT-7B story. Yeah. Tell us everything.Like you have, uh, three different models. One of them. State of the art essentially on context length. Let's talk about the process of training them, the, uh, the decisions that you made. Um, I can go into, you know, individual details, but I just wanna let you let you rip.Abhinav: Yeah, so I mean, I think, uh, we started off with the base model, which is kind of for all practical purposes, a recreation of LLaMA 7B.Um, so it's a 7 billion perimeter model trained on the trillion tokens. Um, and our goal was like, you know, we should do it efficiently. We should be able to do it like, kind of hands free so we don't have to babysit the runs as they're doing them. And it could be kind of a, a launching point for these fine tune models and those fine tune models, you know, on, on the one hand they're kind of really fun for the community, like the story writer model, which has like a 65,000 length context window and you can even kind of extrapolate beyond that. Um, but they're, they're also kind of just tr inspirations really. So you could kind of start with an MPT-7B base and then build your own custom, you know, downstream. If you want a long context code model, you could do that with our platform. If you wanted one that was for a particular language, you could do that too.But yeah, so we picked kind of the three variance chat and instruct and story writer just kind of like inspirations looking at what people were doing in the community today. Yeah. Alessio: And what's the beginning of the math to come up with? You know, how many tokens you wanna turn it on? How many parameters do you want in a bottle? 7 billion and 30 billion seem to be kind of like two of the magic numbers going around right now. Abhinav: Yeah, definitely. Definitely. Yeah, I think like there's sort of these scaling laws which kind of tell you how to best spend your training compute if that's all you cared about. So if you wanna spend $200,000 exactly in the most efficient way, there'd be a recipe for doing that.Um, and that we usually go by the Chinchilla laws. Now for these models, we actually didn't quite do that because we wanted to make sure that people could actually run these at home and that they [00:07:30] were good for inference. So we trained them kind of beyond those chinchilla points so that we're almost over-training them.I think there's like a joke going on online that they're like long boy and that that came up internally because we were training them for really, really long durations. So that 7B model, the chinchilla point might be 140 billion tokens. Instead, we trained a trillion, so almost seven times longer than you normally would.Swyx: So longboi was the code name. So is it, is it the trading method? Is it the scaling law that you're trying to coin or is it the code name for the 64 billion?Jonathan: Uh, 64. It was just an internal joke for the, for training on way more tokens than you would via chinchilla. Okay. Um, we can coin it long boy and it, it really stuck, but just to, you know, long boys filled with two ELs at the beginning.Yeah. Cause you know, we wanted the lLLaMA thing in there as well. Jonathan: Yeah, yeah, yeah. Our darn CEO we have to rein him in that guy, you know, you can't, yeah. I'm gonna take away his Twitter password at some point. Um, but you know, he had to let that one out publicly. And then I believe there was a YouTube video where someone happened to see it mentioned before the model came out and called it the Long G boy or something like that.Like, so you know, now it's out there in the world. It's out there. It's like Sydnee can't put it back inSwyx: There's a beautiful picture which I think Naveen tweeted out, which, um, shows a long boy on a whiteboard.Jonathan: That was the origin of Long Boy. In fact, the legs of the lLLaMA were the two Ls and the long boy.DATA CHOICES AND THE IMPORTANCE OF REPETITION [00:08:45]Swyx: Well, talk to me about your data choices, right? Like this is your passion project. Like what can you tell us about it?Jonathan: Yeah, I think Abhi wanted to kill me by the end for trying to use all the GPUs on data and none of them on actually training the model. Um, at the end of the day, We know that you need to train these models and [00:09:00] lots of data, but there are a bunch of things we don't know.Number one is what kinds of different data sources matter. The other is how much does repetition really matter? And really kind of repetition can be broken down into how much does quality versus quantity matter. Suppose I had the world's best 10 billion tokens of data. Would it be better to train on that a hundred times or better to train on a trillion tokens of low quality, fresh data?And obviously there's, there's a middle point in between. That's probably the sweet spot. But how do you even know what good quality data is? And. So, yeah, this is, nobody knows, and I think the more time I spent, we have a whole data team, so me and several other people, the more time that we spent on this, you know, I came away thinking, gosh, we know nothing.Gosh, if I were back in academia right now, I would definitely go and, you know, write a paper about this because I have no idea what's going on.Swyx: You would write a paper about it. I'm interested in such a paper. I haven't come across any that exists. Could you frame the central question of such a paper?THE CENTRAL QUESTION: WHAT MIX OF DATA SETS SHOULD YOU USE? [00:10:00]Jonathan: Yeah. The central question is what mix of data sets should you use? Okay. Actually I've, you know, you had mentioned my law school stuff. I went back to Georgetown Law where I used to teach, um, in the midst of creating this model, and I actually sat down with a class of law students and asked them, I gave them our exact data sets, our data mixes, um, like how many tokens we had, and I said, Create the best data set for your model.Knowing they knew nothing about large language models, they just know that data goes in and it's going to affect the behavior. Um, and I was like, create a mix and they basically covered all the different trade-offs. Um, you probably want a lot of English language [00:10:30] text to start with. You get that from the web, but do you want it to be multilingual?If so, you're gonna have a lot less English text. Maybe it'll be worse. Do you wanna have code in there? There are all these beliefs that code leads to models being better at logical reasoning, of which I've seen zero evidence. Rep. It's not, um, I mean, really made a great code model, but code models leading to better chain of thought reasoning on the part of language or code being in the training set leading to better chain of thought reasoning.People claim this all the time, but I've still never seen any real evidence beyond that. You know, one of the generations of the GPT three model started supposedly from Code Da Vinci. Yes. And so there's a belief that, you know, maybe that helped. But again, no evidence. You know, there's a belief that spending a lot of time on good sources like Wikipedia is good for the model.Again, no evidence. At the end of the day, we tried a bunch of different data mixes and the answer was that there are some that are better or worse than others. We did find that the pile, for example, was a really solid data mix, but you know, there were stronger data mixes by our evaluation metrics. And I'll get back to the evaluation question in a minute cuz that's a really important one.This data set called c4, which is what the original T five model was trained on, is weirdly good. And everybody, when I posted on this on Twitter, like Stella Beaterman from Luther mentioned this, I think someone else mentioned this as well. C4 does really well in the metrics and we have no idea why we de-duplicated it against our evaluation set.So it's not like it memorized the data, it is just one web scrape from 2019. If you actually look at the T five paper and see how it was pre-processed, it looks very silly. Mm-hmm. They removed anything that had the word JavaScript in it because they didn't want to get like no JavaScript [00:12:00] warnings. They removed anything with curly braces cuz they didn't wanna get JavaScript in it.They looked at this list of bad words, um, and removed anything that had those bad words. If you actually look at the list of bad words, words like gay are on that list. And so there's, you know, it is a very problematic, you know, list of words, but that was the cleaning that leads to a data set that seems to be unbeatable.So that to me says that we know nothing about data. We, in fact used a data set called mc four as well, which is they supposedly did the same pre-processing of C4 just on more web calls. The English portion is much worse than C4 for reasons that completely escape us. So in the midst of all that, Basically I set two criteria.One was I wanted to be at least as good as mc four English, like make sure that we're not making things actively worse. And mc four English is a nice step up over other stuff that's out there. And two was to go all in on diversity after that, making sure that we had some code, we had some scientific papers, we had Wikipedia, because people are gonna use this model for all sorts of different purposes.But I think the most important thing, and I'm guessing abhi had a million opinions on this, is you're only as good as your evaluation. And we don't know how to evaluate models for the kind of generation we ask them to do. So past a certain point, you have to kinda shrug and say, well, my evaluation's not even measuring what I care about.Mm-hmm. So let me just make reasonable choices. EVALUATION CHALLENGES OF LLMs [0:13:00]Swyx: So you're saying MMLU, big bench, that kind of stuff is not. Convincing for youJonathan: A lot of this stuff is you've got two kinds of tasks. Some of these are more of multiple choice style tasks where there is a right answer. Um, either you ask the model to spit out A, B, C, or D or you know, and if you're more [00:13:30] sophisticated, you look at the perplexity of each possible answer and pick the one that the model is most likely to generate.But we don't ask these models to do multiple choice questions. We ask them to do open-ended generation. There are also open-ended generation tasks like summarization. You compare using things like a blue score or a rouge score, which are known to be very bad ways of comparing text. At the end of the day, there are a lot of great summaries of a paper.There are a lot of great ways to do open form generation, and so humans are, to some extent, the gold standard. Humans are very expensive. It turns out we can't put them into our eval pipeline and just have the humans look at our model every, you know, 10 minutes? Not yet. Not yet. Maybe soon. Um, are you volunteering Abhi?Abhinav: I, I, I just know we have a great eval team who's, uh, who's helping us build new metrics. So if they're listening,Jonathan:  But it's, you know, evaluation of large language models is incredibly hard and I don't think any of these metrics really truly capture. What we expect from the models in practice.Swyx: Yeah. And we might draw wrong conclusions.There's been a debate recently about the emergence phenomenon, whether or not it's a mirage, right? I don't know if you guys have opinions about that process. Abhinav: Yeah, I think I've seen like this paper and all and all, even just kind of plots from different people where like, well maybe it's just a artifact of power, like log scaling or metrics or, you know, we're meshing accuracy, which is this a very like harsh zero one thing.Yeah. Rather than kind of something more continuous. But yeah, similar to what Jonathan was saying about evals. Like there there's one issue of like you just like our diversity of eval metrics, like when we put these models up, even like the chat ones, the instruct ones, people are using 'em for such a variety of tasks.There's just almost no way we get ahead of time, like measuring individual dimensions. And then also particularly like, you know, at the 7B scale, [00:15:00] um, these models still are not super great yet at the really hard tasks, like some of the hardest tasks in MMLU and stuff. So sometimes they're barely scoring like the above kind of random chance, you know, like on really, really hard tasks.So potentially as we. You know, aim for higher and higher quality models. Some of these things will be more useful to us. But we kind of had to develop MPT 7B kind of flying a little bit blind on, on what we knew it was coming out and just going off of like, you know, a small set of common sensor reasoning tasks.And of course, you know, just comparing, you know, those metrics versus other open source models. Alessio: I think fast training in inference was like one of the goals, right? So there's always the trade off between doing the hardest thing and like. Doing all the other things quickly.Abhinav: Yeah, absolutely. Yeah, I mean, I think like, you know, even at the 7B scale, you know, uh, people are trying to run these things on CPUs at home.You know, people are trying to port these to their phones, basically prioritizing the fact that the small scale would lead to our adoption. That was like a big, um, big thing going on. Alessio: Yeah. and you mentioned, um, flash attention and faster transformer as like two of the core things. Can you maybe explain some of the benefits and maybe why other models don't use it?FLASH ATTENTION [00:16:00]Abhinav: Yeah, absolutely. So flash attention is this basically faster implementation of full attention. Um, it's like a mathematical equivalent developed by like actually some of our collaborators, uh, at Stanford. Uh, the hazy research. Hazy research, yeah, exactly.Jonathan: What is, what, what, what's the name hazy research mean?Abhinav: I actually have no idea.Swyx: I have no clue. All these labs have fun names. I always like the stories behind them.Abhinav: Yeah, absolutely. We really, really liked flash attention. We, I think, had to integrate into repo even as [00:16:30] as early as September of last year. And it really just helps, you know, with training speed and also inference speed and we kind of bake that into model architecture.And this is kind of unique amongst all the other hugging face models you see out there. So ours actually, you can toggle between normal torch attention, which will work anywhere and flash attention, which will work on GPUs right out of the box. And that way I think you get almost like a 2x speed up at training time and somewhere between like 50% to a hundred percent speed up at inference time as well.So again, this is just like, we really, really wanted people to use these and like, feel like an improvement and we, we have the team to, to help deliver that. Swyx: Another part, um, of your choices was alibi position, encodings, which people are very interested in, maybe a lot of people just, uh, to sort of take in, in coatings as, as a given.But there's actually a lot of active research and honestly, it's a lot of, um, it's very opaque as well. Like people don't know how to evaluate encodings, including position encodings, but may, may, could you explain, um, alibi and, um, your choice?Abhinav: Yeah, for sure. The alibi and uh, kind of flash attention thing all kind of goes together in interesting ways.And even with training stability too. What alibi does really is that it eliminates the need to have positional embeddings in your model. Where previously, if you're a token position one, you have a particular embedding that you add, and you can't really go beyond your max position, which usually is like about 2000.With alibies, they get rid of that. Instead, just add a bias to the attention map itself. That's kind of like this slope. And if at inference time you wanna go much, much larger, they just kind of stretch that slope out to a longer, longer number of positions. And because the slope is kind of continuous and you can interpret it, it all works out now.Now one of [00:18:00] the, the funny things we found is like with flash attention, it saved so much memory and like improved performance so much that even as early as I kind of last year, like we were profiling models with, with very long context lines up to like, you know, the 65 k that you seen in release, we just never really got around to using it cuz we didn't really know what we might use it for.And also it's very hard to train stably. So we started experimenting with alibi integration, then we suddenly found that, oh wow, stability improves dramatically and now we can actually work together with alibi in a long context lens. That's how we got to like our story writer model where we can stably train these models out to very, very long context lenses and, and use them performantly.Jonathan: Yeah.Swyx: And it's also why you don't have a firm number. Most people now have a firm number on the context line. Now you're just like, eh, 65 to 85Abhinav: Oh yeah, there's, there's a, there's a big age to be 64 K or 65 k. 65 k plus.Swyx: Just do powers of twos. So 64 isn't, you know. Jonathan: Right, right. Yeah. Yeah. But we could, I mean, technically the context length is infinite.If you give me enough memory, um, you know, we can just keep going forever. We had a debate over what number to say is the longest that we could handle. We picked 84 cakes. It's the longest I expect people to see easily in practice. But, you know, we played around for even longer than that and I don't see why we couldn't go longer.Swyx: Yeah. Um, and so for those who haven't read the blog posts, you put the Great Gatsby in there and, uh, asked it to write an epilogue, which seemed pretty impressive.Jonathan: Yeah. There are a bunch of epilogues floating around internally at Mosaic. Yeah. That wasn't my favorite. I think we all have our own favorites.Yeah. But there are a bunch of really, really good ones. There was one where, you know, it's Gatsby's funeral and then Nick starts talking to Gatsby's Ghost, and Gatsby's father shows up and, you know, then he's [00:19:30] at the police station with Tom. It was very plot heavy, like this is what comes next. And a bunch of that were just very Fitzgerald-esque, like, you know, beautiful writing.Um, but it was cool to just see that Wow, the model seemed to actually be working with. You know, all this input. Yeah, yeah. Like it's, it's exciting. You can think of a lot of things you could do with that kind of context length.FINE-TUNING FOR CREATIVITY [00:19:50]Swyx: Is there a trick to fine tuning for a creative task rather than, um, factual task?Jonathan: I don't know what that is, but probably, yeah, I think, you know, the person, um, Alex who did this, he did fine tune the model explicitly on books. The goal was to try to get a model that was really a story writer. But, you know, beyond that, I'm not entirely sure. Actually, it's a great question. Well, no, I'll ask you back.How would you measure that? Swyx: Uh, God, human feedback is the solve to all things. Um, I think there is a labeling question, right? Uh, in computer vision, we had a really, really good episode with Robo Flow on the segment. Anything model where you, you actually start human feedback on like very, I think it's something like 0.5% of the, the overall, uh, final, uh, uh, labels that you had.But then you sort augment them and then you, you fully automate them, um, which I think could be applied to text. It seems intuitive and probably people like snorkel have already raised ahead on this stuff, but I just haven't seen this applied in the language domain yet.Jonathan: It, I mean there are a lot of things that seem like they make a lot of sense in machine learning that never work and a lot of things that make zero sense that seem to work.So, you know, I've given up trying to even predict. Yeah, yeah. Until I see the data or try it, I just kind shg my shoulders and you know, you hope for the best. Bring data or else, right? Yeah, [00:21:00] exactly. Yeah, yeah, yeah.Alessio: The fine tuning of books. Books three is like one of the big data sets and there was the whole.Twitter thing about trade comments and like, you know, you know, I used to be a community moderator@agenius.com and we've run into a lot of things is, well, if you're explaining lyrics, do you have the right to redistribute the lyrics? I know you ended up changing the license on the model from a commercial use Permitted.Swyx: Yeah let's let them. I'm not sure they did. Jonathan: So we flipped it for about a couple hours. Swyx: Um, okay. Can we, can we introduce the story from the start Just for people who are under the loop. Jonathan: Yeah. So I can tell the story very simply. So, you know, the book three data set does contain a lot of books. And it is, you know, as I discovered, um, it is a data set that provokes very strong feelings from a lot of folks.Um, that was one, one guy from one person in particular, in fact. Um, and that's about it. But it turns out one person who wants a lot of attention can, you know, get enough attention that we're talking about it now. And so we had a, we had a discussion internally after that conversation and we talked about flipping the license and, you know, very late at night I thought, you know, maybe it's a good thing to do.And decided, you know, actually probably better to just, you know, Stan Pat's license is still Apache too. And one of the conversations we had was kind of, we hadn't thought about this cuz we had our heads down, but the Hollywood writer Strike took place basically the moment we released the model. Mm-hmm.Um, we were releasing a model that could do AI generated creative content. And that is one of the big sticking points during the strike. Oh, the optics are not good. So the optics aren't good and that's not what we want to convey. This is really, this is a demo of the ability to do really long sequence lengths and.Boy, you know, [00:22:30] that's, that's not timing that we appreciated. And so we talked a lot internally that night about like, oh, we've had time to read the news. We've had time to take a breath. We don't really love this. Came to the conclusion that it's better to just leave it as it is now and learn the lesson for the future.But certainly that was one of my takeaways is this stuff, you know, there's a societal context around this that it's easy to forget when you're in the trenches just trying to get the model to train. And you know, in hindsight, you know, I might've gone with a different thing than a story writer. I might've gone with, you know, coder because we seem to have no problem putting programmers out of work with these models.Swyx: Oh yeah. Please, please, you know, take away this stuff from me.OPEN SOURCE LICENSES AND ETHICAL CONSIDERATIONS [00:23:00]Jonathan: Right. You know, so it's, I think, you know, really. The copyright concerns I leave to the lawyers. Um, that's really, if I learned one thing teaching at a law school, it was that I'm not a lawyer and all this stuff is a little complicated, especially open source licenses were not designed for this kind of world.They were designed for a world of forcing people to be more open, not forcing people to be more closed. And I think, you know, that was part of the impetus here, was to try to use licenses to make things more closed. Um, which is, I think, against the grain of the open source ethos. So that struck me as a little bit strange, but I think the most important part is, you know, we wanna be thoughtful and we wanna do the right thing.And in that case, you know, I hope with all that interesting licensing fund you saw, we're trying to be really thoughtful about this and it's hard. I learned a lot from that experience. Swyx: There's also, I think, an open question of fair use, right? Is training on words of fair use because you don't have a monopoly on words, but some certain arrangements of words you do.And who is to say how much is memorization by a model versus actually learning and internalizing and then. Sometimes happening to land at the right, the [00:24:00] same result.Jonathan: And if I've learned one lesson, I'm not gonna be the person to answer that question. Right, exactly. And so my position is, you know, we will try to make this stuff open and available.Yeah. And, you know, let the community make decisions about what they are or aren't comfortable using. Um, and at the end of the day, you know, it still strikes me as a little bit weird that someone is trying to use these open source licenses to, you know, to close the ecosystem and not to make things more open.That's very much against the ethos of why these licenses were created.Swyx: So the official mosaic position, I guess is like, before you use TC MPC 7B for anything commercial, check your own lawyers now trust our lawyers, not mosaic's lawyers.Jonathan: Yeah, okay. Yeah. I'm, you know, our lawyers are not your lawyers.Exactly. And, you know, make the best decision for yourself. We've tried to be respectful of the content creators and, you know, at the end of the day, This is complicated. And this is something that is a new law. It's a new law. It's a new law that hasn't been established yet. Um, but it's a place where we're gonna continue to try to do the right thing.Um, and it's, I think, one of the commenters, you know, I really appreciated this said, you know, well, they're trying to do the right thing, but nobody knows what the right thing is to even do, you know, the, I guess the, the most right thing would've been to literally not release a model at all. But I don't think that would've been the best thing for the community either.Swyx: Cool.Well, thanks. Well handled. Uh, we had to cover it, just causeJonathan:  Oh, yes, no worries. A big piece of news. It's been on my mind a lot.TRAINING STABILITY ENHANCEMENT [00:25:15]Swyx: Yeah. Yeah. Well, you've been very thoughtful about it. Okay. So a lot of these other ideas in terms of architecture, flash, attention, alibi, and the other data sets were contributions from the rest of the let's just call it open community of, of machine learning advancements. Uh, but Mosaic in [00:25:30] particular had some stability improvements to mitigate loss spikes, quote unquote, uh, which, uh, I, I took to mean, uh, your existing set of tools, uh, maybe we just co kind of covered that. I don't wanna sort of put words in your mouth, but when you say things like, uh, please enjoy my empty logbook.How much of an oversell is that? How much, you know, how much is that marketing versus how much is that reality?Abhinav: Oh yeah. That, that one's real. Yeah. It's like fully end-to-end. Um, and I think.Swyx: So maybe like what, what specific features of Mosaic malibu?Abhinav: Totally, totally. Yeah. I think I'll break it into two parts.One is like training stability, right? Knowing that your model's gonna basically get to the end of the training without loss spikes. Um, and I think, you know, at the 7B scale, you know, for some models like it ha it's not that big of a deal. As you train for longer and longer durations, we found that it's trickier and trickier to avoid these lost spikes.And so we actually spent a long time figuring out, you know, what can we do about our initialization, about our optimizers, about the architecture that basically prevents these lost spikes. And you know, even in our training run, if you zoom in, you'll see small intermittent spikes, but they recover within a few hundred steps.And so that's kind of the magical bit. Our line is one of defenses we recover from Las Vegas, like just naturally, right? Mm-hmm. Our line two defense was that we used determinism and basically really smart resumption strategies so that if something catastrophic happened, we can resume very quickly, like a few batches before.And apply some of these like, uh, interventions. So we had these kinds of preparations, like a plan B, but we didn't have to use them at all for MPT 7B training. So, that was kind of like a lucky break. And the third part of like basically getting all the way to the empty law book is having the right training infrastructure.[00:27:00]So this is basically what, like is, one of the big selling points of the platform is that when you try to train these models on hundreds of GPUs, not many people outside, you know, like deep industry research owners, but the GPUs fail like a lot. Um, I would say like almost once every thousand a 100 days.So for us on like a big 512 cluster every two days, basically the run will fail. Um, and this is either due to GPUs, like falling off the bus, like that's, that's a real error we see, or kind of networking failures or something like that. And so in those situations, what people have normally done is they'll have an on-call team that's just sitting round the clock, 24-7 on slack, once something goes wrong.And if then they'll basically like to try to inspect the cluster, take nodes out that are broken, restart it, and it's a huge pain. Like we ourselves did this for a few months. And as a result of that, because we're building such a platform, we basically step by step automated every single one of those processes.So now when a run fails, we have this automatic kind of watch talk that's watching. It'll basically stop the job. Test the nodes cord in anyone's that are broken and relaunch it. And because our software's all deterministic has fast resumption stuff, it just continues on gracefully. So within that log you can see sometimes I think maybe at like 2:00 AM or something, the run failed and within a few minutes it's back up and running and all of us are just sleeping peacefully.Jonathan: I do wanna say that was hard one. Mm-hmm. Um, certainly this is not how things were going, you know, many months ago, hardware failures we had on calls who were, you know, getting up at two in the morning to, you know, figure out which node had died for what reason, restart the job, have to cord the node. [00:28:30] Um, we were seeing catastrophic loss spikes really frequently, even at the 7B scale that we're just completely derailing runs.And so this was step by step just ratcheting our way there. As Abhi said, to the point where, Many models are training at the moment and I'm sitting here in the studio and not worrying one bit about whether the runs are gonna continue. Yeah. Swyx: I'm, I'm not so much of a data center hardware kind of guy, but isn't there existing software to do this for CPUs and like, what's different about this domain? Does this question make sense at all?Jonathan: Yeah, so when I think about, like, I think back to all the Google fault tolerance papers I read, you know, as an undergrad or grad student mm-hmm. About, you know, building distributed systems. A lot of it is that, you know, Each CPU is doing, say, an individual unit of work.You've got a database that's distributed across your cluster. You wanna make sure that one CPU failing can't, or one machine failing can't, you know, delete data. So you, you replicate it. You know, you have protocols like Paxos where you're literally, you've got state machines that are replicated with, you know, with leaders and backups and things like that.And in this case, you were performing one giant computation where you cannot afford to lose any node. If you lose a node, you lose model state. If you lose a node, you can't continue. It may be that, that in the future we actually, you know, create new versions of a lot of our distributed training libraries that do have backups and where data is replicated so that if you lose a node, you can detect what node you've lost and just continue training without having to stop the run, you know?Pull from a checkpoint. Yeah. Restart again on different hardware. But for now, we're certainly in a world where if anything dies, that's the end of the run and you have to go back and recover from it. [00:30:00]DATA READINESS & TRAINING PREPARATION [00:30:00]Abhinav: Yeah. Like I think a big part, a big word there is like synchronous data pluralism, right? So like, we're basically saying that on every step, every GP is gonna do some work.They're gonna stay in sync with each other and average their, their gradients and continue. Now that there are algorithmic techniques to get around this, like you could say, oh, if a GP dies, just forget about it. All the data that's gonna see, we'll just forget about it. We're not gonna train on it.But, we don't like to do that currently because, um, it makes us give up determinism, stuff like that. Maybe in the future, as you go to extreme scales, we'll start looking at some of those methods. But at the current time it's like, we want determinism. We wanted to have a run that we could perfectly replicate if we needed to.And it was, the goal is figure out how to run it on a big cluster without humans having to babysit it. Babysit it. Alessio: So as you mentioned, these models are kind of the starting point for a lot of your customers To start, you have a. Inference product. You have a training product. You previously had a composer product that is now kind of not rolled into, but you have like a super set of it, which is like the LLM foundry.How are you seeing that change, you know, like from the usual LOP stack and like how people train things before versus now they're starting from, you know, one of these MPT models and coming from there. Like worship teams think about as they come to you and start their journey.Jonathan: So I think there's a key distinction to make here, which is, you know, when you say starting from MPT models, you can mean two things.One is actually starting from one of our checkpoints, which I think very few of our customers are actually going to do, and one is starting from our configuration. You can look at our friends at Rep for that, where, you know, MPT was in progress when Refl [00:31:30] came to us and said, Hey, we need a 3 billion parameter model by next week on all of our data.We're like, well, here you go. This is what we're doing, and if it's good enough for us, um, hopefully it's good enough for you. And that's basically the message we wanna send to our customers. MPT is basically clearing a path all the way through where they know that they can come bring their data, they can use our training infrastructure, they can use all of our amazing orchestration and other tools that abhi just mentioned, for fault tolerance.They can use Composer, which is, you know, still at the heart of our stack. And then the l l M Foundry is really the specific model configuration. They can come in and they know that thing is gonna train well because we've already done it multiple times. Swyx: Let's dig in a little bit more on what should people have ready before they come talk to you? So data architecture, eval that they're looking, etc.Abhinav: Yeah, I, I mean, I think we'll accept customers at any kind of stage in their pipeline. You know, like I'd say science, there's archetypes of people who have built products around like some of these API companies and reach a stage or maturity level where it's like we want our own custom models now, either for the purpose of reducing cost, right?Like our inference services. Quite a bit cheaper than using APIs or because they want some kind of customization that you can't really get from the other API providers. I'd say the most important things to have before training a big model. You know, you wanna have good eval metrics, you know, some kind of score that you can track as you're training your models and scaling up, they can tell you you're progressing.And it's really funny, like a lot of times customers will be really excited about training the models, right? It's really fun to like launch shelves on hundreds of gfs, just all around. It's super fun. But then they'll be like, but wait, what are we gonna measure? Not just the training loss, right? I mean, it's gotta be more than that.[00:33:00]So eval metrics is like a, it's a good pre-req also, you know, your data, you know, either coming with your own pre-training or fine-tune data and having like a strategy to clean it or we can help clean it too. I think we're, we're building a lot of tooling around that. And I think once you have those two kinds of inputs and sort of the budget that you want, we can pretty much walk you through the rest of it, right?Like that's kind of what we do. Recently we helped build CR FM's model for biomedical language a while back. Jonathan: Um, we can. That's the center of research for foundation models. Abhi: Exactly, exactly.Jonathan: Spelling it out for people. Of course.Abhinav: No, absolutely. Yeah, yeah. No, you've done more of these than I have.Um, I think, uh, basically it's sort of, we can help you figure out what model I should train to scale up so that when I go for my big run company, your here run, it's, uh, it's predictable. You can feel confident that it's gonna work, and you'll kind of know what quality you're gonna get out before you have to spend like a few hundred thousand dollars.DYNAMIC REAL-TIME MODEL EVALUATION [00:34:00]Alessio: The rap Reza from rap was on the podcast last week and, uh, they had human eval and then that, uh, I'm Jon Eval, which is like vibe based. Jonathan: And I, I do think the vibe based eval cannot be, you know, underrated really at the, I mean, at the end of the day we, we did stop our models and do vibe checks and we did, as we monitor our models, one of our evals was we just had a bunch of prompts and we would watch the answers as the model trained and see if they changed cuz honestly, You know, I don't really believe in any of these eval metrics to capture what we care about.Mm-hmm. But when you ask it, uh, you know, I don't know. I think one of our prompts was to suggest games for a three-year-old and a seven-year-old. That would be fun to play. Like that was a lot more [00:34:30] valuable to me personally, to see how that answer evolved and changed over the course of training. So, you know, and human eval, just to clarify for folks, human human eval is an automated evaluation metric.There's no humans in it at all. There's no humans in it at all. It's really badly named. I got so confused the first time that someone brought that to me and I was like, no, we're not bringing humans in. It's like, no, it's, it's automated. They just called it a bad name and there's only a hundred cents on it or something.Abhinav: Yeah. Yeah. And, and it's for code specifically, right?Jonathan: Yeah. Yeah. It's very weird. It's a, it's a weird, confusing name that I hate, but you know, when other metrics are called hella swag, like, you know, you do it, just gotta roll with it at this point. Swyx: You're doing live evals now. So one, one of the tweets that I saw from you was that it is, uh, important that you do it paralyzed.Uh, maybe you kind of wanna explain, uh, what, what you guys did.Abhinav: Yeah, for sure. So with LLM Foundry, there's many pieces to it. There's obviously the core training piece, but there's also, you know, tools for evaluation of models. And we've kind of had one of the, I think it's like the, the fastest like evaluation framework.Um, basically it's multi GPU compatible. It runs with Composer, it can support really, really big models. So basically our framework runs so fast that even Azure models are training. We can run these metrics live during the training. So like if you have a dashboard like weights and biases, you kind of watch all these evil metrics.We have, like, 15 or 20 of them honestly, that we track during the run and add negligible overhead. So we can actually watch as our models go and feel confident. Like, it's not like we wait until the very last day to, to test if the models good or notJonathan: That's amazing. Yeah. I love that we've gotten this far into the conversation.We still haven't talked about efficiency and speed. Those are usually our two watch words at Mosaic, which is, you know, that's great. That says that we're [00:36:00] doing a lot of other cool stuff, but at the end of the day, um, you know, Cost comes first. If you can't afford it, it doesn't matter. And so, you know, getting things down cheap enough that, you know, we can monitor in real time, getting things down cheap enough that we can even do it in the first place.That's the basis for everything we do.OPEN SCIENCE FOR AFFORDABLE AI RESEARCH [00:36:00]Alessio: Do you think a lot of the questions that we have around, you know, what data sets we should use and things like that are just because training was so expensive before that, we just haven't run enough experiments to figure that out. And is that one of your goals is trying to make it cheaper so that we can actually get the answers?Jonathan: Yeah, that's a big part of my personal conviction for being here. I think I'm, I'm still in my heart, the second year grad student who was jealous of all his friends who had GPUs and he didn't, and I couldn't train any models except in my laptop. And that, I mean, the lottery ticket experiments began on my laptop that I had to beg for one K 80 so that I could run amist.And I'm still that person deep down in my heart. And I'm a believer that, you know, if we wanna do science and really understand these systems and understand how to make them work well, understand how they behave, understand what makes them safe and reliable. We need to make it cheap enough that we can actually do science, and science involves running dozens of experiments.When I finally, you know, cleaned out my g c s bucket from my PhD, I deleted a million model checkpoints. I'm not kidding. There were over a million model checkpoints. That is the kind of science we need, you know, that's just what it takes. In the same way that if you're in a biology lab, you don't just grow one cell and say like, eh, the drug seems to work on that cell.Like, there's a lot more science you have to do before you really know.Abhinav: Yeah. And I think one of the special things about Mosaic's kind of [00:37:30] position as well is that we have such, so many customers all trying to train models that basically we have the incentive to like to devote all these resources and time to do this science.Because when we learn which pieces actually work, which ones don't, we get to help many, many people, right? And so that kind of aggregation process I think is really important for us. I remember way back there was a paper about Google that basically would investigate batch sizes or something like that.And it was this paper that must have cost a few million dollars during all the experience. And it was just like, wow, what a, what a benefit to the whole community. Now, like now we all get to learn from that and we get, we get to save. We don't have to spend those millions of dollars anymore. So I think, um, kind of mosaical science, like the insights we get on, on data, on pre-screening architecture, on all these different things, um, that's why customers come to us.Swyx: Yeah, you guys did some really good stuff on PubMed, G B T as well. That's the first time I heard of you. Of you. And that's also published to the community.Abhinav: Yeah, that one was really fun. We were like, well, no one's really trained, like fully from scratch domain specific models before. Like, what if we just did a biomed one?Would it still work? And, uh, yeah, I'd be really excited. That did, um, we'll probably have some follow up soon, I think, later this summer.Jonathan: Yeah. Yes. Stay tuned on that. Um, but I, I will say just in general, it's a really important value for us to be open in some sense. We have no incentive not to be open. You know, we make our money off of helping people train better.There's no cost to us in sharing what we learn with the community. Cuz really at the end of the day, we make our money off of those custom models and great infrastructure and, and putting all the pieces together. That's honestly where the Mosaic name came from. Not off of like, oh, we've got, you know, this one cool secret trick [00:39:00] that we won't tell you, or, you know, closing up.I sometimes, you know, in the past couple weeks I've talked to my friends at places like Brain or, you know, what used to be Brain Now Google DeepMind. Oh, I R I P Brain. Yeah. R i p Brian. I spent a lot of time there and it was really a formative time for me. Um, so I miss it, but. You know, I kind of feel like we're one of the biggest open research labs left in industry, which is a very sad state of affairs because we're not very big.Um, but at least can you say how big the team is actually? Yeah. We were about 15 researchers, so we're, we're tiny compared to, you know, the huge army of researchers I remember at Brain or at fair, at Deep Mind back, you know, when I was there during their heydays. Um, you know, but everybody else is kind of, you know, closed up and isn't saying very much anymore.Yeah. And we're gonna keep talking and we're gonna keep sharing and, you know, we will try to be that vanguard to the best of our ability. We're very small and I, I can't promise we're gonna do what those labs used to do in terms of scale or quantity of research, but we will share what we learn and we will try to create resources for the community.Um, I, I dunno, I just, I believe in openness fundamentally. I'm an academic at heart and it's sad to me to watch that go away from a lot of the big labs. THE OPEN APPROACH [00:40:15]Alessio: We just had a live pod about the, you know, open AI snow mode, uh, post that came out and it was one of the first time I really dove into Laura and some of the this new technologies, like how are you thinking about what it's gonna take for like the open approach to really work?Obviously today, GPT four is still, you know, part of like that state-of-the-art model for a [00:40:30] lot of tasks. Do you think some of the innovation and kind of returning methods that we have today are enough if enough people like you guys are like running these, these research groups that are open? Or do you think we still need a step function improvement there?Jonathan: I think one important point here is the idea of coexistence. I think when you look at, I don't know who won Linux or Windows, the answer is yes. Microsoft bought GitHub and has a Windows subsystem for Linux. Linux runs a huge number of our servers and Microsoft is still a wildly profitable company.Probably the most successful tech company right now. So who won open source or closed source? Yes. Um, and I think that's a similar world that we're gonna be in here where, you know, it's gonna be different things for different purposes. I would not run Linux on my laptop personally cuz I like connecting to wifi and printing things.But I wouldn't run Windows on one of my surfers. And so I do think what we're seeing with a lot of our customers is, do they choose opening IR mosaic? Yes. There's a purpose for each of these. You have to send your data off to somebody else with open eyes models. That's a risk. GPT four is amazing and I would never promise someone that if they come to Mosaic, they're gonna get a GPT four quality model.That's way beyond our means and not what we're trying to do anyway. But there's also a whole world for, you know, domain specific models, context specific models that are really specialized, proprietary, trained on your own data that can do things that you could never do with one of these big models. You can customize in crazy ways like G B T four is not gonna hit 65 K context length for a very long time, cuz they've already trained that [00:42:00] model and you know, they haven't even released the 32 K version yet.So we can, you know, we can do things differently, you know, by being flexible. So I think the answer to all this is yes. But we can't see the open source ecosystem disappear. And that's the scariest thing for me. I hear a lot of talk in academia about, you know, whatever happened to that academic research on this field called information retrieval?Well, in 1999 it disappeared. Why? Because Google came along and who cares about information retrieval research when you know you have a Google Scale, you know, Web Scale database. So you know, there's a balance here. We need to have both. Swyx: I wanna applaud you, Elaine. We'll maybe edit it a little like crowd applause, uh, line.Cuz I, I think that, um, that is something that as a research community, as people interested in progress, we need to see these things instead of just, uh, seeing marketing papers from the advertising GPT 4.Jonathan: Yeah. I, I think I, you know, to get on my soapbox for 10 more seconds. Go ahead. When I talk to policymakers about, you know, the AI ecosystem, the usual fear that I bring up is, Innovation will slow because of lack of openness.I've been complaining about this for years and it's finally happened. Hmm. Why is Google sharing, you know, these papers? Why is Open AI sharing these papers? There are a lot of reasons. You know, I have my own beliefs, but it's not something we should take for granted that everybody's sharing the work that they do and it turns out well, I think we took it for granted for a while and now it's gone.I think it's gonna slow down the pace of progress. In a lot of cases, each of these labs has a bit of a monoculture and being able to pass ideas [00:43:30] back and forth was a lot of what kept, you know, scientific progress moving. So it's imperative not just, you know, for the open source community and for academia, but for the progress of technology.That we have a vibrant open source research community.THE FUTURE OF MOSAIC [00:44:11]Swyx: There's a preview of the ecosystem and commentary that we're, we're gonna do. But I wanna close out some stuff on Mosaic. You launched a bunch of stuff this month. A lot of stuff, uh, actually was, I was listening to you on Gradient descent, uh, and other podcasts we know and love.Uh, and you said you also said you were not gonna do inference and, and, and last week you were like, here's Mosaic ML inference. Oops. So maybe just a, at a high level, what was Mosaic ml and like, what is it growing into? Like how do you conceptualize this? Jonathan: Yeah, and I will say gradient, when graded dissent was recorded, we weren't doing inference and had no plans to do it.It took a little while for the podcast to get out. Um, in the meantime, basically, you know, one thing I've learned at a startup, and I'm sure abhi can comment on this as well, focus is the most important thing. We have done our best work when we've been focused on doing one thing really well and our worst work when we've tried to do lots of things.Yeah. So, We don't want to do inference, we don't want to have had to do inference. Um, and at the end of the day, our customers were begging us to do it because they wanted a good way to serve the models and they liked our ecosystem. And so in some sense, we got dragged into it kicking and screaming. We're very excited to have a product.We're going to put our best foot forward and make something really truly amazing. But there is, you know, that's something that we were reluctant to do. You know, our customers convinced us it would be good for our business. It's been wonderful for business and we are gonna put everything into this, but you know, back when grading dissent came out, I [00:45:00] was thinking like, or when we recorded it or focused, oh God, like focus is the most important thing.I've learned that the hard way multiple times that Mosaic, abhi can tell you like, you know, I've made a lot of mistakes on not focusing enough. Um, boy inference, that's a whole second thing, and a whole different animal from training. And at the end of the day, when we founded the company, our belief was that inference was relatively well served at that time.There were a lot of great inference companies out there. Um, training was not well served, especially efficient training. And we had something to add there. I think we've discovered that as the nature of the models have changed, the nature of what we had to add to inference changed a lot and there became an opportunity for us to contribute something.But that was not the plan. But now we do wanna be the place that people come when they wanna train these big, complex, difficult models and know that it's gonna go right the first time and they're gonna have something they can servee right away. Um, you know, really the rep example of, you know, with 10 days to go saying, Hey, can you please train that model?And, you know, three or four days later the model was trained and we were just having fun doing interesting, fine tuning work in it for the rest of the 10 days, you know. That also requires good inference. Swyx: That's true, that's true. Like, so running evals and, and fine tuning. I'm just putting my business hat on and you know, and Alessio as well, like, uh, I've actually had fights with potential co-founders about this on the primary business.Almost like being training, right? Like essentially a one-time cost.Jonathan: Who told you it was a one time cost? What, who, who told you that?Swyx: No, no, no, no. Correct me. Jonathan: Yeah. Yeah. Let me correct you in two ways. Um, as our CEO Navine would say, if he were here, when you create version 1.0 of your software, do you then fire all the engineers?Of [00:46:30] course not. You never, like, MPT has a thousand different things we wanted to do that we never got to. So, you know, there will be future models.Abhinav: And, and the data that's been trained on is also changing over time too, right? If you wanna ask anything about, I guess like May of 2023, we'll have to retrain it further and so on.Right? And I think this is especially true for customers who run like the kind of things that need to be up to date on world knowledge. So I, I think like, you know, the other thing I would say too is that, The malls we have today are certainly not the best malls we'll ever produce. Right. They're gonna get smaller, they're gonna get faster, they're gonna get cheaper, they're gonna get lower latency, they're gonna get higher quality.Right? And so you always want the next gen version of MPT and the one after that and one after that. There's a reason that even the GPT series goes three, four, and we know there's gonna be a five. Right? Um, so I I I also don't see as a, as a one-time cost.Jonathan: Yeah. Yeah. And I, if you wanna cite a stat on this, there are very, very

The Nonlinear Library
LW - Infinite-width MLPs as an "ensemble prior" by Vivek Hebbar

The Nonlinear Library

Play Episode Listen Later May 13, 2023 9:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Infinite-width MLPs as an "ensemble prior", published by Vivek Hebbar on May 12, 2023 on LessWrong. Summary: A simple toy model suggests that infinitely wide MLPs generalize in an "ensemble-ish" way which is exponentially less data-efficient than Solomonoff induction. It's probably fixable by different initializations and/or regularizations, so I note it here mostly as a mathematical curiosity / interesting prior. The analysis seems to be qualitatively consistent with empirical results on generalization vs width in small MLPs. Notes: The generalization behavior of these neural nets can be analyzed with the Neural Tangent Kernel, which is widely studied. This post is meant to probe the qualitative nature of this behavior through a toy model. I'm unsure whether my particular analysis exists elsewhere. The deficiency of the standard initialization at infinite width seems to be well-known and empirically supported in NTK-related literature, along with ways of fixing it. Core claims The standard initialization uses weights which are proportional to 1/√input_dimension. This has the effect of keeping the activations at roughly the same scale across layers. However, in the infinite width case, it ends up making the gradients in early layers infinitely smaller than those in the last layer. Hence, training an infinite-width MLP is equivalent to running a regression using the features represented by the last-layer neurons at initialization. These features never change during training, since the early gradients are all zero. If we train without regularization, we will tend to get something very "ensemble-ish", "smooth", and "dumb". I will first summarize this claim in a table, then spend the rest of the post going through the reasoning behind it. Solomonoff InductionInfinite width MLP, low L2-norm solutionBayesian update over programsLinear regression over circuitsPuts most of its weight on a small number of programs, each of which perfectly fits the data on its ownSpreads weight over a broad ensemble, including circuits which have only a small correlation with truthThe amount of data required to make the correct program dominate is O(K), where K is the program lengthThe amount of data to make the correct circuit dominate is O(2C), where C is some "complexity measure" (defined later). This is exponentially less data-efficient than Solomonoff induction.Calling it "superintelligent" is an understatementGeneralizes poorly on many tasksHighly amenable to "sharp" solutionsFavors smooth solutions, only creates "sharp" solutions if certain conditions are met by the training data. If we train an infinitely wide MLP from the standard initialization, only the last layer's weights change. So it is equivalent to a linear regression over an infinite set of random "features", these features being the activation patterns of the last layer neurons at initialization. If the MLP is deep enough, some of these last-layer neurons are contain the output of very intelligent circuits. However, if we train our infinite width MLP, these intelligent circuits will hardly be used by the regression, even if they are very useful. That is, the sum of the weights drawing from them in the last layer will be very small. The reason I believe this is the toy model in the next section. Toy model Let's call each last-layer neuron a "feature". As discussed earlier, their behavior never changes due to how the gradients pan out at infinite width. In a "real" infinite network, these features will be "useful" and "intelligent" to various degrees, but we will simplify this greatly in the toy model, by using just two types of features. The toy model asks: "Suppose that some features already compute the correct answer for every training datapoint, and that the rest of the features are random garbage. Will the trained network...

The Nonlinear Library
LW - An Analogy for Understanding Transformers by TheMcDouglas

The Nonlinear Library

Play Episode Listen Later May 13, 2023 16:57


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An Analogy for Understanding Transformers, published by TheMcDouglas on May 13, 2023 on LessWrong. Thanks to the following people for feedback: Tilman Rauker, Curt Tigges, Rudolf Laine, Logan Smith, Arthur Conmy, Joseph Bloom, Rusheb Shah, James Dao. TL;DR I present an analogy for the transformer architecture: each vector in the residual stream is a person standing in a line, who is holding a token, and trying to guess what token the person in front of them is holding. Attention heads represent questions that people in this line can ask to everyone standing behind them (queries are the questions, keys determine who answers the questions, values determine what information gets passed back to the original question-asker), and MLPs represent the internal processing done by each person in the line. I claim this is a useful way to intuitively understand the transformer architecture, and I'll present several reasons for this (as well as ways induction heads and indirect object identification can be understood in these terms). Introduction In this post, I'm going to present an analogy for understanding how transformers work. I expect this to be useful for anyone who understands the basics of transformers, in particular people who have gone through Neel Nanda's tutorial, and/or understand the following points at a minimum: What a transformer's input is, what its outputs represent, and the nature of the predict-next-token task that it's trained on What the shape of the residual stream is, and the idea of components of the transformer reading from / writing to the residual stream throughout the model's layers How a transformer is composed of multiple blocks, each one containing an MLP (which does processing on vectors at individual sequence positions), and an attention layer (which moves information between the residual stream vectors at different sequence positions). I think the analogy still offers value even for people who understand transformers deeply already. The Analogy A line is formed by a group of people, each person holding a word. Everyone knows their own word and position in the line, but they can't see anyone else in the line. The objective for each person is to guess the word held by the person in front of them. People have the ability to shout questions to everyone standing behind them in the line (those in front cannot hear them). Upon hearing a question, each individual can choose whether or not to respond, and what information to relay back to the person who asked. After this, people don't remember the questions they were asked (so no information can move backwards in the line, only forwards). As individuals in the line gather information from these exchanges, they can use this information to formulate subsequent questions and provide answers. How this relates to transformer architecture: Each person in the line is a vector in the residual stream They start with just information about their own word (token embedding) and position in the line (positional embedding) The attention heads correspond to the questions that people in the line ask each other: Queries = question (which gets asked to everyone behind them in the line) Keys = how the people who hear the question decide whether or not to reply Values = the information that the people who reply pass back to the person who originally asked the question People can use information gained from earlier questions when answering / asking later questions - this is composition The MLPs correspond to the information processing / factual recall performed by each person in the sequence independently The unembedding at the end of the model is when we ask each person in the line for a final guess at what the next word is (in the form of a probability distribution over all possible words) Key Concepts for Transformers In...

The Nonlinear Library
AF - Infinite-width MLPs as an "ensemble prior" by Vivek Hebbar

The Nonlinear Library

Play Episode Listen Later May 12, 2023 9:30


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Infinite-width MLPs as an "ensemble prior", published by Vivek Hebbar on May 12, 2023 on The AI Alignment Forum. Summary: A simple toy model suggests that infinitely wide MLPs generalize in an "ensemble-ish" way which is exponentially less data-efficient than Solomonoff induction. It's probably fixable by different initializations and/or regularizations, so I note it here mostly as a mathematical curiosity / interesting prior. The analysis seems to be qualitatively consistent with empirical results on generalization vs width in small MLPs. Note: The deficiency of the standard initialization at infinite width seems to be well-known and empirically supported in NTK-related literature, along with ways of fixing it. I'm unsure whether my particular analysis exists elsewhere. Core claims The standard initialization uses weights which are proportional to 1/√input_dimension. This has the effect of keeping the activations at roughly the same scale across layers. However, in the infinite width case, it ends up making the gradients in early layers infinitely smaller than those in the last layer. Hence, training an infinite-width MLP is equivalent to running a regression using the features represented by the last-layer neurons at initialization. These features never change during training, since the early gradients are all zero. If we train without regularization, we will tend to get something very "ensemble-ish", "smooth", and "dumb". I will first summarize this claim in a table, then spend the rest of the post going through the reasoning behind it. Solomonoff InductionInfinite width MLP, low L2-norm solutionBayesian update over programsLinear regression over circuitsPuts most of its weight on a small number of programs, each of which perfectly fits the data on its ownSpreads weight over a broad ensemble, including circuits which have only a small correlation with truthThe amount of data required to make the correct program dominate is O(K), where K is the program lengthThe amount of data to make the correct circuit dominate is O(2C), where C is some "complexity measure" (defined later). This is exponentially less data-efficient than Solomonoff induction.Calling it "superintelligent" is an understatementGeneralizes poorly on many tasksHighly amenable to "sharp" solutionsFavors smooth solutions, only creates "sharp" solutions if certain conditions are met by the training data. If we train an infinitely wide MLP from the standard initialization, only the last layer's weights change. So it is equivalent to a linear regression over an infinite set of random "features", these features being the activation patterns of the last layer neurons at initialization. If the MLP is deep enough, some of these last-layer neurons are contain the output of very intelligent circuits. However, if we train our infinite width MLP, these intelligent circuits will hardly be used by the regression, even if they are very useful. That is, the sum of the weights drawing from them in the last layer will be very small. The reason I believe this is the toy model in the next section. Toy model Let's call each last-layer neuron a "feature". As discussed earlier, their behavior never changes due to how the gradients pan out at infinite width. In a "real" infinite network, these features will be "useful" and "intelligent" to various degrees, but we will simplify this greatly in the toy model, by using just two types of features. The toy model asks: "Suppose that some features already compute the correct answer for every training datapoint, and that the rest of the features are random garbage. Will the trained network rely more on the perfect features, or will it use some giant mixture of random features?" Suppose we have d items in the training set, denoted x1,..,xn. Each has a label of either −1 or 1. Let's say...

The Nonlinear Library
LW - A technical note on bilinear layers for interpretability by Lee Sharkey

The Nonlinear Library

Play Episode Listen Later May 9, 2023 1:35


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A technical note on bilinear layers for interpretability, published by Lee Sharkey on May 8, 2023 on LessWrong. Summary In this short theoretical note (now on Arxiv) I examine bilinear layers, which are MLP layers that take the form MLPBilinear(x)=(W1x)⊙(W2x). When used in language models, they perform better than standard MLPs with elementwise activation functions (but appear very slightly below state of the art). Despite their competitiveness, they are mathematically much easier to analyze: Although they are nonlinear functions of their input, bilinear layers can be expressed using only linear operations and third order tensors. Because they can be linearized, we can extend 'A Mathematical Framework for Transformer Circuits' (Elhage et al. 2022) beyond attention-only transformers to transformers with both attention and MLP layers. In a similar way to how the analysis of Elhage et al. (2022) helped to reveal QK- and OV-circuits, induction heads, and virtual attention heads, the analyzability of bilinear layers may lend them to deeper safety insights by allowing us to talk more formally about circuits in large language models. Additionally, and more speculatively, bilinear layers might offer an alternative path for mechanistic interpretability through understanding the mechanisms of feature construction instead of having to enumerate and understand a (potentially exponentially) large number of features in large models. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

The Nonlinear Library
AF - A technical note on bilinear layers for interpretability by Lee Sharkey

The Nonlinear Library

Play Episode Listen Later May 8, 2023 1:36


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: A technical note on bilinear layers for interpretability, published by Lee Sharkey on May 8, 2023 on The AI Alignment Forum. Summary In this short theoretical note (now on Arxiv) I examine bilinear layers, which are MLP layers that take the form MLPBilinear(x)=(W1x)⊙(W2x). When used in language models, they perform better than standard MLPs with elementwise activation functions (but appear very slightly below state of the art). Despite their competitiveness, they are mathematically much easier to analyze: Although they are nonlinear functions of their input, bilinear layers can be expressed using only linear operations and third order tensors. Because they can be linearized, we can extend 'A Mathematical Framework for Transformer Circuits' (Elhage et al. 2022) beyond attention-only transformers to transformers with both attention and MLP layers. In a similar way to how the analysis of Elhage et al. (2022) helped to reveal QK- and OV-circuits, induction heads, and virtual attention heads, the analyzability of bilinear layers may lend them to deeper safety insights by allowing us to talk more formally about circuits in large language models. Additionally, and more speculatively, bilinear layers might offer an alternative path for mechanistic interpretability through understanding the mechanisms of feature construction instead of having to enumerate and understand a (potentially exponentially) large number of features in large models. Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.

PH SPOTlight: Public health career stories, inspiration, and guidance from current-day public health heroes

In this episode, Sujani sits down with Gwyneth Eliasson, an assistant professor at the Rutgers School of Public Health. They discuss how public health and law intersect, Gwyneth's experiences in academia and teaching, and advice for anyone interested in health policy and these fields.You'll LearnHow Gwyneth found her way into public health from working in public interest law and consultingThe differences between public health law, healthcare law, and public health practice and what opportunities are available for those interested in these areasWhat a day in the life of Gwyneth looks like as a professor How the pandemic has affected Gwyneth's role as a professor and what changes she has seen in students' learningsGwyneth's teaching style and how she incorporates her own experiences and education in projects and assignmentsThe importance of good writing and clear communication in public healthWhat advice Gwyneth has for those interested in the intersection between law and public healthToday's GuestGwyneth M. Eliasson is an Assistant Professor of Health Systems and Policy in the Department of Health Behavior, Society, and Policy at the Rutgers School of Public Health (RSPH). Before joining the RSPH faculty, she was an Assistant Professor in the Department of Health Policy and Management at the School of Public Health - SUNY Downstate Health Sciences University. She received her JD from Brooklyn Law School and her MPH in Health Systems and Policy from RSPH. As a social justice attorney, she advocated for low-income New Yorkers facing systemic health inequities at administrative proceedings and in Federal courts. As a public health practitioner, she managed CDC-contracted projects with the Center for Public Health Law Research at Temple University Beasley School of Law and consulted for Rutgers School of Law on grant-funded projects to develop a medical-legal partnership (MLP) program in Camden, New Jersey. Her case study on MLPs for older adults is in HEALTHY AGING THROUGH THE SOCIAL DETERMINANTS OF HEALTH (APHA Press, 2021). ResourcesFollow Gwyneth on LinkedIn and Twitter Learn more about Camden's Medical-Legal Partnership Learn more about Temple University's Center for Public Health Law Research Learn more about CDC's Public Health Law Program Buy the book "Teaching Public Health Writing" by Jennifer Beard Listen to the previous episode about informational interviews with Shanna Shulman and the previous career tips for informational interviewsSupport the showJoin The Public Health Career Club: the #1 hangout spot and community dedicated to building and growing your dream public health career.

Chuck Yates Needs A Job
Daniel Herz | CEO of WhiteHawk Energy

Chuck Yates Needs A Job

Play Episode Listen Later Feb 15, 2023 103:04


Chuck and Daniel chat about his career which is basically a walk through the energy business over the past twenty-five years. Banking at BOA; MLPs, E&Ps at Atlas Operating; today buying natty minerals.

The Nonlinear Library
AF - Decision Transformer Interpretability by Joseph Bloom

The Nonlinear Library

Play Episode Listen Later Feb 6, 2023 35:55


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision Transformer Interpretability, published by Joseph Bloom on February 6, 2023 on The AI Alignment Forum. TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts. Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app. Key Claims A 1-Layer Decision Transformer learns several contextual behaviours which are activated by a combination of Reward-to-Go/Observation combinations on a simple discrete task. Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework. The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme (initially left out for simplicity of analysis) of the state/observations, which introduces inductive biases that hamper the model. If you are short on time, I recommend reading: Dynamic Obstacles Environment Black Box Model Characterisation Explaining Obstacle Avoidance at positive RTG using QK and OV circuits Alignment Relevance Future Directions I would welcome assistance with: Engineering tasks like app development, improving the model, training loop, wandb dashboard etc. and people who can help me make nice diagrams and write up the relevant maths/theory in the app). Research tasks. Think more about how to exactly construct/interpret circuit analysis in the context of decision transformers. Translate ideas from LLMs/algorithmic tasks. Communication tasks: Making nicer diagrams/explanations. I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate on related projects. Introduction For my ARENA Capstone project, I (Joseph) started working on decision transformer interpretability at the suggestion of Paul Colognese. Decision transformers can solve reinforcement learning tasks when conditioned on generating high rewards via the specified “Reward-to-Go” (RTG). However, they can also generate agents of varying quality based on the RTG, making them simultaneously simulators, small transformers and RL agents. As such, it seems possible that identifying and understanding circuits in decision transformers would not only be interesting as an extension of current mechanistic interpretability research but possibly lead to alignment-relevant insights. Previous Work The most important background for this post is: The Decision Transformers paper showed how RL tasks can be solved with transformer sequence modelling. Figure 1 from their paper describes the critical components of a Decision Transformer. A Mathematical Framework for Transformer Circuits that describes how to think about transformers in the context of mechanistic interpretability. Important ideas include the ability to decompose the residual stream into the output of attention heads and MLPs, the QK circuits (decides if to write information to the residual stream), and OV circuits (decides what to write to the residual stream). The Understanding RL Vision, which analyses how an RL agent with a large CNN component responds to input features, attributing them as good or bad news in the value function and proposes the Diversity hypothesis - “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” Methods Environment - RL Environm...

The Nonlinear Library
LW - Decision Transformer Interpretability by Joseph Bloom

The Nonlinear Library

Play Episode Listen Later Feb 6, 2023 35:54


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Decision Transformer Interpretability, published by Joseph Bloom on February 6, 2023 on LessWrong. TLDR: We analyse how a small Decision Transformer learns to simulate agents on a grid world task, providing evidence that it is possible to do circuit analysis on small models which simulate goal-directedness. We think Decision Transformers are worth exploring further and may provide opportunities to explore many alignment-relevant deep learning phenomena in game-like contexts. Link to the GitHub Repository. Link to the Analysis App. I highly recommend using the app if you have experience with mechanistic interpretability. All of the mechanistic analysis should be reproducible via the app. Key Claims A 1-Layer Decision Transformer learns several contextual behaviours which are activated by a combination of Reward-to-Go/Observation combinations on a simple discrete task. Some of these behaviours appear localisable to specific components and can be explained with simple attribution and the transformer circuits framework. The specific algorithm implemented is strongly affected by the lack of a one-hot-encoding scheme (initially left out for simplicity of analysis) of the state/observations, which introduces inductive biases that hamper the model. If you are short on time, I recommend reading: Dynamic Obstacles Environment Black Box Model Characterisation Explaining Obstacle Avoidance at positive RTG using QK and OV circuits Alignment Relevance Future Directions I would welcome assistance with: Engineering tasks like app development, improving the model, training loop, wandb dashboard etc. and people who can help me make nice diagrams and write up the relevant maths/theory in the app). Research tasks. Think more about how to exactly construct/interpret circuit analysis in the context of decision transformers. Translate ideas from LLMs/algorithmic tasks. Communication tasks: Making nicer diagrams/explanations. I have a Trello board with a huge number of tasks ranging from small stuff to massive stuff. I'm also happy to collaborate on related projects. Introduction For my ARENA Capstone project, I (Joseph) started working on decision transformer interpretability at the suggestion of Paul Colognese. Decision transformers can solve reinforcement learning tasks when conditioned on generating high rewards via the specified “Reward-to-Go” (RTG). However, they can also generate agents of varying quality based on the RTG, making them simultaneously simulators, small transformers and RL agents. As such, it seems possible that identifying and understanding circuits in decision transformers would not only be interesting as an extension of current mechanistic interpretability research but possibly lead to alignment-relevant insights. Previous Work The most important background for this post is: The Decision Transformers paper showed how RL tasks can be solved with transformer sequence modelling. Figure 1 from their paper describes the critical components of a Decision Transformer. A Mathematical Framework for Transformer Circuits that describes how to think about transformers in the context of mechanistic interpretability. Important ideas include the ability to decompose the residual stream into the output of attention heads and MLPs, the QK circuits (decides if to write information to the residual stream), and OV circuits (decides what to write to the residual stream). The Understanding RL Vision, which analyses how an RL agent with a large CNN component responds to input features, attributing them as good or bad news in the value function and proposes the Diversity hypothesis - “Interpretable features tend to arise (at a given level of abstraction) if and only if the training distribution is diverse enough (at that level of abstraction).” Methods Environment - RL Environments. GridWor...

The Nonlinear Library
LW - More findings on Memorization and double descent by Marius Hobbhahn

The Nonlinear Library

Play Episode Listen Later Feb 2, 2023 30:32


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More findings on Memorization and double descent, published by Marius Hobbhahn on February 1, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I'd like to thank Wes Gurnee, Aryan Bhatt, Eric Purdy and Stefan Heimersheim for discussions and Evan Hubinger, Neel Nanda, Adam Jermyn and Chris Olah for mentorship and feedback. The post contains a lot of figures, so the suggested length is deceiving. Code can be found in these three colab notebooks [1][2][3]. I have split the post into two parts. The first one is concerned with double descent and other general findings in memorization and the second focuses on measuring memorization using the maximum data dimensionality metric. This is the first post in a series of N posts on memorization in transformers. Executive summary I look at a variety of settings and experiments to better understand memorization in toy models. My primary motivation is to increase our general understanding of NNs but I also suspect that understanding memorization better might increase our ability to detect backdoors/trojans. The work heavily builds on two papers by Anthropic, “Toy models of superposition” and “Superposition, Memorization and double descent”. I successfully replicate a subset of their findings. I specifically look at three different setups of NNs that I speculate are most relevant to understanding memorization in the non-attention parts of transformers. Bottlenecks between layers, i.e. when projecting from high-dimensional spaces (e.g. MLPs) into lower dimensions (e.g. the residual stream). This is similar to the setting in the toy models of superposition paper and its sequel. MLP blocks, i.e. when projecting from lower-dimensional spaces (e.g. the residual stream) into higher dimensions with ReLU non-linearities. The final layer, i.e. when projecting from the end of the residual stream into the vocab space. The main difference to the previous scenarios is that we use the cross-entropy loss for the experiments which has a different inductive bias than the MSE loss. I'm able to find the double descent phenomenon in all three settings. My takeaway from this is that the transition between memorization and learning general features seems to be a very regular and predictable phenomenon (assuming you know the sparsity and number of features of your network). Furthermore, it seems like the network is “confused” (e.g. has much higher test loss) when it is right between memorization and generalization. I test the limits of reconstruction in different settings, i.e. the ability of the neural network to reconstruct its inputs given different dataset sizes, hidden sizes, number of features, importance distributions and sparsities. The findings mostly confirm what we would predict, e.g. more sparsity or larger hidden sizes lead to better reconstructions. A speculative claim is that if we had better measures of sparsity and importance in real-world models, we might be able to derive scaling laws that could tell us how many “concepts” a network has learned. Interpreting NNs that memorized in the simplest settings is extremely straightforward--the network literally creates a dictionary that you can just read off the weights. However, even small increases in complexity make this dictionary much harder to read and I have not yet found a method to decompile it into a human-readable form (maybe in the next posts). Isolated components In the following, we isolate three settings that seem like important components of memorization. They are supposed to model the non-attention parts of a transformer (primarily because I speculate that memorization mostly happens in the non-attention parts). Bottleneck By bottleneck we mean a situation in which a model projects from many into fewer dimensions, e.g. fro...

The Nonlinear Library
AF - More findings on Memorization and double descent by Marius Hobbhahn

The Nonlinear Library

Play Episode Listen Later Feb 1, 2023 30:33


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: More findings on Memorization and double descent, published by Marius Hobbhahn on February 1, 2023 on The AI Alignment Forum. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. I'd like to thank Wes Gurnee, Aryan Bhatt, Eric Purdy and Stefan Heimersheim for discussions and Evan Hubinger, Neel Nanda, Adam Jermyn and Chris Olah for mentorship and feedback. The post contains a lot of figures, so the suggested length is deceiving. Code can be found in these three colab notebooks [1][2][3]. I have split the post into two parts. The first one is concerned with double descent and other general findings in memorization and the second focuses on measuring memorization using the maximum data dimensionality metric. This is the first post in a series of N posts on memorization in transformers. Executive summary I look at a variety of settings and experiments to better understand memorization in toy models. My primary motivation is to increase our general understanding of NNs but I also suspect that understanding memorization better might increase our ability to detect backdoors/trojans. The work heavily builds on two papers by Anthropic, “Toy models of superposition” and “Superposition, Memorization and double descent”. I successfully replicate a subset of their findings. I specifically look at three different setups of NNs that I speculate are most relevant to understanding memorization in the non-attention parts of transformers. Bottlenecks between layers, i.e. when projecting from high-dimensional spaces (e.g. MLPs) into lower dimensions (e.g. the residual stream). This is similar to the setting in the toy models of superposition paper and its sequel. MLP blocks, i.e. when projecting from lower-dimensional spaces (e.g. the residual stream) into higher dimensions with ReLU non-linearities. The final layer, i.e. when projecting from the end of the residual stream into the vocab space. The main difference to the previous scenarios is that we use the cross-entropy loss for the experiments which has a different inductive bias than the MSE loss. I'm able to find the double descent phenomenon in all three settings. My takeaway from this is that the transition between memorization and learning general features seems to be a very regular and predictable phenomenon (assuming you know the sparsity and number of features of your network). Furthermore, it seems like the network is “confused” (e.g. has much higher test loss) when it is right between memorization and generalization. I test the limits of reconstruction in different settings, i.e. the ability of the neural network to reconstruct its inputs given different dataset sizes, hidden sizes, number of features, importance distributions and sparsities. The findings mostly confirm what we would predict, e.g. more sparsity or larger hidden sizes lead to better reconstructions. A speculative claim is that if we had better measures of sparsity and importance in real-world models, we might be able to derive scaling laws that could tell us how many “concepts” a network has learned. Interpreting NNs that memorized in the simplest settings is extremely straightforward--the network literally creates a dictionary that you can just read off the weights. However, even small increases in complexity make this dictionary much harder to read and I have not yet found a method to decompile it into a human-readable form (maybe in the next posts). Isolated components In the following, we isolate three settings that seem like important components of memorization. They are supposed to model the non-attention parts of a transformer (primarily because I speculate that memorization mostly happens in the non-attention parts). Bottleneck By bottleneck we mean a situation in which a model projects from many into fewer dimensi...

The Alternative Investment Podcast
MLP ETFs & Income Strategies, With Jay Hatfield (Episode 90)

The Alternative Investment Podcast

Play Episode Listen Later Jan 31, 2023 45:31


Master limited partnerships (MLPs) hold massive appeal for High Net Worth investors seeking tax-advantaged income. But it is better to invest into an individual MLP, or to seek more diversified exposure with a fund? Jay Hatfield, founder and CEO at Infrastructure Capital Advisors (InfraCap), joins the show to discuss how ETFs offer High Net Worth investors efficient and diversified access to the MLP asset class. Show notes: https://altsdb.com/2023/01/jay-hatfield-090/

The Nonlinear Library
LW - How-to Transformer Mechanistic Interpretability—in 50 lines of code or less! by StefanHex

The Nonlinear Library

Play Episode Listen Later Jan 25, 2023 24:55


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: How-to Transformer Mechanistic Interpretability—in 50 lines of code or less!, published by StefanHex on January 24, 2023 on LessWrong. Produced as part of the SERI ML Alignment Theory Scholars Program - Winter 2022 Cohort. What if I told you that in just one weekend you can get up to speed doing practical Mechanistic Interpretability research on Transformers? Surprised? Then this is your tutorial! I'll give you a view to how I research Transformer circuits in practice, show you the tools you need, and explain my thought process along the way. I focus on the practical side to get started with interventions; for more background see point 2 below. Prerequisites: Understanding the Transformer architecture: Know what the residual stream is, how attention layers and MLPs work, and how logits & predictions work. For future sections familiarity with multi-head attention is useful. Here's a link to Neel's glossary which provides excellent explanations for most terms I might use!If you're not familiar with Transformers you can check out Step 2 (6) on Neel's guide or any of the other explanations online, I recommend Jay Alammar's The Illustrated Transformer and/or Milan Straka's lecture series. Some overview of Mechanistic Interpretability is helpful: See e.g. any of Neel's talks, or look at the results in the IOI paper / walkthrough. Basic Python: Familiarity with arrays (as in NumPy or PyTorch, for indices) is useful; but explicitly no PyTorch knowledge required! No hardware required, free Google Colab account works fine for this. Here's a notebook with all the code from this tutorial! PS: Here's a little web page where you can run some of these methods online! No trivial inconveniences! Step 0: Setup Open a notebook (e.g. Colab) and install Neel Nanda's TransformerLens (formerly known as EasyTransformer). Step 1: Getting a model to play with That's it, now you've got a GPT2 model to play with! TransformerLens supports most relevant open source transformers. Here's how to run the language model Let's have a look at the internal activations: TransformerLens can give you a dictionary with almost all internal activations you ever care about (referred to as “cache”): Here you will find things like the attention pattern blocks.0.attn.hook_pattern, the residual stream before and after each layer blocks.1.hook_resid_pre, and more! You can also access all the weights & parameters of the model in model.named_parameters(). Here you will find weight matrices & biases of every MLP and Attention layer, as well as the embedding & unembedding matrices. I won't focus on these in this guide but they're great to look at! (Exercise: What can the unembedding biases unembed.b_U tell you about common tokens?) Step 2: Let's start analyzing a behavior! Let's go and find some induction heads! I'll make up an example: Her name was Alex Hart. When Alex, with likely completion Hart. TransformerLens has a little tool to plot a tokenized prompt, model predictions, and associated logits: I find it is useful to spend a few minutes thinking about which information is needed to solve the task: The model needs to Realize the last token, Alex, is a repetition of a previous occurrence The model needs to copy the last name from after the previous Alex occurrence to the last token as prediction Method 1: Residual stream patching The number 1 thing I try when I want to reverse engineer a new behavior is to find where in the network the information is “traveling”. In transformers, the model keeps track of all information in the residual stream. Attention heads & MLPs read from the residual stream, perform some computation or information moving, and write their outputs back into the residual stream. I think of this stream as having a couple of “lanes” corresponding to each token position. Over the course of the model...

Your Money's Worth
Investing for Income with Jeffrey Kosnett (Episode 144 rebroadcast)

Your Money's Worth

Play Episode Listen Later Jul 12, 2022 33:18


Steady income, moderate growth, manageable risk — those are the goals that Jeffrey Kosnett, the editor of Kiplinger's Investing for Income, has been pursuing for decades. Also, a quick look at holiday shopping strategies. Links mentioned in this episode: Kiplinger's Investing for Income https://store.kiplinger.com/kiplingers-investing-for-income.html Don't Overlook Preferred Stocks https://www.kiplinger.com/investing/etfs/603643/dont-overlook-preferred-stocks Stay Above the Interest Rate Fray https://www.kiplinger.com/investing/bonds/603317/stay-above-the-interest-rate-fray

Large Marge Sent Us
80s Cartoon Birthday Bonanza

Large Marge Sent Us

Play Episode Listen Later May 19, 2022 64:06


Happy 39th Birthday to Sweetie! In celebration, we watched a slew of 80s cartoons and discovered that we are JEM girls all the way! We watched My Little Pony, Jem and the Hollograms and She-ra Princess of Power and it felt like being wrapped in a big old blanket of nostalgia! We'll discuss all the fun action figures that went along with these cartoons (we had LOTS of MLPs) get confused about what the hell kind of world She-ra is living in, and try to pronouce the really weird names they cooked up for the My Little Pony baddies. Tireck? Scarpan? Who was in charge here?!