Professor Randall Balestriero joins us to discuss neural network geometry, spline theory, and emerging phenomena in deep learning, based on research presented at ICML. Topics include the delayed emergence of adversarial robustness in neural networks ("grokking"), geometric interpretations of neural networks via spline theory, and challenges in reconstruction learning. We also cover geometric analysis of Large Language Models (LLMs) for toxicity detection and the relationship between intrinsic dimensionality and model control in RLHF.

SPONSOR MESSAGES:
***
CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.
https://centml.ai/pricing/

Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. Are you interested in working on reasoning, or getting involved in their events? Go to https://tufalabs.ai/
***

Randall Balestriero
https://x.com/randall_balestr
https://randallbalestriero.github.io/

Show notes and transcript: https://www.dropbox.com/scl/fi/3lufge4upq5gy0ug75j4a/RANDALLSHOW.pdf?rlkey=nbemgpa0jhawt1e86rx7372e4&dl=0

TOC:
- Introduction
  - 00:00:00: Introduction
- Neural Network Geometry and Spline Theory
  - 00:01:41: Neural Network Geometry and Spline Theory
  - 00:07:41: Deep Networks Always Grok
  - 00:11:39: Grokking and Adversarial Robustness
  - 00:16:09: Double Descent and Catastrophic Forgetting
- Reconstruction Learning
  - 00:18:49: Reconstruction Learning
  - 00:24:15: Frequency Bias in Neural Networks
- Geometric Analysis of Neural Networks
  - 00:29:02: Geometric Analysis of Neural Networks
  - 00:34:41: Adversarial Examples and Region Concentration
- LLM Safety and Geometric Analysis
  - 00:40:05: LLM Safety and Geometric Analysis
  - 00:46:11: Toxicity Detection in LLMs
  - 00:52:24: Intrinsic Dimensionality and Model Control
  - 00:58:07: RLHF and High-Dimensional Spaces
- Conclusion
  - 01:02:13: Neural Tangent Kernel
  - 01:08:07: Conclusion

REFS:
[00:01:35] Humayun – Deep network geometry & input space partitioning
https://arxiv.org/html/2408.04809v1
[00:03:55] Balestriero & Baraniuk – Linking deep networks to adaptive spline operators
https://proceedings.mlr.press/v80/balestriero18b/balestriero18b.pdf
[00:13:55] Song et al. – Gradient-based white-box adversarial attacks
https://arxiv.org/abs/2012.14965
[00:16:05] Humayun, Balestriero & Baraniuk – Grokking phenomenon & emergent robustness
https://arxiv.org/abs/2402.15555
[00:18:25] Humayun – Training dynamics & double descent via linear region evolution
https://arxiv.org/abs/2310.12977
[00:20:15] Balestriero – Power diagram partitions in DNN decision boundaries
https://arxiv.org/abs/1905.08443
[00:23:00] Frankle & Carbin – Lottery Ticket Hypothesis for network pruning
https://arxiv.org/abs/1803.03635
[00:24:00] Belkin et al. – Double descent phenomenon in modern ML
https://arxiv.org/abs/1812.11118
[00:25:55] Balestriero et al. – Batch normalization's regularization effects
https://arxiv.org/pdf/2209.14778
[00:29:35] EU – EU AI Act 2024 with compute restrictions
https://www.lw.com/admin/upload/SiteAttachments/EU-AI-Act-Navigating-a-Brave-New-World.pdf
[00:39:30] Humayun, Balestriero & Baraniuk – SplineCam: Visualizing deep network geometry
https://openaccess.thecvf.com/content/CVPR2023/papers/Humayun_SplineCam_Exact_Visualization_and_Characterization_of_Deep_Network_Geometry_and_CVPR_2023_paper.pdf
[00:40:40] Carlini – Trade-offs between adversarial robustness and accuracy
https://arxiv.org/pdf/2407.20099
[00:44:55] Balestriero & LeCun – Limitations of reconstruction-based learning methods
https://openreview.net/forum?id=ez7w0Ss4g9
(truncated, see shownotes PDF)
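Background for the spline discussion above: a ReLU network is a continuous piecewise-affine function, so it partitions its input space into linear regions, one per ReLU activation pattern. Here is a rough sketch of counting those regions on a 2D grid with a toy, untrained MLP; the sizes are arbitrary, and this is not SplineCam's method (SplineCam, referenced in the notes, computes the partition exactly rather than sampling a grid).

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)   # layer 1 of a toy, untrained MLP
W2, b2 = rng.normal(size=(16, 16)), rng.normal(size=16)  # layer 2

# Sample a 2D grid of inputs and record each input's binary ReLU activation pattern.
xs = np.stack(np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200)), -1).reshape(-1, 2)
h1 = np.maximum(xs @ W1.T + b1, 0)
pattern = np.concatenate([h1 > 0, h1 @ W2.T + b2 > 0], axis=1)  # one code per input
region_id = np.unique(pattern, axis=0, return_inverse=True)[1]
print("distinct linear regions seen on the grid:", region_id.max() + 1)
```

Inputs sharing an activation pattern lie in the same linear region, so coloring the grid by `region_id` visualizes the input-space partition that the episode's geometric analyses study.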
RJJ Software's Software Development Service

This episode of The Modern .NET Show is supported, in part, by RJJ Software's Podcasting Services. Whether your company is looking to elevate its UK operations or reshape its US strategy, we can provide tailored solutions that exceed expectations.

Show Notes

"This should be a mantra for anyone working on web applications or web APIs: you should validate any input of the application as much as possible. When you conceptualize it, all the malicious input into your application comes across in an HTTP request at some point. It's either in the URL or in the body. That's generally speaking, unless you have a supply chain attack, that's generally the vector by which your application will be attacked. Before you even start to kind of think about specific vulnerabilities, if you narrow down what is permissible input right out of the gate, then you kind of close the door on so many different vulnerabilities without even thinking about it."
— Malcolm McDonald

Welcome friends to The Modern .NET Show; the premier .NET podcast, focusing entirely on the knowledge, tools, and frameworks that all .NET developers should have in their toolbox. We are the go-to podcast for .NET developers worldwide, and I am your host: Jamie "GaProgMan" Taylor.

In this episode, Malcolm McDonald joined us to talk about the importance of Application Security, knowing the basics of secure coding, and backing that all up with both "secure by default" designs, and subscribing to CVE notifications.

"So what you find in your career is that if you set down kind of sensible default behaviour in your codebase earlier on, then everyone will follow that. You establish, kind of, best practices. It might take a little bit longer, but if you say, "this is the way we do this over here," then even yourself when you, kind of, come back to it, will, kind of, reproduce that path. So the initial, kind of, like first few check-ins to your source control are key to, like, establishing best practice"
— Malcolm McDonald

One note before we begin: some of Malcolm's audio is a little rough. Mark, our superstar editor, has done what he can, but there are parts that might be a little unclear. If you have trouble with the audio, please remember that there is a full transcription linked in the accompanying show notes.

Anyway, without further ado, let's sit back, open up a terminal, type in `dotnet new podcast` and we'll dive into the core of Modern .NET.

Supporting the Show

If you find this episode useful in any way, please consider supporting the show by either leaving a review (check our review page for ways to do that), sharing the episode with a friend or colleague, buying the host a coffee, or considering becoming a Patron of the show.

Full Show Notes

The full show notes, including links to some of the things we discussed and a full transcription of this episode, can be found at: https://dotnetcore.show/season-7/grokking-application-security-malcolm-mcdonald-on-securing-your-apps/

Useful Links
- Malcolm on LinkedIn
- Grokking Application Security
- Hacksplaining
- Hash function
- Don't Accept The Defaults
- Bcrypt

Supporting the show:
- Leave a rating or review
- Buy the show a coffee
- Become a patron

Getting in Touch:
- Via the contact page
- Joining the Discord

Remember to rate and review the show on Apple Podcasts, Podchaser, or wherever you find your podcasts, this will help the show's audience grow. Or you can just share the show with a friend. And don't forget to reach out via our Contact page.
We're very interested in your opinion of the show, so please get in touch. You can support the show by making a monthly donation on the show's Patreon page at: https://www.patreon.com/TheDotNetCorePodcast. Music created by Mono Memory Music, licensed to RJJ Software for use in The Modern .NET Show
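A small sketch of the allow-list idea from Malcolm's quote above: define the narrowest permissible shape for each input and reject everything else before any other logic runs. This is illustrative only, not from the episode, and it is in Python rather than .NET purely for brevity; the pattern and names are invented.

```python
import re

# Allow-list, not block-list: describe exactly what a valid username looks like.
USERNAME_RE = re.compile(r"[a-z0-9_]{3,20}")

def parse_username(raw: str) -> str:
    """Return the value only if it is exactly a well-formed username."""
    if not USERNAME_RE.fullmatch(raw):
        raise ValueError("invalid username")  # reject early, at the request boundary
    return raw

print(parse_username("jamie_taylor"))          # accepted
# parse_username("jamie'; DROP TABLE users;")  # would raise ValueError
```

Narrowing permissible input at the boundary like this closes the door on whole classes of injection and parsing vulnerabilities before any vulnerability-specific defenses are even considered.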
In this episode of The Cognitive Revolution, Nathan interviews Andrew White, Professor of Chemical Engineering at the University of Rochester and Head of Science at Future House. We explore groundbreaking AI systems for scientific discovery, including PaperQA and Aviary, and discuss how large language models are transforming research. Join us for an insightful conversation about the intersection of AI and scientific advancement with this pioneering researcher in his first-ever podcast appearance. Check out Future House: https://www.futurehouse.org Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse SPONSORS: Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognitive Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive CHAPTERS: (00:00:00) Teaser (00:01:13) About the Episode (00:04:37) Andrew White's Journey (00:10:23) GPT-4 Red Team (00:15:33) GPT-4 & Chemistry (00:17:54) Sponsors: Oracle Cloud Infrastructure (OCI) | SelectQuote (00:20:19) Biology vs Physics (00:23:14) Conceptual Dark Matter (00:26:27) Future House Intro (00:30:42) Semi-Autonomous AI (00:35:39) Sponsors: Shopify (00:37:00) Lab Automation (00:39:46) In Silico Experiments (00:45:22) Cost of Experiments (00:51:30) Multi-Omic Models (00:54:54) Scale and Grokking (01:00:53) Future House Projects (01:10:42) Paper QA Insights (01:16:28) Generalizing to Other Domains (01:17:57) Using Figures Effectively (01:22:01) Need for Specialized Tools (01:24:23) Paper QA Cost & Latency (01:27:37) Aviary: Agents & Environments (01:31:42) Black Box Gradient Estimation (01:36:14) Open vs Closed Models (01:37:52) Improvement with Training (01:40:00) Runtime Choice & Q-Learning (01:43:43) Narrow vs General AI (01:48:22) Future Directions & Needs (01:53:22) Future House: What's Next? (01:55:32) Outro SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://www.linkedin.com/in/nathanlabenz/ Youtube: https://www.youtube.com/@CognitiveRevolutionPodcast Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431 Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
In this episode of The Cognitive Revolution, Nathan dives deep into the world of state space models with returning co-host Jason Meaux and special guest Quentin Anthony, Head of Model Training at Zyphra. Explore the cutting-edge Zamba2-7B model, which combines selective state space and attention mechanisms. Uncover practical insights on model training, architectural choices, and the challenges of scaling AI. From learning schedules to hybrid architectures, loss metrics to context length extension, this technical discussion covers it all. Don't miss this in-depth conversation on the future of personalized, on-device AI. Check out more about Zyphra and Jason Meaux here: Zyphra's website: https://www.zyphra.com Zamba2-7B Blog: https://www.zyphra.com/post/zamba2-7b Zamba2 GitHub: https://github.com/Zyphra/Zamba2 Tree attention: https://www.zyphra.com/post/tree-attention-topology-aware-decoding-for-long-context-attention-on-gpu-clusters Jason Meaux's Twitter: https://x.com/KamaraiCode Jason Meaux's website: https://www.statespace.info Be notified early when Turpentine drops a new publication: https://www.turpentine.co/exclusiveaccess SPONSORS: Weights & Biases RAG++: Advanced training for building production-ready RAG applications. Learn from experts to overcome LLM challenges, evaluate systematically, and integrate advanced features. Includes free Cohere credits. Visit https://wandb.me/cr to start the RAG++ course today. Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive Notion: Notion offers powerful workflow and automation templates, perfect for streamlining processes and laying the groundwork for AI-driven automation. With Notion AI, you can search across thousands of documents from various platforms, generating highly relevant analysis and content tailored just for you - try it for free at https://notion.com/cognitiverevolution LMNT: LMNT is a zero-sugar electrolyte drink mix that's redefining hydration and performance. Ideal for those who fast or anyone looking to optimize their electrolyte intake. Support the show and get a free sample pack with any purchase at https://drinklmnt.com/tcr CHAPTERS: (00:00:00) Teaser (00:00:42) About the Show (00:01:05) About the Episode (00:03:09) Introducing Zyphra (00:07:28) Personalization in AI (00:12:48) State Space Models & Efficiency (Part 1) (00:19:22) Sponsors: Weights & Biases RAG++ | Shopify (00:21:26) State Space Models & Efficiency (Part 2) (00:22:23) Dense Attention to Shared Attention (00:29:41) Zyphra's Early Bet on Mamba (Part 1) (00:33:18) Sponsors: Notion | LMNT (00:36:00) Zyphra's Early Bet on Mamba (Part 2) (00:37:22) Loss vs. Model Quality (00:44:53) Emergence & Grokking (00:50:06) Loss Landscapes & Convergence (00:56:55) Sophia, Distillation & Secrets (01:09:00) Competing with Big Tech (01:23:50) The Future of Model Training (01:30:02) Deep Dive into Zamba 1 (01:34:24) Zamba 2 and Mamba 2 (01:38:56) Context Extension & Memory (01:44:04) Sequence Parallelism (01:45:44) Zamba 2 Architecture (01:53:57) Mamba Attention Hybrids (02:00:00) Lock-in Effects (02:05:32) Mamba Hybrids in Robotics (02:07:07) Ease of Use & Compatibility (02:12:10) Tree Attention vs.
Ring Attention (02:22:02) Zyphra's Vision & Goals (02:23:57) Outro SOCIAL LINKS: Website: https://www.cognitiverevolution.ai Twitter (Podcast): https://x.com/cogrev_podcast Twitter (Nathan): https://x.com/labenz LinkedIn: https://www.linkedin.com/in/nathanlabenz/
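For context on the "state space" half of the hybrid architecture discussed above, here is a toy linear state-space recurrence. This is my own illustrative sketch with made-up sizes, not Zyphra's code; Zamba's actual blocks are selective, input-dependent, and far more elaborate.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Linear SSM recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:                 # O(T) time, O(1) state: why SSMs suit long contexts
        h = A @ h + B @ x_t
        ys.append(C @ h)
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, d_in, T = 8, 4, 32       # illustrative sizes
A = 0.9 * np.eye(d_state)         # stable decay on the hidden state
B = rng.normal(size=(d_state, d_in))
C = rng.normal(size=(1, d_state))
y = ssm_scan(rng.normal(size=(T, d_in)), A, B, C)
print(y.shape)                    # (32, 1): one output per timestep
```

Selective models such as Mamba make A and B functions of the current input, letting the state decide what to keep or forget, and hybrids like Zamba interleave such blocks with shared attention layers.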
NFTs & crypto, oh my; Google ruled a monopolist in search; you can't run an AI company ethically - with plenty of examples; Grokking without consent; Google's targeting "unknown" minors loophole; MetaAI celebrity voices; watermarking ChatGPT challenges; emotionally hooked on chatbots; Neuralink's second patient implant; Synchron's stent approach; Cybertruck spottings; Deadpool & Wolverine; American Conspiracy: the Octopus Murders; 2024 Olympics; enraging Google, Uber ads; living that iPad life; Humanes being returned; Sonos $30 million app fail; Lego Star Wars; advanced hinges; DarJar; losing power; medical system problems.

Sponsors:
- Mood - Run it back with 20% off your first order and FREE gummies. Go to hellomood.com and use promo code GOG.
- 1Password Extended Access Management - Check it out at 1Password.com/xam. Secure every sign-in for every app on every device.
- DeleteMe - Head over to JoinDeleteMe.com/GOG and use the code "GOG" for 20% off.
- Private Internet Access - Go to GOG.Show/vpn and sign up today. For a limited time only, you can get OUR favorite VPN for as little as $2.03 a month.
- SetApp - With a single monthly subscription you get 240+ apps for your Mac. Go to SetApp and get started today!!!

Show notes at https://gog.show/660

FOLLOW UP
- Illinois changes biometric privacy law to help corporations avoid big payouts
- Trump-themed $DJT token rug-pulls, people blame Martin Shkreli or Barron Trump
- FTX to pay $12.7B to victims of Sam Bankman-Fried's massive scheme

IN THE NEWS
- Google 'is a monopolist' in search, US judge rules in antitrust case
- It's practically impossible to run a big AI company ethically
- Nvidia Used Videos From Netflix and YouTube to Build AI Model
- Elon Musk sues OpenAI, Sam Altman for making a "fool" out of him
- X accused of using EU user data to train Grok without consent
- X agrees to pause EU data processing for training Grok
- Meta and Google secretly targeted minors on YouTube with Instagram ads
- Meta is reportedly offering millions to get Hollywood voices into its AI projects
- OpenAI confirms it's looking into text watermarking for ChatGPT that could expose cheating students
- OpenAI Warns Users Could Become Emotionally Hooked on Its Voice Mode
- Neuralink successfully implants its chip into a second patient's brain
- As Neuralink Implants Second Subject's Brain, First Patient Says His Doesn't Work as Well Anymore
- Gates, Bezos back $75M round for Synchron's drill-free brain-computer interface implant
- What Synchron leaders learned from developing a stent-based BCI
- Fiery Cybertruck Death Spurs Federal Probe

MEDIA CANDY
- Deadpool & Wolverine
- American Conspiracy: The Octopus Murders
- Google will no longer air an Olympics ad that showed a child using AI to write a fan letter
- Google + Team USA — Dear Sydney
- On Our Way - Uber

APPS & DOODADS
- ESR for iPad Pro 12.9 Case (2022/2021/2020, 6th/5th/4th Generation), iPad Air 13 Inch Case 2024 M2, Convenient Magnetic Attachment, 2-Way Stand, Rebound Magnetic Case, Black
- Humane's daily returns are outpacing sales
- Californians Are Getting Apple Wallet Driver's Licenses This Year
- Sonos' $30M app fail is cautionary tale against rushing unnecessary updates
- Home Security Giant ADT Admits It Has Been Hacked
- Thursday, the dating app that you can use only on Thursdays, expands to San Francisco
- Multitasking with ChatGPT on the Mac just got a lot better

THE DARK SIDE WITH DAVE
- Dave Bittner
- The CyberWire
- Hacking Humans
- Caveat
- Control Loop
- Only Malware in the Building
- LEGO® Star Wars: Rebuild the Galaxy
- The Dark Falcon

CLOSING SHOUT-OUTS
- Texas school bans all-black clothing, cites mental health concerns
From building a data platform and Parquet at Twitter to using AI to make biology easier to engineer at Ginkgo Bioworks, Dmitriy joins the show to chat about the early days of big data, the conversation that made him jump into SynBio, LLMs for proteins and more. Segments: (00:03:18) Data engineering roots (00:05:40) Early influences at Lawrence Berkeley Lab (00:09:46) Value of a "gentleman's education in computer science" (00:14:34) The end of junior software engineers (00:20:10) Deciding to go back to school (00:21:36) Early experiments with distributed systems (00:23:33) The early days of big data (00:29:16) "The thing we used to call big data is now ai" (00:31:02) The maturation of data engineering (00:35:05) From consumer tech to biotech (00:37:42) "The 21st century is the century of biology" (00:40:54) The science of lab automation (00:47:22) Software development in biotech vs. consumer tech (00:50:34) Swes make more $$ than scientists? (00:54:27) Llms for language is boring. Llms for proteins? that's cool (01:02:52) Protein engineering 101 (01:06:01) Model explainability in biology Show Notes: The Death of the Junior Developer: https://sourcegraph.com/blog/the-death-of-the-junior-developer Dmitriy on twitter: https://x.com/squarecog?lang=en Tech and Bio slack community: https://www.bitsinbio.org/ Stay in touch: - Make Ronak's day by signing up for our newsletter to get our favorites parts of the convo straight to your inbox every week :D https://softwaremisadventures.com/ Music: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Nathan interviews Mike Knoop, co-founder of Zapier and co-creator of the ARC Prize, about the $1 million competition for more efficient AI architectures. They discuss the ARC AGI benchmark, its implications for general intelligence, and the potential impact on AI safety. Nathan reflects on the challenges of intuitive problem-solving in AI and considers hybrid approaches to AGI development. Apply to join over 400 founders and execs in the Turpentine Network: https://hmplogxqz0y.typeform.com/to/JCkphVqj RECOMMENDED PODCAST: Patrick McKenzie (@patio11) talks to experts who understand the complicated but not unknowable systems we rely on. You might be surprised at how quickly Patrick and his guests can put you in the top 1% of understanding for stock trading, tech hiring, and more. Spotify: https://open.spotify.com/show/3Mos4VE3figVXleHDqfXOH Apple: https://podcasts.apple.com/us/podcast/complex-systems-with-patrick-mckenzie-patio11/id1753399812 SPONSORS: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist. CHAPTERS: (00:00:00) About the Show (00:06:06) The ARC Benchmark (00:09:34) Other Benchmarks (00:10:58) Definition of AGI (00:14:38) The rules of the contest (00:18:16) ARC test set (Part 1) (00:18:23) Sponsors: Oracle | Brave (00:20:31) ARC test set (Part 2) (00:22:50) Stair-stepping benchmarks (00:26:17) ARC Prize (00:28:34) The rules of the ARC Prize (00:31:12) Compute costs (Part 1) (00:34:47) Sponsors: Omneky | Squad (00:36:34) Compute costs (Part 2) (00:36:40) Compute Limit (00:41:00) Public Leaderboard (00:42:58) The current AI ecosystem (00:47:23) The four steps of solving a puzzle (00:51:20) Intuition (00:54:32) Human Intelligence (00:56:06) Current Frontier Language Models (00:57:44) Program Synthesis (01:04:10) Is the model learning or memorizing? 
(01:09:51) Improving the ARC dataset (01:11:34) Step 3: Guessing the Rule (01:12:51) Dealing with Ambiguity (01:15:02) Exploring Solutions (01:17:02) Non-backpropagation evolutionary architecture search (01:19:49) Expectations for an AGI world (01:24:11) Reliability and out of domain generalization (01:28:35) What a person would do (01:29:51) What is the right generalization (01:35:32) The ARC AGI Challenge (01:37:01) Postscript (01:38:07) DSPy (01:39:55) State space models (01:43:28) Hybrid models (01:48:32) FunSearch (01:50:41) Kolmogorov-Arnold Networks (01:54:18) Grokking (01:55:42) Outro
Tune in to this special episode of the Cognitive Revolution for an AI scouting report. We cover the state of the art in AI applications for medicine, key AI concepts, and how these technologies could influence society. SPONSORS: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ RECOMMENDED PODCAST - The Riff with Byrne Hobart Byrne Hobart, the writer of The Diff, is revered in Silicon Valley. You can get an hour with him each week. See for yourself how his thinking can upgrade yours. Spotify: https://open.spotify.com/show/6rANlV54GCARLgMOtpkzKt Apple: https://podcasts.apple.com/us/podcast/the-riff-with-byrne-hobart-and-erik-torenberg/id1716646486 CHAPTERS: (00:00:00) Introduction (00:06:13) Current State of the Art (00:08:34) Medicine (00:13:49) AI limitations (00:17:01) AI capabilities (00:17:02) Sponsors: Oracle | Brave (00:19:09) How AI works (00:24:05) Information processing (00:28:51) Curated data sets (00:32:52) Sponsors: Squad | Omneky (00:34:38) Transformer (00:37:11) Scaling (00:39:59) Emergence (00:44:03) Grokking (00:50:30) Best practices for business (00:53:38) Live Players (00:58:33) What to Watch (01:00:35) Final Thoughts
Fredrik is joined by Eric Normand for a discussion of debugging your ideas through domain modeling, using Eric’s concept of lenses to find more good questions to ask. Eric is writing a book about domain modeling and has developed the concept of lenses - ways to look at various aspects of your domain, model, and code in order to better consider various solutions and questions. Why? Because design is needed, but is easily lost in the modern urge to be fast and agile. There’s a lot you can and need to do on the way to a working system. Eric pushes for design which is an integral part, preferably right in the code, rather than a separate one which can become outdated and separated without anyone noticing. Just spend a little more time on it. Tricks for seeing your domain with fresher eyes. Change is not always maximal and unpredictable! But thinking it is can lead to a lot of indirection and abstraction where a single if-statement could have sufficed for years. Refactoring as a way of finding the seams in your model. What is the code actually supposed to do? How does it actually fit with the domain? Recorded during Øredev 2023, where Eric gave two presentations about the topics discussed: Better software design with domain modeling and Stratified design and functional architecture. Thank you Cloudnet for sponsoring our VPS! Comments, questions or tips? We are @kodsnack, @tobiashieta, @oferlund and @bjoreman on Twitter, have a page on Facebook and can be emailed at info@kodsnack.se if you want to write longer. We read everything we receive. If you enjoy Kodsnack we would love a review in iTunes! You can also support the podcast by buying us a coffee (or two!) through Ko-fi. Links Eric Eric’s Øredev 2023 presentations: Better software design with domain modeling Stratified design and functional architecture Eric has his own podcast Grokking simplicity - Eric’s book on functional programming Domain modeling Waterfall UML Clojure REPL - Read-evaluate-print loop Kodsnack 294 - the episode where Dan Lebrero gave Fredrik a feel for REPL-driven development Domain modeling lenses Drawing on the right side of the brain The “keynote yesterday” - Na’Tosha Bard about code outliving you (see also episode 558) Then a miracle occurs Titles I’m really on to something Anti-design trend In a waterfall world On the way to code Experimentation in code Not about moving your hand I don’t want rules Yes, that’s the right question! Take five minutes Spending more time on it Code lets me play with ideas I’m happy working on a whiteboard Debug your ideas Server babysitters
Art's unique background: the creator of Energetic NLP, NLP trainer, medical researcher, healer, intuitive, leadership and team specialist for major corporations; 4 decades studying psychic development, plus energetic and spiritual healing. Art has been blessed to receive two empowerments directly from the Dalai Lama of Tibet (including one to increase his ability to heal others). Art Has Been Featured On: The Love Code, the award-winning Dare to Dream podcast, Happy Insights, Leading Visionaries, Sense of Soul, One World in a New World, Zen Success, The Health & Happiness show, Awakenings, Remember Who You Are, Science & Spirituality Podcast, Change It Up Radio, Superpower Experts, JeffMara Podcast, Ghostman Radio, The Prosperity show, Influencers Channel, 360 Wisdom Speaks, Sunday Soul, the Self-Help Coaching podcast, Lisa Davies, Adam Markel, Loretta Brown Show, Laugh Box, Prosperity Place, Bringing Inspiration To Earth Radio show, Remember Who You Are, and Spiritually Inspired. He's been featured on hundreds of radio, television, podcast shows and in magazines. Art will be featured in an upcoming prominent documentary film, to be released next year. Art will promote the interview to his following. Art is a master at opening up people's abilities: — Learn to use universal energies to transform lives. — Read people's energy fields through the clairs and psychic abilities — Facilitate remote energy work and work with anyone — Accelerating Personal/Spiritual Growth while Activating Your Miraculous Abilities! Art's Talking Points: 1) Reprogramming Your Chakras to Thrive in Turbulent Times 2) Confidently be guided by your Inner Wisdom and Spirit 3) 3 Energy Techniques You Need to Know · See Art Giser's bio, interview questions, talking points, and free gift below. One of Art's areas of mastery is remote energy work. Your listeners will receive transformative energies during the podcast, even if it is a recording! BIO: Meet Art Giser, a visionary powerhouse with a lifetime of diverse experiences that have shaped him into a true trailblazer. With an impressive background spanning 39 years, Art has immersed himself in the realms of NLP, scientific exploration, psychic development, and the profound art of energetic and spiritual healing. His journey has been nothing short of extraordinary. In his earlier years, Art's insatiable curiosity led him to a role as a research lab manager at the esteemed University of California, San Francisco Medical School for eleven years. This foundational experience instilled in him a deep appreciation for the intricacies of human potential and sparked a lifelong passion for understanding the unseen dimensions that shape our reality. As an executive coach, Art has become an invaluable asset to some of the world's largest corporations. His unique approach empowers leaders and fosters the creation of highly collaborative and high-performing teams. Drawing from his rich tapestry of knowledge, he has crafted an innovative coaching style that transcends the boundaries of conventional methods. This has earned him a reputation as a sought-after mentor for those seeking transformative growth and profound shifts in both their personal and professional lives. Some of his clients include Sony, HP, Intel, Pfizer, and Merrill Lynch.
NATS provides incredibly powerful usage patterns and one of the most fundamental is streams and consumers. This pairing is a super power and can solve many challenges in the simplest way. We cover the bare fundamentals of NATS streams and how to consume data from them with some model use cases to help the exploration. Want to feel more confident about how to embrace data streams? Take a listen as we go back to basics. --- If you like this video, be sure to hit like and subscribe! Want to suggest a topic? Find us on Twitter (https://twitter.com/nats_io), LinkedIn (https://www.linkedin.com/showcase/nat...), or Slack (https://slack.nats.io)! Here are some useful links for NATS! - Website - https://nats.io - Examples - https://natsbyexample.com - Docs - https://docs.nats.io The home of the NATS team: https://synadia.com
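A minimal companion sketch of the stream-and-consumer pairing covered in the episode, using the official Python client. Assumptions: `pip install nats-py`, a local server started with JetStream enabled (`nats-server -js`), and invented stream, subject, and consumer names.

```python
import asyncio
import nats

async def main():
    nc = await nats.connect("nats://localhost:4222")
    js = nc.jetstream()

    # A stream captures and stores every message published on matching subjects.
    await js.add_stream(name="EVENTS", subjects=["events.>"])
    await js.publish("events.page_view", b'{"user": 1}')

    # A durable pull consumer remembers its position in the stream across restarts.
    sub = await js.pull_subscribe("events.>", durable="worker")
    for msg in await sub.fetch(1):
        print(msg.data)
        await msg.ack()  # acknowledge so the message is not redelivered

    await nc.close()

asyncio.run(main())
```

The stream decouples producers from consumers: publishers only know the subject, while any number of durable consumers can each replay or resume the stored history at their own pace.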
Join Arize Co-Founder & CEO Jason Lopatecki, and ML Solutions Engineer, Sally-Ann DeLucia, as they discuss “Explaining Grokking Through Circuit Efficiency.” This paper explores novel predictions about grokking, providing significant evidence in favor of the authors' explanation. Most strikingly, the research conducted in this paper demonstrates two novel and surprising behaviors: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalization to partial rather than perfect test accuracy. Find the transcript and more here: https://arize.com/blog/explaining-grokking-through-circuit-efficiency-paper-reading/ To learn more about ML observability, join the Arize AI Slack community or get the latest on our LinkedIn and Twitter.
Neel Nanda is a researcher at Google DeepMind working on mechanistic interpretability. He is also known for his YouTube channel where he explains what is going on inside of neural networks to a large audience. In this conversation, we discuss what is mechanistic interpretability, how Neel got into it, his research methodology, his advice for people who want to get started, but also papers around superposition, toy models of universality and grokking, among other things. Youtube: https://youtu.be/cVBGjhN4-1g Transcript: https://theinsideview.ai/neel OUTLINE (00:00) Intro (00:57) Why Neel Started Doing Walkthroughs Of Papers On Youtube (07:59) Induction Heads, Or Why Nanda Comes After Neel (12:19) Detecting Induction Heads In Basically Every Model (14:35) How Neel Got Into Mechanistic Interpretability (16:22) Neel's Journey Into Alignment (22:09) Enjoying Mechanistic Interpretability And Being Good At It Are The Main Multipliers (24:49) How Is AI Alignment Work At DeepMind? (25:46) Scalable Oversight (28:30) Most Ambitious Degree Of Interpretability With Current Transformer Architectures (31:05) To Understand Neel's Methodology, Watch The Research Walkthroughs (32:23) Three Modes Of Research: Confirming, Red Teaming And Gaining Surface Area (34:58) You Can Be Both Hypothesis Driven And Capable Of Being Surprised (36:51) You Need To Be Able To Generate Multiple Hypotheses Before Getting Started (37:55) All the theory is bullshit without empirical evidence and it's overall dignified to make the mechanistic interpretability bet (40:11) Mechanistic interpretability is alien neuroscience for truth seeking biologists in a world of math (42:12) Actually, Othello-GPT Has A Linear Emergent World Representation (45:08) You Need To Use Simple Probes That Don't Do Any Computation To Prove The Model Actually Knows Something (47:29) The Mechanistic Interpretability Researcher Mindset (49:49) The Algorithms Learned By Models Might Or Might Not Be Universal (51:49) On The Importance Of Being Truth Seeking And Skeptical (54:18) The Linear Representation Hypothesis: Linear Representations Are The Right Abstractions (00:57:26) Superposition Is How Models Compress Information (01:00:15) The Polysemanticity Problem: Neurons Are Not Meaningful (01:05:42) Superposition and Interference are at the Frontier of the Field of Mechanistic Interpretability (01:07:33) Finding Neurons in a Haystack: Superposition Through De-Tokenization And Compound Word Detectors (01:09:03) Not Being Able to Be Both Blood Pressure and Social Security Number at the Same Time Is Prime Real Estate for Superposition (01:15:02) The Two Differences Of Superposition: Computational And Representational (01:18:07) Toy Models Of Superposition (01:25:39) How Mentoring Nine People at Once Through SERI MATS Helped Neel's Research (01:31:25) The Backstory Behind Toy Models of Universality (01:35:19) From Modular Addition To Permutation Groups (01:38:52) The Model Needs To Learn Modular Addition On A Finite Number Of Token Inputs (01:41:54) Why Is The Paper Called Toy Model Of Universality (01:46:16) Progress Measures For Grokking Via Mechanistic Interpretability, Circuit Formation (01:52:45) Getting Started In Mechanistic Interpretability And Which Walkthroughs To Start With (01:56:15) Why Does Mechanistic Interpretability Matter From an Alignment Perspective (01:58:41) How Detecting Deception With Mechanistic Interpretability Compares to Collin Burns' Work (02:01:20) Final Words From Neel
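On the "simple probes" point in the outline (45:08): the probe must do no computation of its own, so a single linear layer is fit on frozen activations, and any accuracy it reaches must already be present in the representation. Below is a self-contained sketch of that method; the activations and labels are synthetic stand-ins invented for illustration, not anyone's actual experiment.

```python
import numpy as np

rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 512))       # stand-in for frozen residual-stream activations
w_true = rng.normal(size=512)
labels = (acts @ w_true > 0).astype(int)  # synthetic binary feature (e.g. a board state)

# One linear layer, no hidden computation: logistic regression by gradient descent.
w = np.zeros(512)
for _ in range(500):
    p = 1 / (1 + np.exp(-(acts @ w)))
    w -= 0.1 * acts.T @ (p - labels) / len(labels)

acc = ((acts @ w > 0) == labels).mean()
print(f"probe accuracy: {acc:.2f}")       # high accuracy means linearly decodable
```

Because the probe is linear, success is evidence the model represents the feature as a direction in activation space, which is exactly the linear representation hypothesis discussed at (54:18).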
www.BlockBuster7.com www.ChangeTheWorldFromYourLivingRoom.com Art's unique background: the creator of Energetic NLP, NLP trainer, medical researcher, healer, intuitive, leadership and team specialist for major corporations; 4 decades studying psychic development, plus energetic and spiritual healing. Art has been blessed to receive two empowerments directly from the Dalai Lama of Tibet (including one to increase his ability to heal others). Art Has Been Featured On: The Love Code, the award-winning Dare to Dream podcast, Happy Insights, Leading Visionaries, Sense of Soul, One World in a New World, Zen Success, The Health & Happiness show, Awakenings, Remember Who You Are, Science & Spirituality Podcast, Change It Up Radio, Superpower Experts, JeffMara Podcast, Ghostman Radio, The Prosperity show, Influencers Channel, 360 Wisdom Speaks, Sunday Soul, the Self-Help Coaching podcast, Lisa Davies, Adam Markel, Loretta Brown Show, Laugh Box, Prosperity Place, Bringing Inspiration To Earth Radio show, Remember Who You Are, and Spiritually Inspired. He's been featured on hundreds of radio, television, podcast shows and in magazines. Art will be featured in an upcoming prominent documentary film, to be released next year. To Book Art on your program, email: · Debbi: debontheradio@gmail.com · Art: Art@EnergeticNLP.com Art will promote the interview to his following. Art is a master at opening up people's abilities: — Learn to use universal energies to transform lives. — Read people's energy fields through the clairs and psychic abilities — Facilitate remote energy work and work with anyone — Accelerating Personal/Spiritual Growth while Activating Your Miraculous Abilities! Art's Talking Points: 1) Reprogramming Your Chakras to Thrive in Turbulent Times 2) Confidently be guided by your Inner Wisdom and Spirit 3) 3 Energy Techniques You Need to Know · See Art Giser's bio, interview questions, talking points, and free gift below. One of Art's areas of mastery is remote energy work. Your listeners will receive transformative energies during the podcast, even if it is a recording! BIO: Meet Art Giser, a visionary powerhouse with a lifetime of diverse experiences that have shaped him into a true trailblazer. With an impressive background spanning 39 years, Art has immersed himself in the realms of NLP, scientific exploration, psychic development, and the profound art of energetic and spiritual healing. His journey has been nothing short of extraordinary. In his earlier years, Art's insatiable curiosity led him to a role as a research lab manager at the esteemed University of California, San Francisco Medical School for eleven years. This foundational experience instilled in him a deep appreciation for the intricacies of human potential and sparked a lifelong passion for understanding the unseen dimensions that shape our reality.
The podcast episode features a discussion between the host and Ishan Sharma, an AI practitioner, on various facets of artificial intelligence, particularly focusing on AI safety. The episode covers the following key points: AI Safety Debate: Ishan explains the concept of "Grokking" — an AI's deep understanding of data — and suggests that it contributes to mistrust in AI systems. He outlines two main camps in the AI safety debate: AI accelerationists, who downplay risks and advocate for rapid progress, and AI doomers, who emphasize the potentially catastrophic risks of AI. Primary Concerns: Ishan mentions three major concerns regarding AI safety: unpredictable emergence of capabilities, alignment with human values, and risks from deceptive AI. Safety Interventions: Various safety measures are proposed, ranging from extreme actions like pausing AI development to more moderate ones like better governance and oversight. Current Limitations: Ishan points out that current AI systems, like transformer architectures, are nearing their peak performance and that future advancements might require experiential learning akin to human experiences. Recorded Sept 10th, 2023. Other ways to connect Follow us on X and Instagram Follow Shubham on X Follow Ishan on X
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Explaining grokking through circuit efficiency, published by Vikrant Varma on September 8, 2023 on The AI Alignment Forum. This is a linkpost for our paper "Explaining grokking through circuit efficiency", which provides a general theory explaining when and why grokking (aka delayed generalisation) occurs, and makes several interesting and novel predictions which we experimentally confirm (introduction copied below). You might also enjoy our explainer on X/Twitter.

Abstract
One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when the task admits a generalising solution and a memorising solution, where the generalising solution is slower to learn but more efficient, producing larger logits with the same parameter norm. We hypothesise that memorising circuits become more inefficient with larger training datasets while generalising circuits do not, suggesting there is a critical dataset size at which memorisation and generalisation are equally efficient. We make and confirm four novel predictions about grokking, providing significant evidence in favour of our explanation. Most strikingly, we demonstrate two novel and surprising behaviours: ungrokking, in which a network regresses from perfect to low test accuracy, and semi-grokking, in which a network shows delayed generalisation to partial rather than perfect test accuracy.

Introduction
When training a neural network, we expect that once training loss converges to a low value, the network will no longer change much. Power et al. (2021) discovered a phenomenon dubbed grokking that drastically violates this expectation. The network first "memorises" the data, achieving low and stable training loss with poor generalisation, but with further training transitions to perfect generalisation. We are left with the question: why does the network's test performance improve dramatically upon continued training, having already achieved nearly perfect training performance? Recent answers to this question vary widely, including the difficulty of representation learning (Liu et al., 2022), the scale of parameters at initialisation (Liu et al., 2023), spikes in loss ("slingshots") (Thilak et al., 2022), random walks among optimal solutions (Millidge et al., 2022), and the simplicity of the generalising solution (Nanda et al., 2023, Appendix E). In this paper, we argue that the last explanation is correct, by stating a specific theory in this genre, deriving novel predictions from the theory, and confirming the predictions empirically. We analyse the interplay between the internal mechanisms that the neural network uses to calculate the outputs, which we loosely call "circuits" (Olah et al., 2020). We hypothesise that there are two families of circuits that both achieve good training performance: one which generalises well (Cgen) and one which memorises the training dataset (Cmem). The key insight is that when there are multiple circuits that achieve strong training performance, weight decay prefers circuits with high "efficiency", that is, circuits that require less parameter norm to produce a given logit value.
Efficiency answers our question above: if Cgen is more efficient than Cmem, gradient descent can reduce nearly perfect training loss even further by strengthening Cgen while weakening Cmem, which then leads to a transition in test performance. With this understanding, we demonstrate in Section 3 that three key properties are sufficient for grokking: (1) Cgen generalises well while Cmem does not, (2) Cgen is more efficient than Cmem, and (3) Cgen is learned more slowly than Cmem. Since Cgen generalises well, it automatically works ...
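To make the setup concrete, here is a minimal sketch of the kind of experiment the paper analyses: a small network trained on modular addition with weight decay, logging test accuracy and parameter norm. This is an illustrative reconstruction, not the authors' code; the hyperparameters are guesses, and grokking is sensitive to the training fraction and the amount of weight decay.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
P = 97                                             # modulus for the a + b (mod P) task
pairs = torch.cartesian_prod(torch.arange(P), torch.arange(P))
labels = (pairs[:, 0] + pairs[:, 1]) % P
perm = torch.randperm(len(pairs))
n_train = len(pairs) // 3                          # small training fraction encourages grokking
train_idx, test_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(P, 64),                           # shared embedding for both operands
    nn.Flatten(start_dim=1),                       # concatenate the two operand embeddings
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, P),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):                         # long training; shrink for a quick smoke test
    opt.zero_grad()
    loss = loss_fn(model(pairs[train_idx]), labels[train_idx])
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            test_acc = (model(pairs[test_idx]).argmax(-1) == labels[test_idx]).float().mean().item()
            norm = torch.sqrt(sum(p.pow(2).sum() for p in model.parameters())).item()
        print(f"step {step}: train_loss {loss.item():.3f}  test_acc {test_acc:.3f}  param_norm {norm:.1f}")
```

If grokking occurs, training loss falls early while test accuracy sits near chance for a long stretch and then climbs; the parameter norm dropping as test accuracy rises is a crude view of the efficiency story above, with weight decay trading the high-norm memorising circuit for the more efficient generalising one.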
What is effective population size (Ne), and why is it important? Robin Waples takes us back to the basics of this important evolutionary concept and discusses his new paper, using simulations to demonstrate that Bill Hill's 1972 equation for calculating Ne still works for populations with extreme reproductive patterns.
In this episode, Nathan sits down with Paige Bailey, Lead Product Manager of Generative Models at Google Deepmind. In this conversation, they discuss what it's like to be a PM for an LLM as opposed to an app, defining ideal LLM behaviour, and reasoning - how do you distinguish real abilities vs pattern matching? If you're looking for an ERP platform, check out our sponsor, NetSuite: http://netsuite.com/cognitive RECOMMENDED PODCAST: The HR industry is at a crossroads. What will it take to construct the next generation of incredible businesses – and where can people leaders have the most business impact? Hosts Nolan Church and Kelli Dragovich have been through it all, the highs and the lows – IPOs, layoffs, executive turnover, board meetings, culture changes, and more. With a lineup of industry vets and experts, Nolan and Kelli break down the nitty-gritty details, trade offs, and dynamics of constructing high performing companies. Through unfiltered conversations that can only happen between seasoned practitioners, Kelli and Nolan dive deep into the kind of leadership-level strategy that often happens behind closed doors. Check out the first episode with the architect of Netflix's culture deck Patty McCord. https://link.chtbl.com/hrheretics TIMESTAMPS: (00:00) Episode Preview (00:01:15) Introducing Paige Bailey (00:04:21) Paige's background at Google Brain and the Deepmind merger (00:07:00) PM for an LLM vs being a PM for an app (00:11:21) The development timeline and compute budget of PaLM-2 (00:14:30) Paige's role in the PaLM 2 project (00:15:30) Sponsors: Netsuite | Omneky (00:17:26) Defining desired capabilities for PaLM-2 (00:19:17) The amount of work that went into elevating PaLM 2 from PaLM 1 (00:20:28) Has Google lost its ability to ship? (00:24:24) Paige's "eureka" moment seeing GitHub Copilot capabilities (00:27:47) Comparing PaLM 2 with other models (00:32:20) Grokking and the predictability of emergent capabilities (00:37:30) Citizen scientists and the multilingual capabilities of PaLM 2 (00:39:29) Distinguishing real reasoning vs pattern matching (00:45:51) Products using PaLM-2 that people should try (00:50:35) Most exciting AI projects that you can try out (00:52:29) Curriculum learning and successor to the transformer LINKS: PaLM 2 Duet AI for developers Avenging Polanyi's Revenge X/TWITTER: @DynamicWebPaige (Paige) @labenz (Nathan) @eriktorenberg @CogRev_Podcast SPONSORS: NetSuite | Omneky NetSuite has 25 years of providing financial software for all your business needs. More than 36,000 businesses have already upgraded to NetSuite by Oracle, gaining visibility and control over their financials, inventory, HR, eCommerce, and more. If you're looking for an ERP platform ✅ head to NetSuite: http://netsuite.com/cognitive and download your own customized KPI checklist. Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off. Music Credit: GoogleLM
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: An interactive introduction to grokking and mechanistic interpretability, published by Adam Pearce on August 7, 2023 on The AI Alignment Forum. Our write up largely agrees with @Quintin Pope's summary, with the addition of training trajectory visualizations and an explanation of the MLP construction that solves modular addition. A meta note that didn't make it into the article - with so many people looking into this problem over the last 18 months, I'm surprised this construction took so long to find. The modular addition task with a 1-layer MLP is about as simple as you can get! Scaling mechanistic interpretability up to more complex tasks/models seems worth continuing to try, but I'm less sure extracting crisp explanations will be possible. Even if we "solve" superposition, figuring out the construction here - where there's no superposition in the generalizing model - wasn't trivial. gif/twitter summary If we train an MLP to solve modular addition, the generalizing phase has suggestive periodic patterns. To figure out why the model generalizes, we first look at a task where we know the generalizing solution - sparse parity. You can see the model generalizing as weight decay prunes spurious connections. One point from the Omnigrok paper I hadn't internalized before training lots of models: grokking only happens when hyper-parameters are just right. We can make other weird things happen too, like AdamW oscillating between low train loss and low weights. To understand how an MLP solves modular addition, we train a much smaller model with a circular input embedding baked in. Following @Neel Nanda and applying a discrete Fourier transform, we see larger models trained from scratch use the same star trick! Finally, we show what the stars are doing and prove that they work: Our ReLU activation has a small error, but it's close enough to the exact solution - an x² activation suggested in Grokking modular arithmetic - for the model to patch everything up w/ constructive interference. And there are still open questions: why are the frequencies with >5 neurons lopsided? Why does factoring Winput not do the same thing as factoring Woutput? Also see The Hydra Effect: Emergent Self-repair in Language Model Computations Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
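The Fourier check mentioned above is easy to reproduce in outline: take the input embedding matrix of a modular-arithmetic model and transform it along the token (residue) axis. The sketch below uses a random matrix as a stand-in for trained weights, so it only shows the mechanics of the check; in a grokked model, nearly all the energy concentrates on a handful of frequencies, the periodic "star" structure described in the post.

```python
import numpy as np

P, d = 97, 64
W_embed = np.random.default_rng(0).normal(size=(P, d))  # stand-in for a trained embedding
spectrum = np.abs(np.fft.rfft(W_embed, axis=0))          # DFT along the token (residue) axis
energy = spectrum.sum(axis=1)                            # total magnitude per frequency
top = np.argsort(energy)[::-1][:5]
print("dominant frequencies:", top)                      # random weights: flat; grokked: sharp peaks
```

Sharp peaks mean each embedding dimension is approximately a sinusoid in the input residue, which is what lets the network implement modular addition through trigonometric identities.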
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Visible loss landscape basins don't correspond to distinct algorithms, published by Mikhail Samin on July 28, 2023 on LessWrong. Thanks to Justis, Arthur Conmy, Neel Nanda, Joseph Miller, and Tilman Räuker for their feedback on a draft. I feel like many people haven't noticed an important result of mechanistic interpretability analysis of grokking, and so haven't updated how they think about loss landscapes and algorithms that neural networks end up implementing. I think this has implications for alignment research. When thinking about grokking, people often imagine something like this: the neural network implements Algorithm 1 (e.g., memorizes the training data), achieves ~ the lowest loss available via memorization, then moves around the bottom of the Algorithm 1 basin and after a while, stumbles across a path to Algorithm 2 (e.g., the general algorithm for modular addition). But the mechanistic interpretability of grokking analysis has shown that this is not true! Approximately from the start of the training, Algorithm 1 is most of what the circuits are doing and what almost entirely determines the neural network's output; but at the same time, the entire time the neural network's parameters visibly move down the wider basin, they don't just become better at memorization; they increasingly implement the circuits for Algorithm 1 and the circuits for Algorithm 2, in superposition. (Neel Nanda et al. have shown that the circuits that at the end implement the general algorithm for modular addition start forming approximately at the start of the training: the gradient was mostly an arrow towards memorization, but also, immediately from the initialization of the weights, a bit of an arrow pointing towards the general algorithm. The circuits were gradually tuned throughout the training. The noticeable change in the test loss starts occurring when the circuits are already almost right.) A path through the loss landscape visible in 3D doesn't correspond to how and what the neural network is actually learning. Almost all of the changes to the loss are due to the increasingly good implementation of Algorithm 1; but apparently, the entire time, the gradient also points towards some faraway implementation of Algorithm 2. Somehow, the direction in which Algorithm 2 lies is also visible to the derivative, and moving the parameters in the direction the gradient points means mostly increasingly implementing Algorithm 1, and also increasingly implementing the faraway Algorithm 2. "Grokking", visible in the test loss, is due to the change that happens when the parameters already implement Algorithm 2 accurately enough for the switch from mostly outputting the results of an implementation of Algorithm 1 to the results of an improving implementation of Algorithm 2 not to hurt the performance. Once it's the case, the neural network puts more weight into Algorithm 2 and at the same time quickly tunes it to be even more accurate (which is increasingly easy as the output is increasingly determined by the implementation of Algorithm 2). This is something many people seem to have missed. I did not expect it to be the case, was surprised, and updated how I think about loss landscapes. Does this generalize? Maybe. 
I'm not sure whether it's correct to generalize from the mechanistic interpretability of grokking analysis to neural networks in general; real LLMs are under-parameterised while the grokking model is very over-parameterised. But I guess it might be reasonable to expect that this is how deep learning generally works. People seem to think that multi-dimensional loss landscapes of neural networks have basins for specific algorithms, and neural networks get into these depending on how relatively large these basins are, which might be caused by how simple the algorithms are, how path-depe...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: QAPR 5: grokking is maybe not that big a deal?, published by Quintin Pope on July 23, 2023 on The AI Alignment Forum. [Thanks to support from Cavendish Labs and a Lightspeed grant, I've been able to restart the Quintin's Alignment Papers Roundup sequence.] Introduction Grokking refers to an observation by Power et al. (below) that models trained on simple modular arithmetic tasks would first overfit to their training data and achieve nearly perfect training loss, but that training well past the point of overfitting would eventually cause the models to generalize to unseen test data. The rest of this post discusses a number of recent papers on grokking. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of "grokking" a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. My opinion: When I first read this paper, I was very excited. It seemed like a pared-down / "minimal" example that could let us study the underlying mechanism behind neural network generalization. You can read more of my initial opinion on grokking in the post Hypothesis: gradient descent prefers general circuits. I now think I was way too excited about this paper, that grokking is probably a not-particularly-important optimization artifact, and that grokking is no more connected to the "core" of deep learning generalization than, say, the fact that it's possible for deep learning to generalize from an MNIST training set to the testing set. I also think that using the word "grokking" was anthropomorphizing and potentially misleading (like calling the adaptive information routing component of a transformer model its "attention"). Evocative names risk letting the connotations of the name filter into the analysis of the object being named. E.g., "grokking" brings connotations of sudden realization, despite the fact that the grokking phase in the above plot starts within the first ~5%-20% of the training process, though it appears much more abrupt due to the use of a base-10 logarithmic scale on the x-axis (see the quick check after this entry). "Grokking" also brings connotations of insight, realization, or improvement relative to some previously confused baseline. This leads to the impression that things which grok are better than things which don't. Humans often use the word "grokking" to mean deeply understanding complex domains that actually matter in the real world. Using the same word in an ML context suggests that ML grokking is relevant to whatever mechanisms might let an ML system deeply understand complex domains that actually matter in the real world.
I've heard several people say things like: Studying grokking could significantly advance ML capabilities, if doing so were to lead to a deeper understanding of the mechanisms underlying generalization in ML. Training long enough could eventually result in grokking occurring in ML domains of actual relevance, such as language, and thereby lead to sudden capabilities gains or break alignment properties. Grokking is an example of how thinking l...
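The log-axis point above is easy to check with arithmetic. A quick sketch (my own, with made-up step counts):

```python
# If training runs for 1e5 steps and test performance improves between steps
# 5e3 and 2e4 (the first 5%-20% of training, measured linearly), a log10
# x-axis pushes that window to the far right of the plot, which is why the
# transition reads as late and abrupt.
import math

total = 1e5
start, end = 5e3, 2e4

def lin_pos(step):
    return step / total                          # position on a linear axis

def log_pos(step):
    return math.log10(step) / math.log10(total)  # position on a log10 axis

print(f"linear axis: {lin_pos(start):.0%} -> {lin_pos(end):.0%} of the plot")
print(f"log10 axis:  {log_pos(start):.0%} -> {log_pos(end):.0%} of the plot")
```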
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Eric Michaud on the Quantization Model of Neural Scaling, Interpretability and Grokking, published by Michaël Trazzi on July 12, 2023 on The AI Alignment Forum. Eric is a PhD student in the Department of Physics at MIT working with Max Tegmark on improving our scientific/theoretical understanding of deep learning -- understanding what deep neural networks do internally and why they work so well. We mostly talk about Eric's paper, The Quantization Model of Neural Scaling, but also two papers he recently published on grokking: Towards Understanding Grokking: an effective theory of representation learning, and Omnigrok: Grokking Beyond Algorithmic Data. Below are some highlighted quotes from our conversation (available on Youtube, Spotify, Google Podcast, Apple Podcast). For the full context for each of these quotes, you can find the accompanying transcript. On The Quantization Of Neural Scaling "The name of the paper is the quantization model of neural scaling. And the one-tweet summary is that it's possible for smooth loss curves on average to average over lots of small, discrete phase changes in the network performance. What if there were a bunch of things that you need to learn to do prediction well in something like language? And so these things could be pieces of knowledge or different abilities to perform certain types of specific computations. We can imagine enumerating this set of things that you need to learn to do prediction well. And we call these the quanta of the prediction problem. And then what if the frequency in natural data that these were useful, each of these quanta, each of these pieces of knowledge or computational ability, what if the frequency that they were useful for prediction followed a power law?" (context) Quanta are the smallest clusters for simple subtasks "In order to predict the new line, [it] has to count line lengths for the previous lines in the document. And then it's able to use that to accurately predict when a new line should be present. And you can find just a large number of clusters where the thing that is common between the clusters just seems to be that it's the same type of problem, or doing prediction on those samples requires the same piece of knowledge. And so you might call these the quanta, or evidence of there being quanta, although it's a little bit tricky, because we, in doing the clustering, enforce this discreteness, where everything is a member of a cluster, a particular cluster, and not another cluster. Anyway, it's complicated and weird. Who knows whether this is even the right model for thinking about the networks." (context) What the existence of quanta would mean for interpretability "It would be very exciting if it was the true model, because it would maybe tell you that there were these set of things where, if you enumerated them, you could understand the network's performance and understood what it has learned. It's just like, ah, there's this set of pieces of knowledge or pieces of computation that are needed. And you could describe what these are. You could find them in the network and maybe hope to mechanistically understand the whole network by decomposing it into how it implements each one of these things, how it learns each piece of knowledge or each piece of computation."
How Quantization of Neural Scaling relates to other lines of research like Grokking, or interpretability "With both the quanta scaling stuff and with the grokking stuff, we sort of hope to identify these maybe mechanisms in the model that are responsible for certain behaviors or for the model generalizing. And in the case of grokking, there's sort of multiple circuits or multiple mechanisms that are going on in the model or something where there's a memorizing mechanism and a generalizing mechanism. [...] And maybe just in general beyond ...
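The power-law story above lends itself to a toy simulation. A sketch (my own, with assumed numbers, not an experiment from the paper): if a model has learned the k most frequently useful quanta, its expected loss is the total frequency of the quanta it is still missing, and that falls smoothly as a power law even though each quantum is learned discretely.

```python
# Quantization-hypothesis toy model: quanta have power-law usefulness
# frequencies; a model that knows the top-k quanta errs exactly on samples
# that need an unlearned quantum, so the loss curve is smooth on average.
import numpy as np

n_quanta = 10_000
alpha = 1.5                                # assumed power-law exponent
freq = np.arange(1, n_quanta + 1) ** -alpha
freq /= freq.sum()                         # P(a sample needs quantum i)

for k in (10, 100, 1_000, 10_000):
    loss = freq[k:].sum()                  # mass of still-unlearned quanta
    print(f"quanta learned: {k:>6}   expected loss ~ {loss:.4f}")
```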
Eric is a PhD student in the Department of Physics at MIT working with Max Tegmark on improving our scientific/theoretical understanding of deep learning -- understanding what deep neural networks do internally and why they work so well. This is part of a broader interest in the nature of intelligent systems, which previously led him to work with SETI astronomers, with Stuart Russell's AI alignment group (CHAI), and with Erik Hoel on a project related to integrated information theory. Transcript: https://theinsideview.ai/eric Youtube: https://youtu.be/BtHMIQs_5Nw The Quantization Model of Neural Scaling: https://arxiv.org/abs/2303.13506 An Effective Theory of Representation Learning: https://arxiv.org/abs/2205.10343 Omnigrok: Grokking Beyond Algorithmic Data: https://arxiv.org/abs/2210.01117
The Abide Project reported on Classis Grand Rapids East's Town Hall Meeting held at Calvin CRC in Grand Rapids. Although Paul DeVries afforded them a reprieve at Synod 2023, some are coming to the realization that they can't simply stonewall denominational efforts to resist their decisions with respect to Same Sex Marriage. This is some analysis of Abide's YouTube presentation. @abideproject CRCNA Synod Update 7 - Classis Grand Rapids East https://youtu.be/3FQu7ynbJx8 @crcna Synod 2023 https://www.youtube.com/live/XCbC6hmymC8?feature=share July Washington DC Event with Michael Martin and Spencer Klavan https://www.eventbrite.com/e/christ-and-community-in-the-modern-age-tickets-608591853617 Breakwater Festival Mannheim Germany October 27-29 2023 Event Details and Tickets: https://buytickets.at/breakwater/935800 T-shirts: https://buytickets.at/breakwater/store Discord: tinyurl.com/BreakwaterDiscord Festival Email: contact.breakwater@gmail.com Flyer https://bit.ly/breakwaterfestival2023 Paul Vander Klay clips channel https://www.youtube.com/channel/UCX0jIcadtoxELSwehCh5QTg Bridges of Meaning Discord https://discord.gg/rpf3zatJ https://www.meetup.com/sacramento-estuary/ My Substack https://paulvanderklay.substack.com/ Estuary Hub Link https://www.estuaryhub.com/ If you want to schedule a one-on-one conversation check here. https://paulvanderklay.me/2019/08/06/converzations-with-pvk/ There is a video version of this podcast on YouTube at http://www.youtube.com/paulvanderklay To listen to this on ITunes https://itunes.apple.com/us/podcast/paul-vanderklays-podcast/id1394314333 If you need the RSS feed for your podcast player https://paulvanderklay.podbean.com/feed/ All Amazon links here are part of the Amazon Affiliate Program. Amazon pays me a small commission at no additional cost to you if you buy through one of the product links here. This is one (free to you) way to support my videos. https://paypal.me/paulvanderklay Blockchain backup on Lbry https://odysee.com/@paulvanderklay https://www.patreon.com/paulvanderklay Paul's Church Content at Living Stones Channel https://www.youtube.com/channel/UCh7bdktIALZ9Nq41oVCvW-A To support Paul's work by supporting his church give here. https://tithe.ly/give?c=2160640
In this wide-ranging conversation, Tim Scarfe interviews Neel Nanda, a researcher at DeepMind working on mechanistic interpretability, which aims to understand the algorithms and representations learned by machine learning models. Neel discusses how models can represent their thoughts using motifs, circuits, and linear directional features which are often communicated via a "residual stream", an information highway models use to pass information between layers. Neel argues that "superposition", the ability for models to represent more features than they have neurons, is one of the biggest open problems in interpretability. This is because superposition thwarts our ability to understand models by decomposing them into individual units of analysis. Despite this, Neel remains optimistic that ambitious interpretability is possible, citing examples like his work reverse engineering how models do modular addition. However, Neel notes we must start small, build rigorous foundations, and not assume our theoretical frameworks perfectly match reality. The conversation turns to whether models can have goals or agency, with Neel arguing they likely can based on heuristics like models executing long term plans towards some objective. However, we currently lack techniques to build models with specific goals, meaning any goals would likely be learned or emergent. Neel highlights how induction heads, circuits models use to track long range dependencies, seem crucial for phenomena like in-context learning to emerge. On the existential risks from AI, Neel believes we should avoid overly confident claims that models will or will not be dangerous, as we do not understand them enough to make confident theoretical assertions. However, models could pose risks through being misused, having undesirable emergent properties, or being imperfectly aligned. Neel argues we must pursue rigorous empirical work to better understand and ensure model safety, avoid "philosophizing" about definitions of intelligence, and focus on ensuring researchers have standards for what it means to decide a system is "safe" before deploying it. Overall, a thoughtful conversation on one of the most important issues of our time. Support us! https://www.patreon.com/mlst MLST Discord: https://discord.gg/aNPkGUQtc5 Twitter: https://twitter.com/MLStreetTalk Neel Nanda: https://www.neelnanda.io/ TOC [00:00:00] Introduction and Neel Nanda's Interests (walk and talk) [00:03:15] Mechanistic Interpretability: Reverse Engineering Neural Networks [00:13:23] Discord questions [00:21:16] Main interview kick-off in studio [00:49:26] Grokking and Sudden Generalization [00:53:18] The Debate on Systematicity and Compositionality [01:19:16] How do ML models represent their thoughts [01:25:51] Do Large Language Models Learn World Models? [01:53:36] Superposition and Interference in Language Models [02:43:15] Transformers discussion [02:49:49] Emergence and In-Context Learning [03:20:02] Superintelligence/XRisk discussion Transcript: https://docs.google.com/document/d/1FK1OepdJMrqpFK-_1Q3LQN6QLyLBvBwWW_5z8WrS1RI/edit?usp=sharing Refs: https://docs.google.com/document/d/115dAroX0PzSduKr5F1V4CWggYcqIoSXYBhcxYktCnqY/edit?usp=sharing
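Since superposition is flagged as a central open problem, here is a minimal sketch of the idea (my own illustration, not Neel's code): store more features than neurons by assigning each feature a random direction in neuron space, and observe the interference that makes such models hard to decompose into individual units.

```python
# Superposition in miniature: 10 features packed into 4 neurons. The feature
# directions cannot all be orthogonal, so reading one feature back picks up
# interference from the others.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_neurons = 10, 4

W = rng.normal(size=(n_features, n_neurons))     # one direction per feature
W /= np.linalg.norm(W, axis=1, keepdims=True)    # unit length

x = np.zeros(n_features)
x[3] = 1.0                      # one active (sparse) feature
hidden = W.T @ x                # compress 10 features into 4 neurons
readout = W @ hidden            # project back onto every feature direction

print("feature 3 readout:    ", round(readout[3], 3))   # ~1.0
print("largest interference: ", round(np.max(np.abs(np.delete(readout, 3))), 3))
```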
Robby has a chat with software engineer, trainer, and author of the book Grokking Simplicity, Eric Normand (he/him/his). As Eric reflects on his experience, the first thought he has of well-maintained software is that it's like a unicorn (something you can't find in real life), but on a more serious note, he emphasizes that smaller code bases are easier to maintain and that the maintainability of a code base is also highly dependent on the team of engineers working on it. Robby and Eric dive into the pros and cons of microservices in small organizations, why teams need to ensure they know where they're going with the future of their codebase, Eric's book Grokking Simplicity, the differences within a functional programming language, higher-order functions (a small example follows this entry), Eric's journey toward authoring a technical book, and the value of engineers being able to get comfortable moving code around in a project without needing to ask for permission first. Tune in and enjoy! Book Recommendation: The Timeless Way of Building by Christopher W. Alexander. Helpful Links: Grokking Simplicity by Eric Normand (Manning); Eric's Website; Eric on LinkedIn; Eric on Twitter. Subscribe to Maintainable on: Apple Podcasts, Overcast, Spotify, or search "Maintainable" wherever you stream your podcasts. Keep up to date with the Maintainable Podcast by joining the newsletter.
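For readers who haven't met the term, a minimal illustration (mine, not from the book) of a higher-order function: a function that takes or returns another function, which is much of what makes the functional style composable.

```python
# A higher-order function: it takes a function and returns a new function.
def twice(f):
    """Return a function that applies f two times."""
    return lambda x: f(f(x))

def add3(x):
    return x + 3

add6 = twice(add3)    # composing add3 with itself
print(add6(10))       # -> 16
```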
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper Replication Walkthrough: Reverse-Engineering Modular Addition, published by Neel Nanda on March 12, 2023 on The AI Alignment Forum. I'm excited about trying different formats for mechanistic interpretability education! I've made a video walkthrough where we replicate my paper, Progress Measures for Grokking via Mechanistic Interpretability. With Jess Smith, one of my co-authors, we recorded ourselves coding a replication and discussed what we did at each step. This is a three-part walkthrough and you can see the accompanying code for the walkthrough here: In part 1, we train a model to perform modular addition, and see that it does grok! In part 2, we take this model and reverse-engineer the trig-based circuit it has learned for modular addition. We show that you can both read out intermediate steps of the circuit from the activations, and that you can just read off some of the algorithm's steps from the model weights. In part 3, we define some progress measures that let us distinguish progress towards the generalising and the memorising algorithms. We then look at the model during training and watch how the circuits develop, and use this to understand why it groks. This is an experiment with a new format, and I'd love to hear how useful you find it! Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
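For concreteness, here is a rough sketch (assumptions mine; the paper trains on modular addition with p = 113 and roughly a 30% training fraction, which I mirror) of the data setup that part 1 walks through:

```python
# Build every (a, b) pair for addition mod p, label it with (a + b) % p, and
# hold out a fixed random fraction as training data; the rest becomes the
# test set on which grokking eventually shows up.
import numpy as np

p, train_frac = 113, 0.3
rng = np.random.default_rng(0)

pairs = np.array([(a, b) for a in range(p) for b in range(p)])
labels = (pairs[:, 0] + pairs[:, 1]) % p

idx = rng.permutation(len(pairs))
n_train = int(train_frac * len(pairs))
train_idx, test_idx = idx[:n_train], idx[n_train:]
print(len(train_idx), "train examples,", len(test_idx), "test examples")
```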
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: 200 COP in MI: Interpreting Algorithmic Problems, published by Neel Nanda on December 31, 2022 on The AI Alignment Forum. This is the fourth post in a sequence called 200 Concrete Open Problems in Mechanistic Interpretability. Start here, then read in any order. If you want to learn the basics before you think about open problems, check out my post on getting started. I'll make another post every 1-2 days, giving a new category of open problems. If you want to read ahead, check out the draft sequence here! Motivation Motivating paper: A Mechanistic Interpretability Analysis of Grokking When models are trained on synthetic, algorithmic tasks, they often learn to do some clean, interpretable computation inside. Choosing a suitable task and trying to reverse engineer a model can be a rich area of interesting circuits to interpret! In some sense, this is interpretability on easy mode - the model is normally trained on a single task (unlike language models, which need to learn everything about language!), we know the exact ground truth about the data and optimal solution, and the models are tiny. So why care? I consider my work on grokking to be an interesting case study of this work going well. Grokking (shown below) is a mysterious phenomenon where, when small models are trained on algorithmic tasks (eg modular addition or modular division), they initially memorise the training data. But when they keep being trained on that data for a really long time, the model suddenly(ish) figures out how to generalise! In my work, I simplified their setup even further, by training a 1-layer transformer (with no LayerNorm or biases) to do modular addition and reverse-engineering the weights to understand what was going on. And it turned out to be doing a funky trig-based algorithm (shown below; see the numeric check after this entry), where the numbers are converted to frequencies with a memorised Discrete Fourier Transform, added using trig identities, and converted back to the answer! Using this, we looked inside the model and identified that despite seeming to have plateaued, in the period between memorising and "grokking", the model is actually slowly forming the circuit that does generalise. But so long as the model still has the memorising circuit, this adds too much noise to have good test loss. Grokking occurs when the generalising circuit is so strong that the model decides to "clean up" the memorising circuit, "uncovering" the mature generalising circuit beneath, and suddenly gets good test performance. OK, so I just took this as an excuse to explain my paper to you. Why should you care? I think the general lesson from this, which I'm excited to see applied elsewhere, is using toy algorithmic models to analyse a phenomenon we're confused about. Concretely, given a confusing phenomenon like grokking, I'd advocate the following strategy: Simplify to the minimal setting that exhibits the phenomenon, yet is complex enough to be interesting Reverse-engineer the resulting model, in as much detail as you can Extrapolate the insights you've learned from the reverse-engineered model - what are the broad insights you've learned? What do you expect to generalise? Can you form any automated tests to detect the circuits you've found, or any of their motifs?
Verify by looking at other examples of the phenomenon and seeing whether these insights actually hold (larger models, different tasks, even just earlier checkpoints of the model or different random seeds). Grokking is an example in a science-of-deep-learning context - trying to uncover mysteries about how models learn and behave. But this same philosophy also applies to understanding confusing phenomena in language models, and building toy algorithmic problems to study those! Anthropic's Toy Models of Superposition is an excellent example of this done well, for the case of ...
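Here is the numeric check promised above: a sketch (my own code) of the trig identity the reverse-engineered circuit exploits. One frequency suffices here because the modulus is prime; the actual model spreads the computation over several frequencies.

```python
# Represent a and b at frequency w = 2*pi*k/p, combine them with the angle
# addition identities, then score each candidate answer c by cos(w*(a+b-c)),
# which peaks exactly at c = (a + b) mod p.
import numpy as np

p, k = 113, 7                    # modulus and an arbitrary frequency index
w = 2 * np.pi * k / p
a, b = 45, 98

cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)

c = np.arange(p)
logits = cos_ab * np.cos(w * c) + sin_ab * np.sin(w * c)  # = cos(w*(a+b-c))
print("argmax:", logits.argmax(), "| expected:", (a + b) % p)   # both 30
```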
In today's episode, Jan-Paul, Chris, Martin, and Calso talk with Kalle Rosenbaum, author of this season's book club pick, "Bitcoin begreifen" ("Grokking Bitcoin"). With this English-language interview we close out the book club for this season. By and with Jan-Paul, Chris, Martin, Calso, and Kalle Rosenbaum. You can leave us a tip via Lightning here: ⚡️ tip@tip.nodesignal.space If you like our work, you can also listen to our episodes on Podcasting 2.0 platforms such as Breez, Fountain, or Castamatic and leave us a small token of appreciation there. Thanks to everyone who supports the Bitcoin community with their donations! Among other things, these donations fund our bounty program, through which you can earn a reward for contributing to a project. -------------- Block height: 767260 Nodesignal book club series Kalle on Twitter Kalle on Mastodon -------------- For feedback and further discussion, feel free to join the Nodesignal Telegram group and rate us on Spotify and Apple Podcasts. You can also follow us on Twitter or Instagram.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Mesa-Optimizers via Grokking, published by orthonormal on December 6, 2022 on The AI Alignment Forum. Summary: Recent interpretability work on "grokking" suggests a mechanism for a powerful mesa-optimizer to emerge suddenly from an ML model. Inspired By: A Mechanistic Interpretability Analysis of Grokking Overview of Grokking In January 2022, a team from OpenAI posted an article about a phenomenon they dubbed "grokking", where they trained a deep neural network on a mathematical task (e.g. modular division) to the point of overfitting (it performed near-perfectly on the training data but generalized poorly to test data), and then continued training it. After a long time where seemingly nothing changed, suddenly the model began to generalize correctly and perform much better on test data. A team at Anthropic analyzed grokking within large language models and formulated the idea of "induction heads". These are particular circuits (small sub-networks) that emerge over the course of training, which serve clearly generalizable functional roles for in-context learning. In particular, for GPTs (multi-layer transformer networks doing text prediction), the model eventually generates circuits which hold on to past tokens from the current context, such that when token A appears, they direct attention to every token that followed A earlier in the context. (To reiterate, this circuit does not start a session with those associations between tokens; it is instead a circuit which learns patterns as it reads the in-context prompt.) The emergence of these induction heads coincides with the drop in test error, which the Anthropic team called a "phase change". Neel Nanda and Tom Lieberum followed this with a post I highly recommend, the aforementioned A Mechanistic Interpretability Analysis of Grokking. They looked more closely at grokking for mathematical problems, and were impressively able to reverse-engineer the post-grokking algorithm: it had cleanly implemented the Discrete Fourier Transform. (To be clear, it is not as if the neural network had abstractly reasoned its way through higher mathematics; it just found the solution with the simplest structure, which is the DFT.) They also gave a fascinating account of what might be happening behind the curtain as a neural network groks a pattern. In short, a network starts out by memorizing the training data rather than finding a general solution, because the former can easily be implemented with one modification at a time, while the latter requires coordinated circuits. However, once the model reaches diminishing returns on memorization, a bit of regularization will encourage it to reinforce simple circuits that cover many cases: And the natural way to do this is by picking up on regularities in the data - eg, you can memorise modular addition twice as efficiently by recognising that x+y=y+x. Once these circuits emerge, gradient descent quickly replaces the memorized solution with them. The training error decreases ever-so-slightly as the regularization penalty gets lower, while the test error plummets because the circuit generalizes there. To make an analogy: What Grokking Feels Like From the Inside You're an AI being trained for astronomy. Your trainers have collected observations of the night sky, for a millennium, from planet surfaces of ten thousand star systems.
Over and over, they're picking a system and feeding you the skies one decade at a time and asking you to predict the next decade of skies. They also regularize you a bit (giving you tiny rewards for being a simpler AI). Eventually, you've seen each star system enough times that you've memorized a compressed version of their history, and all you do is identify which system you're in and then replay your memories of it. For instance, within the first few visits to our system, ...
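Returning to the induction-head mechanism described above, here is a minimal sketch (my own, not from the Anthropic paper) of the behavior as plain code: when the current token occurred earlier in the context, predict whatever followed it then.

```python
# Induction-head-style prediction: find the most recent earlier occurrence of
# the current token and copy the token that followed it.
def induction_predict(tokens):
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan backwards through context
        if tokens[i] == current:
            return tokens[i + 1]               # what followed it last time
    return None                                # token never seen before

context = ["the", "cat", "sat", "on", "the", "mat", ";", "the"]
print(induction_predict(context))              # -> "mat"
```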
Today our guest postponed, so the gang briefly groks Keurig Dr Pepper's $50m investment in non-alc brewer Athletic Brewing. Note: Video feeds on Spotify and at www.youtube.com/beernetradio About: Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests once a week as they grok the industry issues of the day. -Podcast Hotline: Text or call Harry in confidence at: (262) 345-2501 -Audio/Video feeds at https://linktr.ee/beernet -Articles referenced at beernet.com -Twitter / IG / TikTok: @beerbizdaily --- Send in a voice message: https://anchor.fm/beernetradio/message
What is Grokking? On today's episode we'll answer the question: "What is Grokking?" You may have heard the word "grok" before but may not have known what it meant. Grokking is used by different people and in different contexts. Usually, people use the term to convey thorough understanding of and interaction with a game or system. But the origins of the word, how it's used, and its application to games and games-based learning are diverse and fascinating. If you liked this episode please consider commenting, sharing, and subscribing. Subscribing is absolutely free and ensures that you'll get the next episode of Experience Points delivered directly to you. I'd also love it if you took some time to rate the show! I live to lift others with learning. So, if you found this episode useful, consider sharing it with someone who could benefit. Also make sure to visit University XP online at www.universityxp.com. University XP is also on Twitter @University_XP and on Facebook and LinkedIn as University XP. Also, feel free to email me anytime at dave@universityxp.com. Game on! Get the full transcript and references for this episode here: https://www.universityxp.com/podcast/71 Support the show
Today the gang groks Boston Beer's recent pivot to lighter-tasting products -- zigging where others zag -- and the nature of malt/sugar seltzers versus spirits-based RTDs and how brands are jockeying for position between the two, plus a quick discussion on what the TTB/IRS raid(s) on Southern Glazer's means. Note: Video feeds on Spotify and YouTube at: https://www.youtube.com/watch?v=LGL_lMAGZ5s Beer Summit info: https://beernet.com/beer-summit/ About: Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests once a week as they grok the industry issues of the day. -Podcast Hotline: Text or call Harry in confidence at: (262) 345-2501 -Audio/Video feeds at https://linktr.ee/beernet -Articles referenced at beernet.com -Twitter / IG / TikTok: @beerbizdaily --- Send in a voice message: https://anchor.fm/beernetradio/message
Today Jordan and Harry sit down right after the Mark Anthony Brands 2023 distributor meeting in the lobby of the Sheraton Grand in Chicago to talk through our impressions of their new innovations. Note: The audio is poor, but we've included subtitles in the video. Watch the video on YouTube: https://www.youtube.com/watch?v=LlB0dArHJmA ---About--- Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests once a week as they grok the industry issues of the day. -Podcast Hotline: Text or call Harry in confidence at: (262) 345-2501 -Audio/Video feeds at https://linktr.ee/beernet -Articles referenced at beernet.com -Twitter / IG / TikTok: @beerbizdaily --- Send in a voice message: https://anchor.fm/beernetradio/message
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Paper+Summary: OMNIGROK: GROKKING BEYOND ALGORITHMIC DATA, published by Marius Hobbhahn on October 4, 2022 on LessWrong. There is a new paper by Liu et al. that claims to have understood the key mechanism underlying grokking (potentially even generalization more broadly). They argue: 1. Grokking can be explained via the norm of the weights. They claim that there is a constant level of the weight norm that is optimal for generalization. 2. If there is an optimal level of the weight norm, the weight norm of your model after initialization can be either too low, too high, or optimal. They claim that grokking is a phenomenon where we initialize the model with a large weight norm, which then slowly walks toward the optimal weight norm, at which point the model generalizes. 3. They also claim that you can get the same results as grokking, but much faster, if you set the weight norm correctly at every step. 4. They set the norm "correctly" by rescaling the weights after each unconstrained optimization step (so after every weight update, loss.backward()?!). Implications: I think they have found a very important insight into grokking and into finding generalizing circuits more broadly. I'm still a bit skeptical of some of the claims and results. On some level "just fix your weight norm and the model generalizes" sounds too simple to be true for all tasks. I think this result could have big implications but I'm not yet sure whether they are positive or negative. On the one hand, finding generalizing circuits seems to solve some of the problems associated with bad out-of-distribution generalization. On the other hand, it likely speeds up capabilities. I'm very unsure about this paper but intuitively it feels important. Thoughts? Thanks for listening. To help us out with The Nonlinear Library or to learn more, please visit nonlinear.org.
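A sketch of what "rescaling the weights after each unconstrained optimization step" could look like in PyTorch (my reading of the summary, not the paper's code; the natural place for the rescaling is after optimizer.step(), i.e. after the weights have actually moved, rather than after loss.backward()):

```python
# Constrained optimization by projection: take an ordinary optimizer step,
# then rescale all trainable weights so their overall L2 norm stays at a
# chosen target value.
import torch

def project_to_norm(model, target_norm):
    params = [p for p in model.parameters() if p.requires_grad]
    total = torch.sqrt(sum((p ** 2).sum() for p in params))
    with torch.no_grad():
        for p in params:
            p.mul_(target_norm / (total + 1e-12))

model = torch.nn.Linear(10, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 10)

loss = torch.nn.functional.mse_loss(model(x), y)
opt.zero_grad()
loss.backward()
opt.step()
project_to_norm(model, target_norm=3.0)   # after every unconstrained step
```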
Yeah, I know that all of the so-called news shows here in the US are all in on the death of the Queen, but she died, she'll remain dead, Charles is the King, and that's about all I have to say about that. I could say a few other things, but it would be counter-productive, and we've got more important things to deal with. Like WTF is with this so-called judge, installed by the former guy, who acceded to his call for a "special master" to derail the DOJ's investigation into TFG for stealing classified information on his way out of the White House. Today we'll meet a new guest of the show. Teri Kanefield is an author, teacher, appellate defense attorney, and someone who's taken the facts of the Stolen Documents caper, organized them, and made them easy to read, understand, and digest on her blog at https://terikanefield.com and on Twitter (@Teri_Kanefield). Perhaps we'll be able to cut through some of the noise trying to drown out the truth.
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Grokking "Semi-informative priors over AI timelines", published by anson.ho on June 12, 2022 on The Effective Altruism Forum. Notes: I give visual explanations for Tom Davidson's report, Semi-informative priors over AI timelines, and summarise the key assumptions and intuitions. The diagrams can be found here – you can click on the boxes to get linked to the part of the report that you're interested in. Thanks to the Epoch team for feedback and support! Thanks especially to Jaime Sevilla and Tom Davidson for providing detailed feedback. Executive Summary The framework in Semi-informative priors over AI timelines assumes a model of AGI development which consists of a sequence of Bernoulli trials, i.e. it treats each calendar year as a "trial" at building AGI with constant probability p of succeeding. However, we don't know what this value of p is, so we use a generalisation of Laplace's rule of succession to estimate P(AGI next year | no AGI yet). This is done by specifying a first-trial probability, the probability of successfully building AGI in the first year of AI research, together with the number of virtual successes, which tells us how quickly we should update our estimate for P(AGI next year | no AGI yet) based on evidence. The framework leans very heavily on the first-trial probability, which is determined using a subjective selection of reference classes (more here). How much evidence we get depends on the number of trials that we see, which depends on the regime start-time – you can think of this as the time before which failure to develop AGI doesn't tell us anything useful about the probability of success in later trials. For instance, we might think that 1956 (the year of the Dartmouth Conference) was the first year where people seriously started trying to build AGI, so the absence of AGI before 1956 isn't very informative. If we think of each trial as a calendar year, then there have been 2021-1956 = 65 trials since the regime start-time, and we still haven't developed AGI, so that's 65 failed trials which we use to update P(AGI next year | no AGI yet), where "next year" now corresponds to 2022 rather than 1957. But why should a trial correspond to a calendar year? The answer is that it doesn't have to! In total, Davidson considers three candidate trial definitions: Calendar-year trials: 1 trial = 1 calendar year Compute trials: 1 trial = a 1% increase in the largest amount of compute used to develop an AI system to date Researcher-year trials: 1 trial = a 1% increase in the total researcher-years so far If we extend this reasoning, then we can predict the probability that AGI is built X years into the future. Davidson does this to predict P(AGI by 2036 | no AGI yet). The idea is that this framework only incorporates a small amount of information based on observational evidence, giving "semi-informative priors" over AI timelines. This framework is shown in more detail in the linked diagrams. Since Davidson uses three different trial definitions, we actually get three of these diagrams!
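To make the update rule concrete, a back-of-the-envelope sketch (the formula follows the Beta-prior reading of the summary above: with first-trial probability f and N_v virtual successes, P(AGI next year | no AGI yet) after t failed trials is N_v / (N_v / f + t); the parameter values here are illustrative, not Davidson's):

```python
# Calendar-year trials with a generalized Laplace's rule of succession.
f, n_virtual = 1 / 300, 1            # illustrative first-trial prob. and N_v
regime_start, now, horizon = 1956, 2021, 2036

p_no_agi = 1.0
for year in range(now, horizon):
    t = year - regime_start                    # failed trials so far
    p_next = n_virtual / (n_virtual / f + t)   # P(AGI next year | no AGI yet)
    p_no_agi *= 1 - p_next

print(f"P(AGI by {horizon} | no AGI yet) ~ {1 - p_no_agi:.1%}")
```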
All in all, Davidson uses this to get a central estimate of P(AGI by 2036 | no AGI yet) = 8%; the report plots the full cumulative probability function. Motivation One way of forecasting AI Timelines is to consider the inner workings of AI, guess what kinds of developments are the most important, and then generate a probability distribution over when Artificial General Intelligence (AGI) will be developed. This is the approach taken by Ajeya Cotra in Forecasting TAI with biological anchors, a really detailed draft report that draws an analogy to the human brain to forecast when Transformative AI (TAI) will first be developed. Tom Davidson's report, Semi-informative priors over AI timelines, is also a detailed report forecasting AI timelines, but it takes a different approach to Cotra's repor...
Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Grokking "Forecasting TAI with biological anchors", published by Anson Ho on June 6, 2022 on The AI Alignment Forum. Notes: I give a visual explanation of Ajeya Cotra's draft report, Forecasting TAI with biological anchors (Cotra, 2020), summarising the key assumptions, intuitions, and conclusions. The diagrams can be found here – you can click on some of the boxes to get linked to the part of the report that you're interested in. Thanks to Michael Aird, Ashwin Acharya, and the Epoch team for suggestions and feedback! Special thanks to Jaime Sevilla and Ajeya Cotra for detailed feedback. Executive Summary Click here to skip the summary Ajeya Cotra's biological anchors framework attempts to forecast the development of Transformative AI (TAI) by treating compute as a key bottleneck to AI progress. This lets us focus on a concrete measure (compute, measured in FLOP) as a proxy for the question "when will TAI be developed?" Given this, we can decompose the question into two main questions: 2020 training compute requirements: How much compute will we need to train TAI, using 2020 Machine Learning architectures and algorithms? Affordability of compute: How likely is it that we'll be able to afford the compute required to train TAI in a particular year? The second question can be tackled by turning to existing trends in three main factors: (1) algorithmic progress, e.g. improved algorithmic efficiency, (2) decreasing computation prices, e.g. due to hardware improvements, and (3) increased willingness to spend on compute. The first question is significantly trickier. Cotra attempts to answer it by treating the brain as a "proof of concept" that the "amount of compute" used to "train" the brain can train a general intelligence. This lets her relate the question "how much compute will we need to train TAI?" to the question "how much 'compute' was used to 'train' the human brain?". However, there's no obvious single interpretation for the latter question, so Cotra comes up with six hypotheses for what this corresponds to, referring to these hypotheses as "biological anchors" or "bioanchors": Evolution anchor: Compute to train TAI = Compute performed over evolution since the first neurons Lifetime anchor: Compute to train TAI = Compute performed by the human brain when maturing to an adult (0 to 32 years old) Three neural network anchors: Anchor to the processing power of the human brain, and to empirical parameter scaling laws. Technically there are three of these, corresponding to short, medium, and long "effective horizon lengths" – the amount of data required to determine whether or not a perturbation to the AI system improves or worsens performance Genome anchor: Anchor to the processing power of the human brain, set the number of parameters = number of bytes in the human genome, and extrapolate the amount of training data required using the same empirical scaling laws mentioned above and assuming a long horizon length (one "data point" = multiple years) In calculating the training compute requirements distribution, Cotra places 90% weight collectively across these bioanchor hypotheses, leaving 10% to account for the possibility that all of the anchors significantly underestimate the required compute.
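A toy sketch (my own simplification, with made-up numbers rather than Cotra's estimates) of the framework's two-part structure: put a wide distribution over the log10 training FLOP required for TAI, model affordable FLOP as growing over time, and read off P(TAI by year Y) as the probability that the requirement falls below what is affordable in Y.

```python
# Log-FLOP requirement ~ Normal(34, 3) in orders of magnitude (illustrative);
# affordable compute starts at ~1e24 FLOP in 2020 and grows 0.5 OOM per year.
from statistics import NormalDist

norm = NormalDist()
median_oom, sigma_oom = 34, 3
flop_2020_oom, growth_per_year = 24, 0.5

for year in (2030, 2040, 2050):
    affordable_oom = flop_2020_oom + growth_per_year * (year - 2020)
    p = norm.cdf((affordable_oom - median_oom) / sigma_oom)
    print(f"P(TAI by {year}) ~ {p:.0%}")
```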
Here's a visual representation of how Cotra breaks down the question "How likely is the development of TAI by a given year?": The above was essentially a summary of Cotra's factorization of the question of AI timelines; for a summary of her key findings, see here. Motivation One of the biggest unresolved debates in AI Safety is the question of AI Timelines – when will Transformative AI (TAI) be developed? In 2020, Ajeya Cotra released a draft report, Forecasting TAI with biological anchors (Cotra, 2020), that aims to answer this question. It's over 200 pages long including the a...
In today's episode Mohamad Lawand helps us try and grok gRPC. gRPC is a modern open source high performance Remote Procedure Call (RPC) framework that can run in any environment. That description doesn't really demystify what gRPC can do, so Mohamad helps us fill in some of the gaps. We find out how gRPC works in .NET, how it works with HTTP, how it is different from REST, and some good use cases for using it in your applications. gRPC is used by companies like Netflix and Square and is supported by the Cloud Native Computing Foundation. Have you used gRPC in your applications? Let us know on Twitter at @dotnet_Podcast. (https://twitter.com/dotNET_Podcast) Sponsors Top End Devs (https://topenddevs.com/) Coaching | Top End Devs (https://topenddevs.com/coaching) Links Getting Started with gRPC and .NET (https://www.infoq.com/articles/getting-started-grpc-dotnet/?topicPageSponsorship=e4f7a6e7-0225-492c-adca-2bdf6f0012f3) gRPC (https://grpc.io/) HTTP/2 protocol (https://caniuse.com/http2) Envoy Proxy (https://www.envoyproxy.io/) Mohamad Lawand (https://mohamadlawand.com/#home) Picks Caleb- Lost Ark - Free to Play MMO Action RPG (https://www.playlostark.com/en-us) Shawn- Reacher (https://www.imdb.com/title/tt9288030/) Wai- Pragmatic Works (https://www.youtube.com/channel/UC5CugyvTdOloiuTc9nN09TA) Special Guest: Mohamad Lawand.
Want a practical guide to implementing and using continuous delivery in your software projects? In this episode, Christie Wilson, a software engineer at Google and author of Grokking Continuous Delivery, shares insight from her soon-to-be-released book. Discover a complete overview of all the pieces of a CD pipeline (automation, testing, best practices) and learn how to fit them together for new and legacy codebases. Listen up!
Dr. Svoboda shares the cosmic and internal lessons to be learned from the layered Indian myth – The Churning of the Ocean of Milk. In this special episode of Living With Reality, Dr. Svoboda, through the blessing of his mentor, the Aghori Vimalananda, invites us into a lively telling of the richly layered Indian myth: The Churning of the Ocean of Milk. Grokking this sweet and useful story from both the external cosmic vantage, as well as the microcosmic internal perspective, Dr. Svoboda lifts the veil to a holographic understanding of this astronomy-steeped foundational Vedic story. Join Dr. Svoboda's Online Courses & Satsangha for teachings on Ayurveda, Kundalini, Sadhana, The Ramayana, & The Greatness of Saturn. Dive into Paula Crossfield's new podcast and ongoing offerings in Astrology, Ayurveda, Business, & more, at WeaveYourBliss.com
- Travel in the Covid era - Aegean Airlines impressions and business class - Civil registry certificates at consulates, and the e-registry at last - [Wordle](https://www.powerlanguage.co.uk/wordle/) - Tips for the system design interview - [Grokking the Advanced System Design Interview](https://www.educative.io/courses/grokking-adv-system-design-intvw) - [Microservices Patterns - Book](https://www.amazon.co.uk/Microservice-Patterns-examples-Chris-Richardson/dp/1617294543) - [Monolith to Microservices - Book](https://www.oreilly.com/library/view/monolith-to-microservices/9781492047834/) - [Fundamentals of Software Architecture - Book](https://www.oreilly.com/library/view/fundamentals-of-software/9781492043447/) - Unfortunately, all too often the answer to everything is - Kafka ... #not
In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. 2022: Alethea Power, Yuri Burda, Harrison Edwards, I. Babuschkin, Vedant Misra https://arxiv.org/pdf/2201.02177v1.pdf
#grokking #openai #deeplearning Grokking is a phenomenon in which a neural network suddenly learns a pattern in the dataset, jumping from random-chance generalization to perfect generalization. This paper demonstrates grokking on small algorithmic datasets where a network has to fill in binary tables. Interestingly, the learned latent spaces show an emergence of the underlying binary operations that the data were created with. OUTLINE: 0:00 - Intro & Overview 1:40 - The Grokking Phenomenon 3:50 - Related: Double Descent 7:50 - Binary Operations Datasets 11:45 - What quantities influence grokking? 15:40 - Learned Emerging Structure 17:35 - The role of smoothness 21:30 - Simple explanations win 24:30 - Why does weight decay encourage simplicity? 26:40 - Appendix 28:55 - Conclusion & Comments Paper: https://mathai-iclr.github.io/papers/... Abstract: In this paper we propose to study generalization of neural networks on small algorithmically generated datasets. In this setting, questions about data efficiency, memorization, generalization, and speed of learning can be studied in great detail. In some situations we show that neural networks learn through a process of “grokking” a pattern in the data, improving generalization performance from random chance level to perfect generalization, and that this improvement in generalization can happen well past the point of overfitting. We also study generalization as a function of dataset size and find that smaller datasets require increasing amounts of optimization for generalization. We argue that these datasets provide a fertile ground for studying a poorly understood aspect of deep learning: generalization of overparametrized neural networks beyond memorization of the finite training dataset. Authors: Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin & Vedant Misra Links: TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick YouTube: https://www.youtube.com/c/yannickilcher Twitter: https://twitter.com/ykilcher Discord: https://discord.gg/4H8xxDF BitChute: https://www.bitchute.com/channel/yann... Minds: https://www.minds.com/ykilcher Parler: https://parler.com/profile/YannicKilcher LinkedIn: https://www.linkedin.com/in/ykilcher BiliBili: https://space.bilibili.com/1824646584
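To see the phenomenon these two entries describe, here is a minimal PyTorch sketch of the paper's setting (not the authors' code; the paper uses a small transformer and different hyperparameters, so everything below is illustrative): train a small network on half of a modular-addition table with weight decay, and watch validation accuracy jump long after training accuracy saturates.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 97  # modulus for the binary operation (a + b) mod p

# The full operation table: all (a, b) pairs and their labels.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

perm = torch.randperm(len(pairs))
train_idx, val_idx = perm[: len(perm) // 2], perm[len(perm) // 2 :]

# A small MLP over embedded operands (the paper uses a small transformer).
model = nn.Sequential(
    nn.Embedding(p, 64),
    nn.Flatten(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, p),
)
# Weight decay is the knob the video dwells on: it pushes the network
# from the memorizing solution toward the simpler, generalizing one.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

for step in range(50_000):  # grokking can appear long after overfitting
    opt.zero_grad()
    logits = model(pairs[train_idx])
    loss = loss_fn(logits, labels[train_idx])
    loss.backward()
    opt.step()
    if step % 5_000 == 0:
        with torch.no_grad():
            tr = (logits.argmax(-1) == labels[train_idx]).float().mean()
            va = (model(pairs[val_idx]).argmax(-1) == labels[val_idx]).float().mean()
        print(f"step {step:6d}  train acc {tr:.2f}  val acc {va:.2f}")
```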
”This book takes an impossibly broad area of computer science and communicates what working developers need to understand in a clear and thorough way.” - David Jacobs, Product Advance Local Key Features Master the core algorithms of deep learning and AI Build an intuitive understanding of AI problems and solutions Written in simple language, with lots of illustrations and hands-on examples Creative coding exercises, including building a maze puzzle game and exploring drone optimization About The Book “Artificial intelligence” requires teaching a computer how to approach different types of problems in a systematic way. The core of AI is the algorithms that the system uses to do things like identifying objects in an image, interpreting the meaning of text, or looking for patterns in data to spot fraud and other anomalies. Mastering the core algorithms for search, image recognition, and other common tasks is essential to building good AI applications. Grokking Artificial Intelligence Algorithms uses illustrations, exercises, and jargon-free explanations to teach fundamental AI concepts. You'll explore coding challenges like detecting bank fraud, creating artistic masterpieces, and setting a self-driving car in motion. All you need is the algebra you remember from high school math class and beginning programming skills. What You Will Learn Use cases for different AI algorithms Intelligent search for decision making Biologically inspired algorithms Machine learning and neural networks Reinforcement learning to build a better robot This Book Is Written For Software developers with high school–level math skills. About the Author Rishal Hurbans is a technologist, startup and AI group founder, and international speaker. Table of Contents 1 Intuition of artificial intelligence 2 Search fundamentals 3 Intelligent search 4 Evolutionary algorithms 5 Advanced evolutionary approaches 6 Swarm intelligence: Ants 7 Swarm intelligence: Particles 8 Machine learning 9 Artificial neural networks 10 Reinforcement learning with Q-learning
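For a taste of the book's evolutionary-algorithms chapters, here is a toy genetic algorithm (illustrative only, not code from the book): it evolves random bitstrings toward an all-ones target through selection, crossover, and mutation.

```python
import random

random.seed(0)
TARGET_LEN, POP, GENERATIONS = 32, 50, 200

def fitness(bits):
    return sum(bits)  # count of ones; the all-ones string is optimal

def crossover(a, b):
    cut = random.randrange(1, TARGET_LEN)  # single-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(TARGET_LEN)] for _ in range(POP)]
for gen in range(GENERATIONS):
    population.sort(key=fitness, reverse=True)
    if fitness(population[0]) == TARGET_LEN:
        break
    parents = population[: POP // 2]  # truncation selection
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(POP - len(parents))]
    population = parents + children

print(f"best fitness after {gen} generations:", fitness(population[0]))
```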
--- Panel ---Harry is joined by Jenn Litz-Kirk and Jordan Driggers to decipher what's going on with the top ten malt seltzers.---Time-Stamped Contents ---00:00 Intro - harry is dramatically in emo clothes and wearing blue blockers for no reason. 02:30 jenn recreates the plot of the Jerk. 02:50 harry sings song, nacho libre, etc. 04:20 Top ten seltzer pop quiz. 10:45 hot prices for labor day. 12:50 highlights from our fall distrib survey. 14:20 most salespeople's time spent dealing with out-of-stocks. 15:30 jordan and rib going to Spokane. 16:45 jenn hanging at home, no more frisbee golf. makes herself known on the course. 18:00 jenn thinks harry is jealous because he doesn't work out. 19:00 wrap-up.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests (almost) every day as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
--- Panel ---Harry is joined by Wine & Spirits Daily editors Sarah Barrett and Hana Kruger, along with beer editor Jordan Driggers, to discuss the latest in the W&S distribution world. ---Time-Stamped Contents ---00:00 needless banter about tech difficulties, Biscuit (the dog harry obsesses over) had cbd oil and is feeling the vibes. 04:45 industry trying to define these new bev-alc rtd's. 08:00 do rtd's play on-premise? harry says no, sarah says yes in short term. 09:25 jordan says seltzers don't mix well with food. so true. 09:60 grokking seltzer's fall, but hana says still popular. 11:00 seltzers all about latest hot flavor or variety pack. 11:40 jordan doubles down on wc variety pack 3. 12:40 sarah: rtd's expanding outlets. 13:30 harry goes on rant about trucks and delivery. 19:20 RNDC going into all control states, direct threat to S-G. 21:25 Montgomery county and boss hogg.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
--- Panel ---Harry is joined by bbd editors Jenn and Jordan.---Time-Stamped Contents ---00:00 Intro - bts talking canna-beverages coverage. 01:40 harry is dressed like a rodeo clown, and we don't know why. 02:30 bts: jenn interviews bill newlands of constellation. 03:13 upside for modelo especial. 03:40 jenn describes modelo as top brand in many non-california markets. 04:10 modelo is so big now, how much runway could it have? 05:30 modelo 50/50 hispanic vs general market, jim sabia predicted it. 06:20 pacifico is starting to break out and shine. 06:40 constellation is its own biggest competitor in mex beer. 08:00 corona is their go-to brand with line extensions/innovations. 08:25 beer the big performer compared to wine/spirits at stz. 09:40 jordan simulates the modelo ad theme song, risking us being cited for copyright infringement. 10:15 talking about pod intro music. 13:00 jenn gives us a sneak peek at the rest of the interview with stz's bill newlands. 15:00 bts with bbd. 15:15 brewer distrib sales meetings going forward, some hybrid.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
--- Panel ---Harry is joined by CM Profit Group's Tom Fox, along with Jenn and Jordan, to talk about all things chain sales.---Time-Stamped Contents ---00:00 intro - hard mtn dew not very similar to soft version. 00:50 will it compete more with 4 loko or twisted tea? 02:05 why low abv? likely due to issues of crossover from non-alc to alc. 02:50 it's a big deal and small deal: small in that it's one brand, big in that it brings a new big distrib competitor to the cold box. 03:30 jordan: it's the opposite of what coke did with topo. 04:10 pepsi had choice to be producer or distributor, wisely chose to be distributor. 04:50 jenn doesn't think it'll blow the doors off. 05:35 harry thinks it's big deal structurally. 06:30 beer distribs who are also pep/coke bottlers have edge. 07:05 TOM FOX joins to talk chains. 08:45 boston taking marketing, pepsi taking sales. why??? 09:40 why pep chose boston beer? cheaper options. burwick connexion. 10:40 how will pepsi get mouse access in bev-alc cold box? 11:00 tom: likely have lotsa communication between two. 12:05 pepsi serious about learning about bev-alc, bigger than mtn dew. pepsi new bev-alc distrib. 12:40 what about retail pricing? mainstream, so vol accretive to boston (but not margins). 14:00 offering distribs on-premise ain't much of an offer. 14:30 harry: is this like inbev embedding themselves with a-b to learn the business in early oughts? 15:00 boston ranks tops in tom's on and off-premise chain surveys. 15:40 new belgium on the map now with chains. 16:00 constellation hitting hard too. 16:30 a-b still gets top marks, most mentions. molson coors right after. 17:40 chains loving virtual vs in-person. 19:30 distribs loving it too. 21:30 suppliers' performance on vs off-premise chain differences. 22:00 scores haven't improved even though supplier performance has, because retailer expectations also grow. 23:30 buyers are mba's except for a few OG goats. suppliers playing ketchup. 24:10 suppliers and retailers more aligned today. 25:05 on the other hand, retailers want more tailored solutions, to "be understood". 26:05 craft brewers stepping up their game: nbb, others. 28:05 smarter to start with regional chains or go straight to bentonville? 29:20 constellation alternative to a-b and molson coors for cat captaincy. 31:35 tom: seltzers are overspaced. 32:50 assortment has returned as a big priority after pan year. 34:40 jordan asks about mc sku reduction and retailers' reaction. 37:55 tom describes their digital e-sales training platform at fox sales coaching.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
--- Panel ---Harry is joined by Jenn Litz-Kirk, Sarah Barrett, and Jordan Driggers---Time-Stamped Contents ---00:00 Intro - Gopuff model. 02:00 They have the capital to win the prize. 02:15 Harry takes us down milkman and diaper delivery lane. 03:35 So many models: tip of the iceberg. 04:45 It will all end up being Amazon in the end. 05:45 Favor was Harry's first delivery app experience, at Jenn's wedding in 2014. 07:00 Jordan doesn't think tobacco is healthy. 07:40 Harry disagrees. 08:20 Sarah groks Gallo's High Noon. 12:15 Jordan wonders what the demo is for HN at that big price point. 13:10 Making flavors unique is the game. 14:00 A-B's Neutral is their answer to High Noon. 16:10 Shortage of sixtel kegs. 18:20 Hot girl summer has breached America's airports. 19:40 For the rest of the pod Jenn speaks into a megaphone.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
Harry is joined by Jenn and Jordan.---Time-Stamped Contents ---00:00 Intro - Biscuit be patrollin', they be hatin'. 01:00 Grokking A-B's plans to sell its Hawaii distribution branch to non-AB house Odom. 11:45 Former MolsonCoors CMO Andy England showing his British moxie using the same slogan for vodka brand.
--- Panel ---Harry is joined by Jenn Litz-Kirk, Jordan Driggers, and Chanel Gulf Coast---Time-Stamped Contents ---00:00 Intro - Jessica won our company seltzer cannonball contest (yes we are all over 21). 02:20 Grokking cyber-security and ransomware issues in the beer business. 06:00 Grokking the future of cannabis in the bev-alc space. 07:50 How California screwed the pooch on cannabis taxation and regulation. 10:10 Conquering the last mile is crucial, 50% of canna trade. 12:40 Does FL have a commerce clause issue with their cannabis structure? 13:20 Harry tries to convince everybody he's a legal expert. 14:45 Bezos, Mars, and earth. 18:00 Athletic building another brewery on the East Coast. 20:10 Hot Girl Summer update. 21:30 Introducing Uncle Paul Fall. 22:50 Harry reveals he likes "Call Her Daddy" podcast. 23:00 Save the date: Beer Summit at the Hotel Del in San Diego January 30, 2022. Sponsorships available, contact millicent@beernet.com (Chanel Gulf Coast).---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
--- Panel ---Harry is joined poolside by editors Jenn Litz-Kirk and Jordan Driggers.---Time-Stamped Contents ---00:00 Intro - Pod'ing by the pool. 01:00 Introducing the SchuhPub Crime Wagon (tm). 02:10 Dog initially ejected, but now has the hang of it. 02:25 It is a "Crime Wagon", to prevent crimes. 02:50 Black Rifle Coffee gets fair write-up in New York Times Magazine. Harry's family ranch Guajolote gets mentioned. 03:20 Jenn talks about Bang Energy's plans for bev-alc seltzer rollout. 04:05 Rolling out new markets, focus on military. 05:50 Harry comments on Biden Administration's targeting of bev-alc for trade practices and antitrust. 08:00 Grokking the Boston Beer / Beam Suntory deal. 09:05 Harry has ideas for Bud Ice. 09:35 JD thinks this deal muddies waters. 11:30 Does consumer care about the base? 11:45 Jenn says this deal is a litmus test to see. 12:00 Are distribs psyched? 12:40 Jenn points out Cayman Jack is already huge. 14:20 BBC beer distribs sitting prettier than Beam/W&S distribs on this deal. 15:20 Hard to make good whiskey RTD. 16:00 This deal feels like a big test. 17:20 Starting to plan for Beer Summit 2021 at the Hotel Del Coronado. 18:00 Jenn falls for Harry's Millennial bait. 20:00 Distributor Productivity Summit on Monday, last chance. 20:30 Harry wants to start making money on the pod by selling t-shirts to recoup losses from cargo bike. 21:00 Pods by the pool at Hotel Emma next week.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text or call Harry in confidence at: (262) 345-2501-Audio/Video feeds at HarrysPod.com-Articles referenced at beernet.com-Twitter / IG / TikTok: @beerbizdaily
Our first remote/mobile pod from bustling Vegas. It's a short 11 minutes. ---Panel---Harry is joined by editors Jenn Litz-Kirk and Jordan Driggers ---Contents---00:00 Intro. 03:13 Grokking Brendan.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text comments and topic ideas to Harry at: (262) 345-2501-Audio feed at HarrysPod.com
---Panel---Harry is joined by Wine & Spirits Executive Editor Sarah Barrett for a special edition pod.---Contents---00:00 Intro. 01:00 Delicato acquires Coppola, what do it mean? 03:23 Harry tries to flex and compares himself to Bill Murray in Lost in Translation. Sarah checks him. 04:45 Is it believable that Daniel Craig can date 20 year-olds in Bond movies? 05:36 Grokking the challenge to Missouri's in-state vs out-of-state laws. 06:40 Same issue, same players, different decade: Tom Wark and James Tanner still hammering away at three-tier. 08:17 State residency requirements and physical location mandates under attack. 14:46 Will the Supreme Court hear the case? Probably not.---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text comments and topic ideas to Harry at: (262) 345-2501-Audio feed at HarrysPod.com-Video feed at youtube.com/c/beerwriter -Twitter / IG: @beerbizdaily
----PANEL----Jenn Litz-Kirk, Sarah Barrett, Jordan Driggers. 00:00 Intro - Jordan attends Fiesta NIOSA block party. 01:56 Jenn finally got her mic figured out. 03:44 Harry administers an FMB pop quiz to Jordan. 04:40 Ranking FMBs. 09:35 Wine and spirits also having supply problems. 12:37 Jenn groks celebrity-backed FMBs. 16:00 Harry hosting a Molson Coors exec tonight to discuss the charms of San Antonio as a host city for next year's distrib meeting. 18:07 Jenn's ideas for podcast recurring segments. 20:55 More texts from Harry's son. 22:58 More discussion on pod segments. 24:25 Jordan talks about Great Nation Brewing's expansion plans. 26:06 Jordan discusses Reyes' on-premise account issues in SF. 27:35 Truly trends and on-premise seltzer sales. 29:33 Harry picks on Jordan for being a picky seltzer drinker. 30:26 Hot girl summer flavor: watermelon. ---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text comments and topic ideas to Harry at: (262) 345-2501-Audio feed at HarrysPod.com -Twitter / IG: @beerbizdaily
Shipping bev-alc to consumers' doorsteps is the last frontier. We grok the GoPuff model of buying brick-and-mortar retail and pairing it with technology to bridge the last mile.----PANEL----Jenn Litz-Kirk, Sarah Barrett, Jordan Driggers, Gulf Coast----Contents-----00:00 Intro - After Harry makes it needlessly complicated, Sarah offers a simple yet elegant solution for fixing Jenn's mic. 02:14 Victory! Jenn's mic finally works. 04:23 Jenn and Harry bicker about the direction of the podcast at a dinner with a top industry exec. 05:43 Banter vs content: Why can't we have it all? 06:45 A rare apology from Harry. 07:55 GoPuff buys Kentucky's Liquor Barn. 09:13 GoPuff's interesting model of buying liquor retailers instead of partnering with them (like Drizly). 10:46 Jordan brings up the Kentucky Bourbon angle. 12:00 Harry discovers he doesn't understand GoPuff's model as clearly as he thought. Sarah sets him straight. 13:18 First purchase on Favor: Camel Snus. 14:02 Biscuit's eating and sleeping habits. 15:37 Jenn describes a three-tier panel she attended last week, and how w&s suppliers and distribs are in lock-step in supporting tech to solve the DTC issue rather than through the courts. 18:14 Technology solving the decades-long DTC / three-tier fight that the courts couldn't solve. 18:46 The Amazon wildcard. 23:38 The perils of DTC, beer vs wine/spirits: freshness vs provenance. 28:13 The shortages of hot girl summer, and why you don't order raw seafood on Mondays. ---About---Beer Business Daily publisher Harry Schuhmacher joins his editors and other guests every workday as they grok the industry issues of the day. -Text comments and topic ideas to Harry at: (262) 345-2501-Audio feed at HarrysPod.com -Twitter / IG: @beerbizdaily
It's the explanation of capacity utilization that does it for me.
Plus, still working out the logistics of the podcast ..... on the podcast. It's called being efficient. By the way, if you have ideas for the pod, text Harry at (262) 345-2501.
We only had about 15 minutes in a crowded hotel lobby before we were expected at a supplier reception, but Jordan and Harry were able to knock out a quick-and-dirty pod before the weekend regarding the craziness going on in off-premise numbers. We apologize for the poor audio quality (we were in the lobby bar of the Hotel Emma).
2005 was a seminal year. Was 2020 another one?
Science Book Movement. Online review of the book: Grokking Machine Learning. Author: Luis Serrano. Guest: Luis Serrano. Watch the video recording on YouTube: https://youtu.be/LN7-zZGON2o. Join our community on Discord via the following link: https://bookmovement.co/discord. See acast.com/privacy for privacy and opt-out information.
Miguel Morales is a Machine Learning engineer at Lockheed Martin and teaches at Georgia Institute of Technology. This episode starts with a basic explanation of Reinforcement Learning. Miguel then talks through the various methods of implementing and training systems through Reinforcement Learning. We talk algorithms and models and much more… Panel Charles Max Wood Guest Miguel Morales Sponsors Dev Heroes Accelerator Links GitHub | mimoralea/gdrl Manning | Grokking Deep Reinforcement Learning Twitter: mimoralea Email: mimoralea@gmail.com Picks Charles- The Expanse by James S. A. Corey Charles- Dev Heroes Accelerator Miguel- Deep Reinforcement Learning Miguel- YouTube: Dimitri Bertsekas Miguel- RL Course by David Silver - Lecture 1: Introduction to Reinforcement Learning Miguel- Thor (2011)
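For a taste of the tabular methods an introduction like this usually starts from, here is a compact Q-learning sketch (illustrative, not from the book): an agent on a six-state corridor learns that walking right reaches the rewarded goal.

```python
import random

random.seed(0)
N_STATES, GOAL = 6, 5   # states 0..5; reward only for reaching state 5
ACTIONS = (-1, +1)      # action 0 = left, action 1 = right
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma = 0.5, 0.9

for episode in range(300):
    s = 0
    while s != GOAL:
        a = random.randrange(2)  # random behavior policy; Q-learning is
                                 # off-policy, so it still learns greedily
        s2 = min(max(s + ACTIONS[a], 0), N_STATES - 1)
        r = 1.0 if s2 == GOAL else 0.0
        # The Q-learning update: bootstrap from the best next-state value.
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

greedy = ["left" if q[0] >= q[1] else "right" for q in Q[:GOAL]]
print("greedy policy per state:", greedy)  # expect "right" everywhere
```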
Author of Grokking Bitcoin, Kalle Rosenbaum joins Ced to discuss: - How does the Bitcoin network prevent cheating - What is a cryptographic function & is it a defensive weapon of war - Will governments punish their people harshly for using Bitcoin & how bloody and violent will it get - How cryptographic tools & encryption are the only ways to maintain privacy and security during the age of big brother - Hard forks vs soft forks - Backwards compatibility vs forward compatibility - & so much more
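To make "cryptographic function" concrete, here is a toy proof-of-work sketch (illustrative only: real Bitcoin hashes an 80-byte binary header against a numeric target, while this toy checks leading zero hex digits). The asymmetry is the point: finding a valid nonce is expensive, verifying one takes a single hash, which is what makes cheating on the network costly.

```python
import hashlib

def mine(header: bytes, difficulty_zeros: int = 4) -> int:
    """Search for a nonce whose double-SHA-256 digest starts with the
    required number of zero hex digits (a simplified difficulty rule)."""
    target = "0" * difficulty_zeros
    nonce = 0
    while True:
        data = header + nonce.to_bytes(8, "little")
        digest = hashlib.sha256(hashlib.sha256(data).digest()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

nonce = mine(b"toy block header")
print("found nonce:", nonce)  # anyone can verify it with one double hash
```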
Researcher Huw Lloyd, fluent in numerous mental models, is a good explainer of concepts -- including many I was completely unversed in. A few threads run through the entirety of this chat: development, self-awareness, and construing an active orientation to any given situation. SECTION 1: Our pathways to Vygotsky 0:36 - Reflections on Huw's recent "Vygotsky and Parenting" (http://tiny.cc/ymqpsz) 1:45 - Pros and cons of taking scholarly shortcuts 6:07 - Huw's arrival to Vygotsky, in part through dissatisfaction elsewhere SECTION 2: Huw's ideas about Active Orientation 16:38 - What is Active Orientation? 23:30 - Is Active Orientation a practice? (An exercise in self awareness) 31:58 - Active Orientation can be documented (microgenesis research of Huw's) SECTION 3: What is Developmental Education? 40:37 - A primer on Davydov and Developmental Education 46:07 - Empirical thinking vs. Theoretical thinking 50:23 - Grokking the material and the History of ideas 52:59 - Problems are Good 55:40 - An illustrative lesson of Davydov's 1:03:43 - Some key characteristics of developmental education 1:07:41 - Crises, construals, and neoformations 1:10:20 - The Desert Oak: a developmental TRIZ problem SECTION 4: Imagination and Confidence-building 1:16:07 - Imagination, flow, and problem-solving 1:24:10 - Systems and Design Ideas (TRIZ approach) 1:30:22 - Earned, authoritative confidence: Your tempered ideas are become Real 1:35:20 - The importance of problem-construal or framing 1:48:04 - Problem-creating, -solving, and -construing 1:53:51 - This is rich, highly concentrated material (Foundational, generative, "unfoldable" concepts) 1:55:27 - Notational vs. developmental education (and epistemology) 1:57:10 - Final two questions (adult-development & advice for problem-designers) 2:06:20 - Complex vs. complicated (and self-regulation and distance learning) 2:09:46 - An idea for lunch (as promised: http://tiny.cc/bvqpsz) References: http://tiny.cc/5vqpsz - "A Study of Active Orientation" (brief introduction) http://tiny.cc/2xqpsz - "TRIZ: a Powerful Methodology for Creative Problem Solving" http://tiny.cc/dxqpsz - "Going with the Flow: How to Engage Boys (and Girls) in Their Literacy Learning"
The third episode of Konsensuspodden in its new format, as a live pod in collaboration with Safello. Every other Wednesday at 20:00. In today's episode of Konsensuspodden we talk with Kalle Rosenbaum, author of the book "Grokking Bitcoin". We talk about his book, how he found his way to Bitcoin, and how there is always something new to learn about Bitcoin. Links mentioned in the episode: https://www.manning.com/books/grokking-bitcoin https://github.com/kallerosenbaum/grokkingbitcoin http://rosenbaum.se/book/ https://www.adlibris.com/se/bok/grokking-bitcoin-9781617294648 If you want to watch the episode, you can do so on YouTube: https://www.youtube.com/watch?v=6WPsUT3x2_w Hardware wallets: Trezor: https://shop.trezor.io/product/trezor-model-t?offer_id=15&aff_id=1187&tiny_url=1 Ledger: https://shop.ledger.com/pages/ledger-nano-x?r=aa88 Please use our links to support the podcast! Contact info: www.konsensuspodden.se Ludvig: twitter.com/ludvigoeberg Tristan: twitter.com/t4t5 Dea: twitter.com/deabrunner Kalle: https://twitter.com/kallerosenbaum Keywords: Konsensus Konsensuspodden Grokking Bitcoin Kalle Rosenbaum
EP.22 Bryan Eisenberg (Best-selling Author) on Grokking Baseball One Resource at a Time On this episode, Bryan Eisenberg shares the strategies and resources that he has found to help him support his son, and help him grow each day. Bryan discusses the importance of communication and goal setting, and also how he has used to […]
Talk Python To Me - Python conversations for passionate developers
See the full show notes for this episode on the website at talkpython.fm/82.