POPULARITY
In large enterprise software companies, Red and Blue Teams collaborate through Purple Teaming to proactively detect, respond to, and mitigate advanced threats. In this episode of CyberWire-X, N2K's Dave Bittner is joined by Adobe's Justin Tiplitsky, Director of Red Team, and Ivan Koshkin, Senior Detection Engineer, to discuss how their teams work together daily to strengthen Adobe's security ecosystem. They share real-world insights on how this essential collaboration enhances threat detection, refines security controls, and improves overall cyber resilience.
In this episode, Michael, Sarah, and Mark talk to Craig Nelson, VP of the Microsoft Red Team, about how the Red Team works to help secure Microsoft and its customers. In life, there are things you know you know, things you know you don't know, and finally, things you don't know you don't know. This episode is full of the latter. We also cover security news about LLMs and MCP, TLS 1.1 and 1.0 deprecation, Private Endpoint improvements, containers, and more. https://aka.ms/azsecpod
Fred Wilmot, CEO and co-founder of Detecteam, and Sebastien Tricaud, CTO and co-founder, bring a candid and critical take on cybersecurity's detection and response problem. Drawing on their collective experience—from roles at Splunk, Devo, and time spent in defense and offensive operations—they raise a core question: does any of the content, detections, or tooling security teams deploy actually work?

The Detecteam founders challenge the industry's obsession with metrics like mean time to detect or respond, pointing out that these often measure operational efficiency—not true risk readiness. Instead, they propose a shift in thinking: stop optimizing broken processes and start creating better ones.

At the heart of their work is a new approach to detection engineering—one that continuously generates and validates detections based on actual behavior, environmental context, and adversary tactics. It's about moving away from one-size-fits-all IOCs toward purpose-built, context-aware detections that evolve as threats do.

Sebastien highlights the absurdity of relying on static, signature-based detection in a world of dynamic threats. Adversaries constantly change tactics, yet detection rules often sit unchanged for months. The platform they've built breaks detection down into a testable, iterative process—closing the gap between intel, engineering, and operations. Teams no longer need to rely on hope or external content packs—they can build, test, and validate detections in minutes.

Fred explains the benefit in terms any CISO can understand: this isn't just detection—it's readiness. If a team can build a working detection in under 15 minutes, they beat the average breakout time of many attackers. That's a tangible advantage, especially when operating with limited personnel.

This conversation isn't about a silver bullet or more noise—it's about clarity. What's working? What's not? And how do you know? For organizations seeking real impact in their security operations—not just activity—this episode explores a path forward that's faster, smarter, and grounded in reality.

Learn more about Detecteam: https://itspm.ag/detecteam-21686

Note: This story contains promotional content. Learn more.

Guests:
Fred Wilmot, Co-Founder & CEO, Detecteam | https://www.linkedin.com/in/fredwilmot/
Sebastien Tricaud, Co-Founder & CTO, Detecteam | https://www.linkedin.com/in/tricaud/

Resources:
Learn more and catch more stories from Detecteam: https://www.itspmagazine.com/directory/detecteam
Webinar: Rethink, Don't Just Optimize: A New Philosophy for Intelligent Detection and Response — An ITSPmagazine Webinar with Detecteam | https://www.crowdcast.io/c/rethink-dont-just-optimize-a-new-philosophy-for-intelligent-detection-and-response-an-itspmagazine-webinar-with-detecteam-314ca046e634
Learn more and catch more stories from RSA Conference 2025 coverage: https://www.itspmagazine.com/rsac25

______________________

Keywords: sean martin, fred wilmot, sebastien tricaud, detecteam, detection, cybersecurity, behavior, automation, red team, blue team, brand story, brand marketing, marketing podcast, brand story podcast

______________________

Catch all of our event coverage: https://www.itspmagazine.com/technology-and-cybersecurity-conference-coverage
Want to tell your Brand Story Briefing as part of our event coverage? Learn More
We've been in enough conversations to know when something clicks. This one did — and it did from the very first moment.

In our debut Brand Story with White Knight Labs, we sat down with co-founders John Stigerwalt and Greg Hatcher, and what unfolded was more than a company intro — it was a behind-the-scenes look at what offensive security should be.

John's journey is the kind that earns your respect quickly: he started at the help desk and worked his way to CISO, before pivoting into red teaming and co-founding WKL. Greg's path was more unconventional — from orchestral musician to Green Beret to cybersecurity leader. Two very different stories, but a shared philosophy: learn by doing, adapt without a manual, and never take the easy route when something meaningful is on the table.

That mindset now defines how White Knight Labs works with clients. They don't sell cookie-cutter pen tests. Instead, they ask the right question up front: How does your business make money? Because if you can answer that, you can identify what a real-world attacker would go after. Then they simulate it — not in theory, but in practice.

Their ransomware simulation service is a perfect example. They don't just show up with a scanner. They emulate modern adversaries using Cobalt Strike, bypassing endpoint defenses with in-house payloads, encrypting and exfiltrating data like it's just another Tuesday. Most clients fail the test — not because they're careless, but because most simulations aren't this real.

And that's the point.

White Knight Labs isn't here to help companies check a box. They're here to expose the gaps and raise the bar — because real threats don't play fair, and security shouldn't pretend they do.

What makes them different is what they don't do. They're not an all-in-one shop, and they're proud of that. They won't touch IR for major breaches — they've got partners for that. They only resell hardware and software they've personally vetted. That honesty builds credibility. That kind of focus builds trust.

Their training programs are just as intense. Between live DEF CON courses and their online platform, they're giving both new and experienced professionals a chance to train the way they operate: no shortcuts, no watered-down certs, just hard-earned skills that translate into real-world readiness.

Pass their ODPC certification, and you'll probably get a call — not because they need to check a hiring box, but because it proves you're serious. And if you can write loaders that bypass real defenses? You're speaking their language.

This first conversation with John and Greg reminded us why we started this series in the first place. It's not just about product features or service offerings — it's about people who live and breathe what they do, and who bring that passion into every test, every client call, and every training they offer.

We've got more stories with them on the way.
But if this first one is any sign of what's to come, we're in for something special.

Learn more about White Knight Labs:

Guests:
John Stigerwalt | Founder at White Knight Labs | Red Team Operations Leader | https://www.linkedin.com/in/john-stigerwalt-90a9b4110/
Greg Hatcher | Founder at White Knight Labs | SOF veteran | Red Team | https://www.linkedin.com/in/gregoryhatcher2/
White Knight Labs Website | https://itspm.ag/white-knight-labs-vukr

______________________

Keywords: penetration testing, red team, ransomware simulation, offensive security, EDR bypass, cybersecurity training, White Knight Labs, advanced persistent threat, cybersecurity startup, DEF CON training, security partnerships, cybersecurity services

______________________

Resources:
Visit the White Knight Labs Website to learn more: https://itspm.ag/white-knight-labs-vukr
Learn more and catch more stories from White Knight Labs on ITSPmagazine: https://www.itspmagazine.com/directory/white-knight-labs
Learn more about ITSPmagazine Brand Story Podcasts: https://www.itspmagazine.com/purchase-programs
Newsletter Archive: https://www.linkedin.com/newsletters/tune-into-the-latest-podcasts-7109347022809309184/
Business Newsletter Signup: https://www.itspmagazine.com/itspmagazine-business-updates-sign-up
Are you interested in telling your story? https://www.itspmagazine.com/telling-your-story
Bugged boardrooms. Insider moles. Social engineers posing as safety inspectors!? In this Talking Lead episode, Lefty assembles a veteran intel crew—Bryan Seaver, U.S. Army Military Police vet and owner of SAPS Squadron Augmented Protection Services, LLC, a Nashville outfit running dignitary protection, K9 ops, and intelligence training. A *Talking Lead* mainstay! He's got firsthand scoop on "Red Teaming"; Mitch Davis, U.S. Marine, private investigator, interrogator, Phoenix Consulting Group (now DynCorp) contractor, with a nose for sniffing out moles and lies; Brad Duley, U.S. Marine, embassy guard, Phoenix/DynCorp contractor, Iraq vet, deputy sheriff, and precision shooter, bringing tactical grit to the table—to expose the high-stakes world of corporate espionage. They pull back the curtain on real-world spy tactics that were used during the Cold War era and are still used in today's business battles: Red Team operations, honeypots, pretexting, data theft, and the growing threat of AI-driven deception. From cyber breaches to physical infiltrations, the tools of Cold War espionage are now aimed at American companies, defense tech, and even firearms innovation. State-backed actors, insider threats, and corporate sabotage—it's not just overseas anymore. Tune in and get "Leaducated"!!
Joas Santos is a Red Team specialist who brings a practical perspective on thinking about security offensively. We talk about social engineering, penetration testing, threat intelligence, mentoring, and the challenges of building defenses that actually work. A direct conversation with someone on the front lines of cybersecurity in Brazil.
Get featured on the show by leaving us a Voice Mail: https://bit.ly/MIPVM

FULL SHOW NOTES: https://www.microsoftinnovationpodcast.com/672

We dive deep into the world of AI security with Microsoft's Senior Offensive Security Engineer from the AI Red Team, who shares insights into how they test and break AI systems to ensure safety and trustworthiness.

TAKEAWAYS
• Microsoft requires all AI features to be thoroughly documented and approved by a central board
• The AI Red Team tests products adversarially and as regular users to identify vulnerabilities
• Red teaming originated in military exercises during the Cold War before being adapted for software security
• The team tests for jailbreaks, harmful content generation, data exfiltration, and bias
• Team members come from diverse backgrounds, including PhDs in machine learning, traditional security, and military experience
• New AI modalities like audio, images, and video each present unique security challenges
• Mental health support is prioritized since team members regularly encounter disturbing content
• Working exclusively with failure modes creates a healthy skepticism about AI capabilities
• Hands-on experimentation is recommended for anyone wanting to develop AI skills
• Curating your own information sources rather than relying on algorithms helps discover new knowledge

Check out Microsoft Copilot and other AI tools to start experimenting and finding practical ways they can help in your daily work.

This year we're adding a new show to our lineup - The AI Advantage. We'll discuss the skills you need to thrive in an AI-enabled world.

DynamicsMinds is a world-class event in Slovenia that brings together Microsoft product managers, industry leaders, and dedicated users to explore the latest in Microsoft Dynamics 365, the Power Platform, and Copilot. Early bird tickets are on sale now and listeners of the Microsoft Innovation Podcast get 10% off with the code MIPVIP144bff: https://www.dynamicsminds.com/register/?voucher=MIPVIP144bff

Accelerate your Microsoft career with the 90 Day Mentoring Challenge. We've helped 1,300+ people across 70+ countries establish successful careers in the Microsoft Power Platform and Dynamics 365 ecosystem. Benefit from expert guidance, a supportive community, and a clear career roadmap. A lot can change in 90 days, get started today!

Support the show

If you want to get in touch with me, you can message me here on LinkedIn. Thanks for listening.
Every bank and fintech company has a suite of anti-fraud tools that they use to keep the bad guys out. Few tools are 100% effective, however, and often the implementation of these tools, along with their interfaces with other systems, leaves gaps. And the fraudsters will exploit these gaps. So, how do you get a holistic view of your anti-fraud arsenal and discover where these gaps are?

My next guest on the Fintech One-on-One podcast is Jerry Tylman, the co-founder and partner at Greenway Solutions and the founder of their Fraud Red Team. The Fraud Red Team is all about discovering the gaps, where the weaknesses in the anti-fraud systems are. They are 100% focused on financial services, working with many of the largest banks in the country as well as several fintech companies.

In this podcast you will learn:
How Greenway Solutions became focused on financial services.
What a pen test is and the groundbreaking work they do with fraud controls.
The different attack vectors that fraudsters use.
Why banks and fintechs need the services of the Fraud Red Team.
How successful they are in penetrating the fraud detection systems.
How they interact with the anti-fraud providers to banks and fintechs.
An example of a recent test they have done that penetrated anti-fraud systems.
How they tackle the challenge of account onboarding.
Why behavioral technology is a key piece of the puzzle.
How deepfake video and audio are being used by fraudsters.
The fascinating way that the Fraud Red Team works with deepfakes.
Why companies have to completely rethink their internal authentication today.
Some of the fintechs they have worked with recently.
How they work with check fraud and why it is a growing problem.
Why all financial institutions cannot stop investing in anti-fraud tools.

Connect with Fintech One-on-One:
Tweet me @PeterRenton
Connect with me on LinkedIn
Find previous Fintech One-on-One episodes
We pick up the pace for this second episode... on to phishing and the red team! Episodes are recorded live on Twitch and rebroadcast on YouTube before becoming podcasts. Come join a live recording! Hosted by Ausha. Visit ausha.co/politique-de-confidentialite for more information.
Former Trump administration cybersecurity official Sean Plankey has been nominated to lead the Cybersecurity and Infrastructure Security Agency (CISA). His nomination comes amid significant layoffs at the agency, where over 100 employees were let go, including key members of the Red Team responsible for simulating cyberattacks. These cuts raise concerns about CISA's ability to maintain cybersecurity amid ongoing federal budget constraints, potentially leading to increased threats in the private sector as federal infrastructure and intelligence sharing weaken.

In the realm of artificial intelligence, the General Services Administration (GSA) has introduced a custom chatbot named GSAI to automate various government tasks, coinciding with significant job cuts within the agency. While the chatbot aims to enhance efficiency, internal memos have warned employees against inputting sensitive information. This trend reflects a broader movement in the federal government towards tech-driven workforce reductions, raising questions about data privacy and the reliability of AI tools in government operations.

Utah has made headlines by passing legislation requiring App Store operators to verify the ages of users and obtain parental consent for minors downloading apps. This law, aimed at enhancing online safety for children, has garnered support from major tech companies but has also faced criticism regarding potential infringements on privacy rights. The Supreme Court is expected to examine age verification issues, particularly concerning adult content websites, highlighting the ongoing debate over online safety regulations.

The podcast also discusses the competitive landscape of AI, with Google reporting continued growth in search queries despite the rise of ChatGPT. New benchmarks have been developed to measure the honesty of AI models, revealing that larger models do not necessarily correlate with higher honesty rates. As companies like Microsoft and Amazon introduce advanced AI tools, the implications for businesses are significant, emphasizing the need for oversight and governance in AI deployment to mitigate risks associated with inaccuracies and compliance issues.

Three things to know today:
00:00 Cybersecurity Jobs Cut, AI Hired, and Kids Get ID'd—Welcome to the Future of Tech Policy
05:45 ChatGPT Isn't Killing Google Search—And AI Lies More Than You'd Think
08:27 Microsoft and OpenAI: A Rocky Relationship, While AI Prices Tumble

Supported by: https://getflexpoint.com/msp-radio/
Event: https://www.nerdiocon.com/
All our Sponsors: https://businessof.tech/sponsors/

Do you want the show on your podcast app or the written versions of the stories? Subscribe to the Business of Tech: https://www.businessof.tech/subscribe/
Looking for a link from the stories? The entire script of the show, with links to articles, is posted in each story on https://www.businessof.tech/

Support the show on Patreon: https://patreon.com/mspradio/
Want to be a guest on Business of Tech: Daily 10-Minute IT Services Insights? Send Dave Sobel a message on PodMatch, here: https://www.podmatch.com/hostdetailpreview/businessoftech
Want our stuff? Cool Merch? Wear "Why Do We Care?" - Visit https://mspradio.myspreadshop.com

Follow us on:
LinkedIn: https://www.linkedin.com/company/28908079/
YouTube: https://youtube.com/mspradio/
Facebook: https://www.facebook.com/mspradionews/
Instagram: https://www.instagram.com/mspradio/
TikTok: https://www.tiktok.com/@businessoftech
Bluesky: https://bsky.app/profile/businessof.tech
We come back to the other side of the brain dance and find that the crew we had gotten to know are actually the targets. Our new team is coming for them and will squeeze Baron any way they need to, to get it, but types like him have a few hardwire tricks up their sleeves...

Thanks to A Wilhelm Scream for intro music, "Walkin' with Michael Douglas" (more here: https://www.awilhelmscream.com/). Theme song for Red Rising is "Neon Drifter" by Antti Martikainen. All other scores are by Antti Martikainen and Adrian von Ziegler.

Check us out online at www.nastygramrpg.com. Find us on Facebook at www.facebook.com/nastygram and our group is at https://www.facebook.com/groups/865467380821766; we are @nastygramrpg on both Instagram and Twitter, and on TikTok at @nastygram.rpg
Twitter: https://x.com/croco_byte
Twitter: https://x.com/mpgn_x64

Blog posts:
- https://www.synacktiv.com/publications/relaying-kerberos-over-smb-using-krbrelayx
- https://www.synacktiv.com/publications/abusing-multicast-poisoning-for-pre-authenticated-kerberos-relay-over-http-with
- https://www.synacktiv.com/publications/taking-the-relaying-capabilities-of-multicast-poisoning-to-the-next-level-tricking
This week on Hacker And The Fed, former FBI agent Chris Tarbell and ex-black hat hacker Hector Monsegur discuss a shocking backdoor found in healthcare patient monitors linked to China, a major vulnerability in Subaru's Starlink system allowing remote vehicle control, and the ongoing concerns over modern cars collecting unnecessary user data. They also discuss cybersecurity career paths—Blue Team vs. Red Team—and how to build a well-rounded skillset. Plus, plenty of laughs, from muscle car nostalgia to an unexpected debate about pole vs. stripper dancing. Send HATF your questions at questions@hackerandthefed.com.
On Hacking Humans, Dave Bittner, Joe Carrigan, and Maria Varmazis (also host of N2K's daily space podcast, T-Minus) are once again sharing the latest in social engineering scams, phishing schemes, and criminal exploits that are making headlines to help our audience become aware of what is out there. This week, Maria has the story on how the return to office life brings unique security challenges, highlighting the need for Red Team assessments to uncover and address physical and digital vulnerabilities, empowering organizations to proactively enhance workplace security and protect against evolving threats. Joe's story comes from the FCC's warning about a scam dubbed "Green Mirage," where fraudsters impersonate mortgage lenders, spoof caller IDs, and use social engineering to trick financially vulnerable homeowners into sending payments via unconventional methods, often only discovered when foreclosure proceedings begin. Last but not least, Dave's story is on how a Reddit user shared their cautious experiment with a suspected Airbnb scam involving a new account requesting to move to WhatsApp, agreeing to unusually high rental rates, and engaging in rapport-building tactics, with red flags pointing to potential financial fraud or phishing attempts. Our catch of the day comes from listener William, who spotted a phishing scam disguised as a security alert about a compromised crypto wallet, featuring an unsolicited QR code and a generic warning that targets even non-crypto users.

Resources and links to stories:
Navigating Workplace Security: Red Team Insights for the Return to Office
FCC warns of 50-state scam by fraudsters posing as mortgage lenders
FCC ENFORCEMENT ADVISORY
I'm saying "Yes" to the Chinese long-term rental WhatsApp chat asking for video

You can hear more from the T-Minus space daily show here. Have a Catch of the Day you'd like to share? Email it to us at hackinghumans@n2k.com.
In this episode of The BlueHat Podcast, hosts Nic Fillingham and Wendy Zenone are joined by BlueHat 2024 presenter Joe Bialek, a security engineer at Microsoft with over 13 years of experience. Joe shares his fascinating journey from intern to red team pioneer, recounting how he helped establish the Office 365 Red Team and pushed the boundaries of ethical hacking within Microsoft. He discusses his formative years building sneaky hacking tools, navigating the controversial beginnings of red teaming, and transitioning to the Windows Security Team to focus on low-level security and mitigations. Joe reflects on the challenges of internal hacking, the human reactions to being "hacked," and the value of strengthening defenses before external threats arise.

In This Episode You Will Learn:
How Microsoft is developing tooling to identify and address bad programming patterns
Why kernel-related discussions are primarily focused on Windows and driver developers
The challenges developers face when reading and writing through pointers in C or C++

Some Questions We Ask:
How does working with the Windows kernel impact system security and performance?
What sets Windows kernel and driver development apart from other types of development?
Why should internal teams test systems for vulnerabilities before external hackers?

Resources:
View Joe Bialek on LinkedIn
View Wendy Zenone on LinkedIn
View Nic Fillingham on LinkedIn
BlueHat 2024 Session: Pointer Problems – Why We're Refactoring the Windows Kernel

Related Microsoft Podcasts:
Microsoft Threat Intelligence Podcast
Afternoon Cyber Tea with Ann Johnson
Uncovering Hidden Risks

Discover and follow other Microsoft podcasts at microsoft.com/podcasts

The BlueHat Podcast is produced by Microsoft and distributed as part of the N2K media network.
Big thank you to Cisco for sponsoring this video!

Hackers are hacking AI models. Prompt injection attacks are happening all the time. AIs are hallucinating and giving incorrect information. The AI models you download could be made by hackers. Your users are posting confidential information like passwords and API keys into online AI models. Developers are leveraging AI systems in their applications without checking that the AI models are not open to prompt injections. Read more here: https://blogs.cisco.com/security/cisc...

We need a way to protect AI systems. And Cisco has a solution.

// DJ Sampath's SOCIALS //
LinkedIn: / djsampath
Twitter/X: / djsampath

// David's SOCIALS //
Discord: discord.com/invite/usKSyzb
Twitter: www.twitter.com/davidbombal
Instagram: www.instagram.com/davidbombal
LinkedIn: www.linkedin.com/in/davidbombal
Facebook: www.facebook.com/davidbombal.co
TikTok: tiktok.com/@davidbombal

// MY STUFF //
https://www.amazon.com/shop/davidbombal

// SPONSORS //
Interested in sponsoring my videos? Reach out to my team here: sponsors@davidbombal.com

// MENU //
0:00 - Coming up
0:49 - Securing A.I.
01:23 - The dangers of downloading open-source A.I. models
06:29 - Securing A.I. models
07:52 - The future of companies and A.I.
10:58 - Introducing Cisco AI Defense
13:33 - How to break an A.I. model and how to prevent it
16:08 - One-time protection
17:48 - Securing A.I. access
18:17 - What Cisco AI Defense provides
19:09 - Will Cisco AI Defense block attacks?
22:29 - The effects of Data Poisoning
24:38 - How will Cisco AI Defense be deployed
26:18 - When it will be available
26:30 - Conclusion

Please note that links listed may be affiliate links and provide me with a small percentage/kickback should you use them to purchase any of the items listed or recommended. Thank you for supporting me and this channel!

Disclaimer: This video is for educational purposes only.
Learn what ethical hackers can teach us about the next era of artificial intelligence.

We speak with Michael Skelton, VP of Operations, and Sajeeb Lohani, Global TISO for Bugcrowd, on the latest edition of 'Inside The Mind Of A Hacker'.

We're also joined by CJ Fairhead, a Senior Penetration Tester who is OSCP certified, security obsessed, and a tinkerer of things. Passionate about combining years of internal IT experience with his security knowledge for Red Team engagements, CJ is involved in the bug bounty scene and works on giving back to the community through tool development, blog posts, or just general advice.

In the latest edition of ITMOAH, dive inside the minds of 1,000 hackers and see your organization from a new perspective, with the latest analysis on security researchers and their transformative use of generative AI.

For more information and to access more, including the Bugcrowd Report series, visit https://mysecuritymarketplace.com/bugcrowd-register-to-access/

#bugcrowd #cisoseries #mysecuritytv #cybersecurity #ITMOAH #ethicalhackers
This week on Hacker And The Fed, former FBI agent Chris Tarbell and ex-black hat hacker Hector Monsegur discuss Yahoo's controversial decision to lay off its red team, the rise of North Korean IT workers infiltrating U.S. companies, and the ethical dilemmas around hacking. They also reflect on the desensitization to data breaches, debate the significance of protecting medical history, and share candid moments about their personal lives and experiences in the industry. Send HATF your questions at questions@hackerandthefed.com.
In this episode of The BlueHat Podcast, hosts Nic Fillingham and Wendy Zenone are joined by Johann Rehberger, security expert and Red Team director at Electronic Arts. Johann shares his career journey through roles at Microsoft, Uber, and EA, highlighting his expertise in red teaming and cybersecurity. Johann shares the inspiration behind his book on Red Team strategies and discusses his BlueHat 2024 talk on prompt injection vulnerabilities, a critical and evolving AI security challenge. Johann breaks down the distinction between prompt injection and jailbreaking, offering insights into the potential risks, including data exfiltration and system unavailability, and emphasizes the importance of securing Red Teams themselves.

In This Episode You Will Learn:
Why AI tools should have stricter default settings to control what kind of outputs they generate
The importance of reading technical documentation to understand how AI systems are built
Why developers should implement stronger filters for what tokens are allowed to be emitted by LLMs

Some Questions We Ask:
How are prompt injection and SQL injection similar, and how are they different?
What is AI spyware, and how does it exploit memory tools in ChatGPT?
Does AI jailbreaking access the LLM's core system like iPhone jailbreaking does the OS?

Resources:
View Johann Rehberger on LinkedIn
View Wendy Zenone on LinkedIn
View Nic Fillingham on LinkedIn

Related Microsoft Podcasts:
Microsoft Threat Intelligence Podcast
Afternoon Cyber Tea with Ann Johnson
Uncovering Hidden Risks

Discover and follow other Microsoft podcasts at microsoft.com/podcasts
In this episode of The Cognitive Revolution, Nathan interviews Andrew White, Professor of Chemical Engineering at the University of Rochester and Head of Science at Future House. We explore groundbreaking AI systems for scientific discovery, including PaperQA and Aviary, and discuss how large language models are transforming research. Join us for an insightful conversation about the intersection of AI and scientific advancement with this pioneering researcher in his first-ever podcast appearance.

Check out Future House: https://www.futurehouse.org

Help shape our show by taking our quick listener survey at https://bit.ly/TurpentinePulse

SPONSORS:
Oracle Cloud Infrastructure (OCI): Oracle's next-generation cloud platform delivers blazing-fast AI and ML performance with 50% less for compute and 80% less for outbound networking compared to other cloud providers. OCI powers industry leaders with secure infrastructure and application development capabilities. New U.S. customers can get their cloud bill cut in half by switching to OCI before December 31, 2024 at https://oracle.com/cognitive
SelectQuote: Finding the right life insurance shouldn't be another task you put off. SelectQuote compares top-rated policies to get you the best coverage at the right price. Even in our AI-driven world, protecting your family's future remains essential. Get your personalized quote at https://selectquote.com/cognitive
Shopify: Shopify is the world's leading e-commerce platform, offering a market-leading checkout system and exclusive AI apps like Quikly. Nobody does selling better than Shopify. Get a $1 per month trial at https://shopify.com/cognitive

CHAPTERS:
(00:00:00) Teaser
(00:01:13) About the Episode
(00:04:37) Andrew White's Journey
(00:10:23) GPT-4 Red Team
(00:15:33) GPT-4 & Chemistry
(00:17:54) Sponsors: Oracle Cloud Infrastructure (OCI) | SelectQuote
(00:20:19) Biology vs Physics
(00:23:14) Conceptual Dark Matter
(00:26:27) Future House Intro
(00:30:42) Semi-Autonomous AI
(00:35:39) Sponsors: Shopify
(00:37:00) Lab Automation
(00:39:46) In Silico Experiments
(00:45:22) Cost of Experiments
(00:51:30) Multi-Omic Models
(00:54:54) Scale and Grokking
(01:00:53) Future House Projects
(01:10:42) Paper QA Insights
(01:16:28) Generalizing to Other Domains
(01:17:57) Using Figures Effectively
(01:22:01) Need for Specialized Tools
(01:24:23) Paper QA Cost & Latency
(01:27:37) Aviary: Agents & Environments
(01:31:42) Black Box Gradient Estimation
(01:36:14) Open vs Closed Models
(01:37:52) Improvement with Training
(01:40:00) Runtime Choice & Q-Learning
(01:43:43) Narrow vs General AI
(01:48:22) Future Directions & Needs
(01:53:22) Future House: What's Next?
(01:55:32) Outro

SOCIAL LINKS:
Website: https://www.cognitiverevolution.ai
Twitter (Podcast): https://x.com/cogrev_podcast
Twitter (Nathan): https://x.com/labenz
LinkedIn: https://www.linkedin.com/in/nathanlabenz/
Youtube: https://www.youtube.com/@CognitiveRevolutionPodcast
Apple: https://podcasts.apple.com/de/podcast/the-cognitive-revolution-ai-builders-researchers-and/id1669813431
Spotify: https://open.spotify.com/show/6yHyok3M3BjqzR0VB5MSyk
Dive into the latest #CryingOutCloud episode featuring Johann Rehberger! Join Eden and Amitai as they sit down with Johann Rehberger, Red Team Director at @electronicarts and a cybersecurity expert. Johann also publishes innovative security research on his blog, Embrace the Red. What you'll learn:
This week on Hacker And The Fed, former FBI agent Chris Tarbell and ex-black hat hacker Hector Monsegur discuss key cybersecurity challenges, from the effectiveness of phishing training to the ethical dilemmas of vulnerability disclosure. They explore how technical controls and employee education can work together to defend against increasingly sophisticated attacks, including SMS and social media phishing. They also dive into career advice for transitioning from Blue Team to Red Team roles and the complexities of the cybersecurity job market. And to close out, a heartfelt Thanksgiving message.
Colin Bell, Rob Cuddy, and Kris Duer from HCL Software bring you another insightful Application Paranoia session. In this episode our special guest is Mark Spears.

Mark is currently a Principal Security Consultant at Solis Security. Having spent significant time as a network defender and vCISO writing and testing InfoSec programs and dealing with auditors and endless reporting, he has now re-focused his time on penetration testing to get his fill of offensive security operations. So Red Pill or Blue Pill?

A lot of his most recent education and skill focus has been on helping companies with their web application security through Secure-SDLC practices, including configuration of Web Application Firewalls and Zero Trust solutions. When not enjoying his work at Solis Security, he can be found practicing physical security, lock picking, social engineering, or hardware hacking. Or, out on a Harley-Davidson!
In this episode of the Breaking Badness Cybersecurity Podcast, Jason Haddix dives into his unique journey from red teaming and pentesting to leading security teams as a CISO in high-profile organizations, including a top gaming company. Jason unpacks the distinct challenges of securing a gaming company, where risks come not only from state actors but also from clout-seeking young hackers. He shares valuable insights on building scalable security programs, secrets management, and the importance of radical transparency in corporate security cultures. Tune in to hear why, in Jason's words, "gaming saved me from a misspent youth," and learn about his latest ventures into offensive security training and AI-driven security solutions.
We are recording our next big recap episode and taking questions! Submit questions and messages on Speakpipe here for a chance to appear on the show!

Also subscribe to our calendar for our Singapore, NeurIPS, and all upcoming meetups!

In our first ever episode with Logan Kilpatrick we called out the two hottest LLM frameworks at the time: LangChain and Dust. We've had Harrison from LangChain on twice (as a guest and as a co-host), and we've now finally come full circle as Stanislas from Dust joined us in the studio.

After stints at Oracle and Stripe, Stan had joined OpenAI to work on mathematical reasoning capabilities. He describes his time at OpenAI as "the PhD I always wanted to do" while acknowledging the challenges of research work: "You're digging into a field all day long for weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13th second, you're like, 'oh, yeah, that was obvious.' And you go back to digging." This experience, combined with early access to GPT-4's capabilities, shaped his decision to start Dust: "If we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down."

The History of Dust

Dust's journey can be broken down into three phases:

* Developer Framework (2022): Initially positioned as a competitor to LangChain, Dust started as a developer tooling platform. While both were open source, their approaches differed – LangChain focused on broad community adoption and integration as a pure developer experience, while Dust emphasized UI-driven development and better observability that wasn't just `print` statements.

* Browser Extension (Early 2023): The company pivoted to building XP1, a browser extension that could interact with web content. This experiment helped validate user interaction patterns with AI, even while using less capable models than GPT-4.

* Enterprise Platform (Current): Today, Dust has evolved into an infrastructure platform for deploying AI agents within companies, with impressive metrics like 88% daily active users in some deployments.

The Case for Being Horizontal

The big discussion for early stage companies today is whether or not to be horizontal or vertical. Since models are so good at general tasks, a lot of companies are building vertical products that take care of a workflow end-to-end in order to offer more value and become more of "Services as Software". Dust, on the other hand, is a platform for the users to build their own experiences, which has had a few advantages:

* Maximum Penetration: Dust reports 60-70% weekly active users across entire companies, demonstrating the potential reach of horizontal solutions rather than selling into a single team.

* Emergent Use Cases: By allowing non-technical users to create agents, Dust enables use cases to emerge organically from actual business needs rather than prescribed solutions.

* Infrastructure Value: The platform approach creates lasting value through maintained integrations and connections, similar to how Stripe's value lies in maintaining payment infrastructure. Rather than relying on third-party integration providers, Dust maintains its own connections to ensure proper handling of different data types and structures.

The Vertical Challenge

However, this approach comes with trade-offs:

* Harder Go-to-Market: As Stan talked about: "We spike at penetration... but it makes our go-to-market much harder.
Vertical solutions have a go-to-market that is much easier because they're like, 'oh, I'm going to solve the lawyer stuff.'"

* Complex Infrastructure: Building a horizontal platform requires maintaining numerous integrations and handling diverse data types appropriately – from structured Salesforce data to unstructured Notion pages. As you scale integrations, the cost of maintaining them also scales.

* Product Surface Complexity: Creating an interface that's both powerful and accessible to non-technical users requires careful design decisions, down to avoiding technical terms like "system prompt" in favor of "instructions."

The Future of AI Platforms

Stan initially predicted we'd see the first billion-dollar single-person company in 2023 (a prediction later echoed by Sam Altman), but he's now more focused on a different milestone: billion-dollar companies with engineering teams of just 20 people, enabled by AI assistance.

This vision aligns with Dust's horizontal platform approach – building the infrastructure that allows small teams to achieve outsized impact through AI augmentation. Rather than replacing entire job functions (the vertical approach), they're betting on augmenting existing workflows across organizations.

Full YouTube Episode

Chapters

* 00:00:00 Introductions
* 00:04:33 Joining OpenAI from Paris
* 00:09:54 Research evolution and compute allocation at OpenAI
* 00:13:12 Working with Ilya Sutskever and OpenAI's vision
* 00:15:51 Leaving OpenAI to start Dust
* 00:18:15 Early focus on browser extension and WebGPT-like functionality
* 00:20:20 Dust as the infrastructure for agents
* 00:24:03 Challenges of building with early AI models
* 00:28:17 LLMs and Workflow Automation
* 00:35:28 Building dependency graphs of agents
* 00:37:34 Simulating API endpoints
* 00:40:41 State of AI models
* 00:43:19 Running evals
* 00:46:36 Challenges in building AI agents infra
* 00:49:21 Buy vs. build decisions for infrastructure components
* 00:51:02 Future of SaaS and AI's Impact on Software
* 00:53:07 The single employee $1B company race
* 00:56:32 Horizontal vs. vertical approaches to AI agents

Transcript

Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, partner and CTO at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.

Swyx [00:00:11]: Hey, and today we're in a studio with Stanislas, welcome.

Stan [00:00:14]: Thank you very much for having me.

Swyx [00:00:16]: Visiting from Paris.

Stan [00:00:17]: Paris.

Swyx [00:00:18]: And you have had a very distinguished career. It's very hard to summarize, but you went to college at both École Polytechnique and Stanford, and then you worked in a number of places, Oracle, Totems, Stripe, and then OpenAI pre-ChatGPT. We'll talk, we'll spend a little bit of time about that. About two years ago, you left OpenAI to start Dust. I think you were one of the first OpenAI alum founders.

Stan [00:00:40]: Yeah, I think it was about at the same time as the Adept guys, so that first wave.

Swyx [00:00:46]: Yeah, and people really loved our David episode. We love a few sort of OpenAI stories, you know, from back in the day, like we're talking about pre-recording. Probably the statute of limitations on some of those stories has expired, so you can talk a little bit more freely without them coming after you. But maybe we'll just talk about, like, what was your journey into AI? You know, you were at Stripe for almost five years, there are a lot of Stripe alums going into OpenAI.
I think the Stripe culture has come into OpenAI quite a bit.

Stan [00:01:11]: Yeah, so I think the buses of Stripe people really started flowing in, I guess, after ChatGPT. But, yeah, my journey into AI is a... I mean, Greg Brockman. Yeah, yeah. From Greg, of course. And Daniela, actually, back in the days, Daniela Amodei.

Swyx [00:01:27]: Yes, she was COO, I mean, she is COO, yeah. She had a pretty high job at OpenAI at the time, yeah, for sure.

Stan [00:01:34]: My journey started as anybody else, you're fascinated with computer science and you want to make them think, it's awesome, but it doesn't work. I mean, it was a long time ago, I was like maybe 16, so it was 25 years ago. Then the first big exposure to AI would be at Stanford, and I'm going to, like, disclose how old I am, because at the time it was a class taught by Andrew Ng, and there was no deep learning. It was hand-crafted features for vision and the A* algorithm. So it was fun. But it was the early days of deep learning. At the time, I think a few years after, it was the first project at Google. But you know, that cat face or the human face trained from many images. I hesitated doing a PhD, more in systems, eventually decided to go into getting a job. Went to Oracle, started a company, did a gazillion mistakes, got acquired by Stripe, worked with Greg Brockman there. And at the end of Stripe, I started getting interested in AI again, felt like it was the time, you had the Atari games, you had the self-driving craziness at the time. And I started exploring projects, it felt like the Atari games were incredible, but they were still games. And I was looking into exploring projects that would have an impact on the world. And so I decided to explore three things, self-driving cars, cybersecurity and AI, and math and AI. I'm listing them in decreasing order of impact on the world, I guess.

Swyx [00:03:01]: Discovering new math would be very foundational.

Stan [00:03:03]: It is extremely foundational, but it's not as direct as driving people around.

Swyx [00:03:07]: Sorry, you're doing this at Stripe, you're like thinking about your next move.

Stan [00:03:09]: No, it was at Stripe, kind of a bit of time where I started exploring. I did a bunch of work with friends on trying to get RC cars to drive autonomously. Almost started a company in France or Europe about self-driving trucks. We decided to not go for it because it was probably very operational. And I think the idea of the company, of the team wasn't there. And also I realized that if I wake up a day and because of a bug I wrote, I killed a family, it would be a bad experience. And so I just decided like, no, that's just too crazy. And then I explored cybersecurity with a friend. We were trying to apply transformers to coverage-guided fuzzing. So with fuzzing, you have kind of an algorithm that goes really fast and tries to mutate the inputs of a library to find bugs. And we tried to apply a transformer to that and do reinforcement learning with the signal of how much you propagate within the binary. Didn't work at all because the transformers are so slow compared to evolutionary algorithms that it kind of didn't work. Then I got interested in math and AI and started working on SAT solving with AI. And at the same time, OpenAI was kind of starting the reasoning team that was tackling that project as well. I was in touch with Greg and eventually got in touch with Ilya and finally found my way to OpenAI. I don't know how much you want to dig into that.
The way to find your way to OpenAI when you're in Paris was kind of an interesting adventure as well.

Swyx [00:04:33]: Please. And I want to note, this was a two-month journey. You did all this in two months.

Stan [00:04:38]: The search.

Swyx [00:04:40]: Your search for your next thing, because you left in July 2019 and then you joined OpenAI in September.

Stan [00:04:45]: I'm going to be ashamed to say that.

Swyx [00:04:47]: You were searching before. I was searching before.

Stan [00:04:49]: I mean, it's normal. No, the truth is that I moved back to Paris through Stripe and I just felt the hardship of being remote from your team nine hours away. And so it kind of freed a bit of time for me to start the exploration before. Sorry, Patrick. Sorry, John.

Swyx [00:05:05]: Hopefully they're listening. So you joined OpenAI from Paris and from like, obviously you had worked with Greg, but not

Stan [00:05:13]: anyone else. No. Yeah. So I had worked with Greg, but not Ilya, but I had started chatting with Ilya and Ilya was kind of excited because he knew that I was a good engineer through Greg, I presume, but I was not a trained researcher, didn't do a PhD, never did research. And I started chatting and he was excited all the way to the point where he was like, hey, come pass interviews, it's going to be fun. I think he didn't care where I was, he just wanted to try working together. So I go to SF, go through the interview process, get an offer. And so I get Bob McGrew on the phone for the first time, he's like, hey, Stan, it's awesome. You've got an offer. When are you coming to SF? I'm like, hey, it's awesome. I'm not coming to SF. I'm based in Paris and we just moved. He was like, hey, it's awesome. Well, you don't have an offer anymore. Oh, my God. No, it wasn't as hard as that. But that's basically the idea. And it took maybe a couple more times of chatting and they eventually decided to try a contractor setup. And that's how I kind of started working at OpenAI, officially as a contractor, but in practice really felt like being an employee.

Swyx [00:06:14]: What did you work on?

Stan [00:06:15]: So it was solely focused on math and AI, and in particular on the study of the large language models' mathematical reasoning capabilities, in particular in the context of formal mathematics. The motivation was simple, transformers are very creative, but yet they do mistakes. Formal math systems have the ability to verify a proof, but the tactics they can use to solve problems are very mechanical, so you miss the creativity. And so the idea was to try to explore both together. You would get the creativity of the LLMs and the kind of verification capabilities of the formal system. A formal system, just to give a little bit of context, is a system in which a proof is a program and the formal system is a type system, a type system that is so evolved that you can verify the program. If the type checks, it means that the program is correct.

Swyx [00:07:06]: Is the verification much faster than actually executing the program?

Stan [00:07:12]: Verification is instantaneous, basically. So the truth is that what you code involves tactics that may involve computation to search for solutions. So it's not instantaneous. You do have to do the computation to expand the tactics into the actual proof.
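To make the proof-as-program idea concrete, here is a minimal Lean 4 sketch (our illustration, not something from the episode; the theorem names are ours, and it assumes a recent Lean 4 toolchain where `Nat.add_comm` and the `omega` tactic are available). The proposition is a type, a proof is a program of that type, and the kernel's type check is the verification step Stan describes; the tactic version shows where the search-time computation he mentions comes in.

```lean
-- Propositions as types: to prove `a + b = b + a`, write a program
-- (a term) whose type is that proposition. If it type-checks, the
-- theorem is proved.
theorem add_comm_direct (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b  -- a ready-made proof term from the standard library

-- The same statement proved via a tactic: `omega` searches for a proof
-- term (that search is the potentially slow computation), after which
-- the kernel type-checks the resulting term almost instantaneously.
theorem add_comm_tactic (a b : Nat) : a + b = b + a := by
  omega
```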
The verification of the proof at the very low level is instantaneous.

Swyx [00:07:32]: How quickly do you run into, like, you know, halting-problem or P-vs-NP type things, like impossibilities where you're just like that?

Stan [00:07:39]: I mean, you don't run into it. At the time, it was really trying to solve very easy problems. So I think the... Can you give an example of easy? Yeah, so that's the MATH benchmark that everybody knows today. The Dan Hendrycks one. The Dan Hendrycks one, yeah. And I think it was the low end part of the MATH benchmark at the time, because that benchmark includes AMC problems, AMC 8, AMC 10, 12. So these are the easy ones. Then AIME problems, somewhat harder, and some IMO problems, which are, like, crazy hard.

Swyx [00:08:07]: For our listeners, we covered this in our Benchmarks 101 episode. AMC is literally the grade of like high school, grade 8, grade 10, grade 12. So you can solve this. Just briefly to mention this, because I don't think we'll touch on this again. There's a bit of work with like Lean, and then with, you know, more recently with DeepMind doing like scoring like silver on the IMO. Any commentary on like how math has evolved from your early work to today?

Stan [00:08:34]: I mean, that result is mind blowing. I mean, from my perspective, I spent three years on that. At the same time, Guillaume Lample in Paris, we were both in Paris, actually. He was at FAIR, was working on some problems. We were pushing the boundaries, and the goal was the IMO. And we cracked a few problems here and there. But the idea of getting a medal at an IMO was like just remote. So this is an impressive result. And we can, I think the DeepMind team just did a good job of scaling. I think there's nothing too magical in their approach, even if it hasn't been published. There's a David Silver talk from seven days ago where it goes a little bit into more details. It feels like there's nothing magical there. It's really applying reinforcement learning and scaling up the amount of data they can generate through autoformalization. So we can dig into what autoformalization means if you want.

Alessio [00:09:26]: Let's talk about the tail end, maybe, of the OpenAI chapter. So you joined, and you're like, I'm going to work on math and do all of these things. I saw on one of your blog posts, you mentioned you fine-tuned over 10,000 models at OpenAI using 10 million A100 hours. How did the research evolve from GPT-2, and then getting closer to davinci-003? And then you left just before ChatGPT was released, but tell people a bit more about the research path that took you there.

Stan [00:09:54]: I can give you my perspective of it. I think at OpenAI, there's always been a large chunk of the compute that was reserved to train the GPTs, which makes sense. So this was pre-Anthropic split. Most of the compute was going to a product called Nest, which was basically GPT-3. And then you had a bunch of, let's say, remote, not core research teams that were trying to explore maybe more specific problems or maybe the algorithm part of it. The interesting part, I don't know if it was where your question was going, is that in those labs, you're managing researchers. So by definition, you shouldn't be managing them. But in that space, there's a managing tool that is great, which is compute allocation. Basically by managing the compute allocation, you can signal to the teams where you think the priority should go. And so it was really a question of, you were free as a researcher to work on whatever you wanted.
But if it was not aligned with the OpenAI mission, and that's fair, you wouldn't get the compute allocation. As it happens, solving math was very much aligned with the direction of OpenAI. And so I was lucky to generally get the compute I needed to make good progress.

Swyx [00:11:06]: What do you need to show as incremental results to get funded for further results?

Stan [00:11:12]: It's an imperfect process because there's a bit of a... If you're working on math and AI, obviously there's kind of a prior that it's going to be aligned with the company. So it's much easier than to go into something much more risky, much riskier, I guess. You have to show incremental progress, I guess. It's like you ask for a certain amount of compute and you deliver a few weeks after, and you demonstrate that you have progress. Progress might be a positive result. Progress might be a strong negative result. And a strong negative result is actually often much harder to get or much more interesting than a positive result. And then it generally goes into, as any organization, you would have people finding your project or any other project cool and fancy. And so you would have that kind of phase of growing compute allocation for it all the way to a point. And then maybe you reach an apex and then maybe you go back mostly to zero and restart the process because you're going in a different direction or something else. That's how I felt. Explore, exploit. Yeah, exactly. Exactly. Exactly. It's a reinforcement learning approach.

Swyx [00:12:14]: Classic PhD student search process.

Alessio [00:12:17]: And you were reporting to Ilya, like the results you were kind of bringing back to him or like what's the structure? It's almost like when you're doing such cutting edge research, you need to report to somebody who is actually really smart to understand that the direction is right.

Stan [00:12:29]: So we had a reasoning team, which was working on reasoning, obviously, and so math in general. And that team had a manager, but Ilya was extremely involved in the team as an advisor, I guess. Since he brought me into OpenAI, I was lucky, mostly during the first years, to have kind of direct access to him. He would really coach me as a trainee researcher, I guess, with good engineering skills. And Ilya, I think at OpenAI, he was the one showing the North Star, right? It was his job and I think he really enjoyed it and he did it super well, going through the teams and saying, this is where we should be going and trying to, you know, flock the different teams together towards an objective.

Swyx [00:13:12]: I would say like the public perception of him is that he was the strongest believer in scaling. Oh, yeah. Obviously, he has always pursued the compression thesis. You have worked with him personally, what does the public not know about how he works?

Stan [00:13:26]: I think he's really focused on building the vision and communicating the vision within the company, which was extremely useful. I was personally surprised that he spent so much time, you know, working on communicating that vision and getting the teams to work together versus...

Swyx [00:13:40]: To be specific, vision is AGI? Oh, yeah.

Stan [00:13:42]: Vision is like, yeah, it's the belief in compression and scaling compute. I remember when I started working on the Reasoning team, the excitement was really about scaling the compute around Reasoning and that was really the belief we wanted to ingrain in the team.
And that's what has been useful to the team. The DeepMind results, and the success of GPT-4 and stuff, show that it was the right approach.Swyx [00:14:06]: Was it according to the neural scaling laws, the Kaplan paper that was published?Stan [00:14:12]: I think it was before that, because those ones came with GPT-3, basically at the time of GPT-3 being released or being ready internally. But before that, there really was a strong belief in scale. I think it was just the belief that the transformer was a generic enough architecture that you could learn anything. And that it was just a question of scaling.Alessio [00:14:33]: Any other fun stories you want to tell? Sam Altman, Greg, you know, anything.Stan [00:14:37]: Weirdly, I didn't work that much with Greg when I was at OpenAI. He had always been mostly focused on training the GPTs, and rightfully so. One thing about Sam Altman, he really impressed me because when I joined, he had joined not that long ago, and it felt like he was kind of a very high-level CEO. And I was mind-blown by how deep he was able to go into the subjects within a year or something, all the way to a situation where, when I was having lunch with him by year two at OpenAI, he would just know quite deeply what I was doing. With no ML background. Yeah, with no ML background, but I didn't have any either, so I guess that explains why. But I think it's a question of, you don't necessarily need to understand the very technicalities of how things are done, but you need to understand what's the goal, and what's being done, and what are the recent results, and all of that. And we could have kind of a very productive discussion. And that really impressed me, given the size at the time of OpenAI, which was not negligible.Swyx [00:15:44]: Yeah. I mean, you were a founder before, you're a founder now, and you've seen Sam as a founder. How has he affected you as a founder?Stan [00:15:51]: I think having that capability of changing the scale of your attention in the company, because most of the time you operate at a very high level, but being able to go deep down and being in the know of what's happening on the ground is something that I feel is really enlightening. That's not a place in which I ever was as a founder, because first company, we went all the way to 10 people. Current company, there's 25 of us. So the high level, the sky and the ground are pretty much at the same place. No, you're being too humble.Swyx [00:16:21]: I mean, Stripe was also like a huge rocket ship.Stan [00:16:23]: At Stripe, I wasn't a founder. So, like at OpenAI, I was really happy being on the ground, pushing the machine, making it work. Yeah.Swyx [00:16:31]: Last OpenAI question. The Anthropic split you mentioned, you were around for that. Very dramatic. Dario also left around that time, you left. This year, we've also had a similar management shakeup, let's just call it. Can you compare what it was like going through that split during that time? And then, does that have any similarities now? Like, are we going to see a new Anthropic emerge from these folks that just left?Stan [00:16:54]: That I really, really don't know. At the time, the split was pretty surprising because they had been training GPT-3, it was a success. And to be completely transparent, I wasn't in the weeds of the split. What I understood of it is that there was a disagreement about the commercialization of that technology.
I think the focal point of that disagreement was the fact that we started working on the API and wanted to make those models available through an API. Is that really the core disagreement? I don't know.Swyx [00:17:25]: Was it safety?Stan [00:17:26]: Was it commercialization?Swyx [00:17:27]: Or did they just want to start a company?Stan [00:17:28]: Exactly. Exactly. That I don't know. But I think what I was surprised by is how quickly OpenAI recovered at the time. And I think it's just because we were mostly a research org and the mission was so clear that some divergence in some teams, some people leave, the mission is still there. We have the compute. We have a site. So it just keeps going.Swyx [00:17:50]: Very deep bench. Like, just a lot of talent. Yeah.Alessio [00:17:53]: So that was the OpenAI part of the history. Exactly. So then you leave OpenAI in September 2022. And I would say in Silicon Valley, the two hottest companies at the time were you and LangChain. What was that start like, and why did you decide to start with a more developer-focused, kind of like an AI engineer tool, rather than going back into some more research or something else?Stan [00:18:15]: Yeah. First, I'm not a trained researcher. So going through OpenAI was really kind of the PhD I always wanted to do. But research is hard. You're digging into a field all day long for weeks and weeks and weeks, and you find something, you get super excited for 12 seconds. And at the 13th second, you're like, oh, yeah, that was obvious. And you go back to digging. I'm not a formally trained researcher, and it wasn't necessarily an ambition of mine to have a research career. And I felt the hardness of it. I enjoyed a ton of it. But at the time, I decided that I wanted to go back to something more productive. And the other fun motivation was, I mean, if we believe in AGI and if we believe the timelines might not be too long, it's actually the last train leaving the station to start a company. After that, it's going to be computers all the way down. And so that was kind of the true motivation for trying to go there. So that's kind of the core personal motivation at the beginning. And the motivation for starting a company was pretty simple. I had seen GPT-4 internally at the time, it was September 2022. So it was pre-ChatGPT, but GPT-4 was ready; I mean, it had been ready for a few months internally. I was like, okay, that's obvious, the capabilities are there to create an insane amount of value for the world. And yet the deployment is not there yet. The revenues of OpenAI at the time were ridiculously small compared to what they are today. So the thesis was, there's probably a lot to be done at the product level to unlock the usage.Alessio [00:19:49]: Yeah. Let's talk a bit more about the form factor, maybe. I think one of the first successes you had was kind of like the WebGPT-like thing, like using the models to traverse the web and summarize things. And the browser was really the interface. Why did you start with the browser? Like, why was it important? And then you built XP1, which was kind of like the browser extension.Stan [00:20:09]: So the starting point at the time was, if you wanted to talk about LLMs, it was still a rather small community, a community of mostly researchers and, to some extent, very early adopters, very early engineers.
It was almost inconceivable to just build a product and go sell it to the enterprise, though at the time there were a few companies doing that. The one in marketing, I don't remember its name... Jasper. But so the natural first intention, the first, first, first intention was to go to the developers and try to create tooling for them to create products on top of those models. And so that's what Dust was originally. It was quite different from LangChain, and LangChain just beat the s**t out of us, which is great. It's a choice.Swyx [00:20:53]: You were cloud, and closed source. They were open source.Stan [00:20:56]: Yeah. So technically we were open source and we still are open source, but I think that doesn't really matter. I had the strong belief from my research time that you cannot create an LLM-based workflow on just one example. Basically, if you just have one example, you overfit. So as you develop your interaction, your orchestration around the LLM, you need a dozen examples. Obviously, if you're running a dozen examples on a multi-step workflow, you start parallelizing stuff. And if you do that in the console, you just have a messy stream of tokens going out, and it's very hard to observe what's going on there. And so the idea was to go with a UI so that you could easily introspect the output of each interaction with the model and dig in there through a UI, which is-Swyx [00:21:42]: Was that open source? I actually didn't come across it.Stan [00:21:44]: Oh yeah, it was. I mean, Dust is entirely open source even today. We're not going for an open source-Swyx [00:21:48]: If it matters, I didn't know that.Stan [00:21:49]: No, no, no, no, no. We're not open source because we're doing an open source strategy. It's not an open source go-to-market at all. We're open source because we can and it's fun.Swyx [00:21:59]: Open source is marketing. You have all the downsides of open source, which is like, people can clone you.Stan [00:22:03]: But I think that downside is a big fallacy. Okay. Yes, anybody can clone Dust today, but the value of Dust is not the current state. The value of Dust is the number of eyeballs and hands of developers that are contributing to it in the future. And so yes, anybody can clone it today, but that wouldn't change anything. There is some value in being open source. In a discussion with a security team, you can be extremely transparent and just show the code. When you have a discussion with users and there's a bug or a feature missing, you can just point to the issue, show the pull request, exactly, oh, PR welcome. That doesn't happen that much, but you can show the progress, and if the person that you're chatting with is a little bit technical, they really enjoy seeing the pull request advancing all the way to deploy. And then the downsides are mostly around security. You never want to do security by obfuscation. But the truth is that your vector of attack is facilitated by you being open source. But at the same time, it's a good thing, because if you're doing anything like bug bounties or stuff like that, you just give many more tools to the bug bounty hunters so that their output is much better. So there's many, many, many trade-offs. I don't believe in the value of the code base per se. I think it's really the people that are on the code base that have the value, and the go-to-market, and the product, and all of those things that are around the code base. Obviously, that's not true for every code base.
If you're working on a very secret kernel to accelerate the inference of LLMs, I would buy that you don't want to be open source. But for product stuff, I really think there's very little risk. Yeah.Alessio [00:23:39]: I signed up for XP1, I was looking: January 2023. I think at the time you were on text-davinci-003. Given that you had seen GPT-4, how did you feel having to push a product out that was using this model that was so inferior? And you're like, please, just use it today. I promise it's going to get better. Just overall, as a founder, how do you build something that maybe doesn't quite work with the model today, but you're just expecting the new model to be better?Stan [00:24:03]: Yeah, so actually, XP1 was even on a smaller one, the small version from the post-GPT release, so it was... Ada, Babbage... No, no, no, not that far away. But it was the small version of GPT, basically. I don't remember its name. Yes, there was a frustration there. But at the same time, I think XP1 was designed, was an experiment, but was designed as a way to be useful at the current capability of the model. If you just want to extract data from a LinkedIn page, that model was just fine. If you want to summarize an article in a newspaper, that model was just fine. And so it was really a question of trying to find a product that works with the current capability, knowing that you will always have tailwinds as models get better and faster and cheaper. So that was kind of a... There's a bit of frustration because you know what's out there and you know that you don't have access to it yet. It's also interesting to try to find a product that works with the current capability.Alessio [00:24:55]: And we highlighted XP1 in our anatomy of autonomy post in April of last year, which was, you know, where are all the agents, right? So now we spent 30 minutes getting to what you're building now. So you basically had a developer framework, then you had a browser extension, then you had all these things, and then you kind of got to where Dust is today. So maybe just give people an overview of what Dust is today and the core thesis behind it. Yeah, of course.Stan [00:25:20]: So Dust, we really want to build the infrastructure so that companies can deploy agents within their teams. We are horizontal by nature because we strongly believe in the emergence of use cases from people having access to creating agents; they don't need to be developers. They have to be thinkers. They have to be curious. But anybody can create an agent that will solve an operational thing that they're doing in their day-to-day job. And to make those agents useful, there are two focuses, which is interesting. The first one is an infrastructure focus. You have to build the pipes so that the agent has access to the data. You have to build the pipes such that the agents can take action, can access the web, et cetera. So that's really an infrastructure play. Maintaining connections to Notion, Slack, GitHub, all of them, is a lot of work. It is boring work, boring infrastructure work, but that's something that we know is extremely valuable, in the same way that Stripe is extremely valuable because it maintains the pipes. And we have that dual focus because we're also building the product for people to use it. And there it's fascinating, because everything started from the conversational interface, obviously, which is a great starting point. But we're only scratching the surface, right? I think we are at the Pong level of LLM productization.
And we haven't invented the C3. We haven't invented Counter-Strike. We haven't invented Cyberpunk 2077. So this is really our mission: to create the product that lets people equip themselves to just take away all the work that can be automated or assisted by LLMs.Alessio [00:26:57]: And can you just comment on different takes that people had? So maybe the most open is like AutoGPT. It's just kind of like, just trying to do anything. It's like, it's all magic. There's no way for you to do anything. Then you had Adept, you know, we had David on the podcast. They're very, like, super hands-on with each individual customer to build something super tailored. How do you decide where to draw the line between 'this is magic' and 'this is exposed to you', especially in a market where most people don't know how to build with AI at all? So if you expect them to do the thing, they're probably not going to do it. Yeah, exactly.Stan [00:27:29]: So the AutoGPT approach obviously is extremely exciting, but we know that the agentic capabilities of models are not quite there yet. It just gets lost. So we're starting where it works. Same with XP1. And where it works is pretty simple. It's simple workflows that involve a couple of tools, where you don't even need to have the model decide which tools it uses, in the sense that you just want people to put it in the instructions. It's like: take that page, do that search, pick up that document, do the work that I want in the format I want, and give me the results. There's no smartness there, right? In terms of orchestrating the tools, it's mostly using English for people to program a workflow, where you don't have the constraint of having compatible APIs between the two.Swyx [00:28:17]: That kind of personal automation, would you say it's kind of like an LLM Zapier type of thing? Like if this, then that, and then, you know, do this, then this. You're programming with English?Stan [00:28:28]: So you're programming with English. So you're just saying, oh, do this and then that. You can even create some form of APIs. You say, when I give you the command X, do this. When I give you the command Y, do this. And you describe the workflow. But you don't have to create boxes and create the workflow explicitly. You just need to describe what the tasks are supposed to be and make the tools available to the agent. The tool can be a semantic search. The tool can be querying into a structured database. The tool can be searching on the web. And obviously, the interesting tools that we're only starting to scratch are actually creating external actions, like reimbursing something on Stripe, sending an email, clicking on a button in the admin, or something like that.
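To make that "programming with English" pattern concrete, here is a minimal sketch: the workflow lives in plain-language instructions, and the agent only needs a dispatch table of tools. Everything below (the Agent class, llm_pick_tool, the tool names) is a hypothetical illustration rather than Dust's actual API, and the model call is stubbed so the example runs as written.

```python
# All names here are hypothetical; this is the shape of the "instructions +
# tools" pattern, not Dust's actual API. The model call is stubbed out.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    name: str
    instructions: str  # the English "program", including command aliases
    tools: dict[str, Callable[[str], str]] = field(default_factory=dict)

    def run(self, command: str) -> str:
        # In a real system an LLM reads instructions + command and picks a
        # tool (function calling); here that selection step is faked.
        tool_name = llm_pick_tool(self.instructions, command, list(self.tools))
        return self.tools[tool_name](command)

def llm_pick_tool(instructions: str, command: str, tool_names: list[str]) -> str:
    return tool_names[0]  # stand-in for a model call with tool schemas

agent = Agent(
    name="weekly-digest",
    instructions=(
        "When I give you the command 'digest', search Slack for last week's "
        "incidents, then format them as a table."
    ),
    tools={
        "slack_search": lambda q: f"(slack results for {q!r})",
        "web_search": lambda q: f"(web results for {q!r})",
    },
)
print(agent.run("digest"))
```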
Swyx [00:29:11]: Do you maintain all these integrations?Stan [00:29:13]: Today, we maintain most of the integrations. We do always have an escape hatch for people to kind of custom-integrate. But the reality of the market today is that people just want it to work, right? And so it's mostly us maintaining the integrations. As an example, a very good source of information that is tricky to productize is Salesforce. Because Salesforce is basically a database and a UI. And they do the f**k they want with it. And so every company has different models and stuff like that. So right now, we don't support it natively. And the type of support, or real native support, will be slightly more complex than just OAuthing into it, like is the case with Slack, as an example. Because it's probably going to be, oh, you want to connect your Salesforce to us? Give us the SOQL. That's the Salesforce query language. Give us the queries you want us to run on it and inject in the context of Dust. So that's interesting: not only are integrations cool, some of them require a bit of work from the user. And for some of them that are really valuable to our users, but we don't support yet, they can just build them internally and push the data to us.Swyx [00:30:18]: I think I understand the Salesforce thing. But let me just clarify, are you using browser automation because there's no API for something?Stan [00:30:24]: No, no, no, no. In that case, we do have browser automation for the use cases that apply to the public web. But for most of the integrations with the internal systems of the company, it really runs through APIs.Swyx [00:30:35]: Haven't you felt the pull to RPA, browser automation, that kind of stuff?Stan [00:30:39]: I mean, what I've been saying for a long time, maybe I'm wrong, is that if the future is that you're going to stand in front of a computer looking at an agent clicking on stuff, then I hate my computer. And my computer is a big Lenovo. It's black. Doesn't sound good at all compared to a Mac. And if the APIs are there, we should use them. There is going to be a long tail of stuff that doesn't have APIs, but as the world is moving forward, that's disappearing. So the core RPA value in the past has really been, oh, this old 90s product doesn't have an API, so I need to use the UI to automate. I think for most of the companies that are ICP for us, the scale-ups that are between 500 and 5,000 people, tech companies, most of the SaaS they use have APIs. Now there's an interesting question for the open web, because there is stuff that you want to do that involves websites that don't necessarily have APIs. And the current state of web integration, which is us and OpenAI and Anthropic, and I don't even know if they have web navigation, but I don't think so, the current state of affairs is really, really broken. Because you have what? You have basically search and headless browsing. But headless browsing, I think everybody's doing basically body.innerText and filling that into the model, right?Swyx [00:31:56]: There are parsers into Markdown and stuff.Stan [00:31:58]: I'm super excited by the companies that are exploring the capability of rendering a web page in a way that is compatible for a model, being able to maintain the selectors, so basically the places where to click in the page through that process, expose the actions to the model, have the model select an action in a way that is compatible with the model, which is not a big page of full DOM that is very noisy, and then being able to decompress that back to the original page and take the action. And that's something that is really exciting and that will kind of change the level of things that agents can do on the web. That, I feel, is exciting, but I also feel that the bulk of the useful stuff that you can do within the company can be done through APIs. The data can be retrieved by API. The actions can be taken through API.Swyx [00:32:44]: For listeners, I'll note that you're basically completely disagreeing with David Luan.Stan: Exactly, exactly. We've seen it since this summer. Adept is where it is, and Dust is where it is. So Dust is still standing.
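For illustration, here is a toy version of the page-compaction idea Stan describes: keep only the actionable elements of a page with stable selectors, show the model a short numbered list instead of the noisy full DOM, and map the model's choice back to a selector. This is a hedged sketch of the general technique, not any particular company's implementation.

```python
# Toy DOM compaction: extract actionable elements, show the model a compact
# numbered list, then "decompress" its chosen index back to a selector.

from html.parser import HTMLParser

class ActionExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.actions = []  # (index, tag, selector, label)

    def handle_starttag(self, tag, attrs):
        if tag in ("a", "button", "input"):
            attrs = dict(attrs)
            # Prefer a real id; fall back to a positional pseudo-selector.
            selector = f"#{attrs['id']}" if attrs.get("id") else f"{tag}[{len(self.actions)}]"
            label = attrs.get("aria-label") or attrs.get("href") or ""
            self.actions.append((len(self.actions), tag, selector, label))

    def handle_data(self, data):
        # Attach visible text to the most recent unlabeled element.
        if self.actions and data.strip() and not self.actions[-1][3]:
            i, tag, sel, _ = self.actions[-1]
            self.actions[-1] = (i, tag, sel, data.strip())

parser = ActionExtractor()
parser.feed('<button id="buy">Buy now</button><a href="/docs">Docs</a>')

for i, tag, selector, label in parser.actions:  # compact view for the model
    print(f"[{i}] <{tag}> {label}")

choice = 0                                      # pretend the model picked [0]
print("click ->", parser.actions[choice][2])    # decompress back to selector
```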
Alessio [00:32:55]: Can we just quickly comment on function calling? You mentioned you don't need the models to be that smart to actually pick the tools. Have you seen the models not be good enough? Or is it just like, you just don't want to put the complexity in there? Like, is there any room for improvement left in function calling? Or do you feel you usually consistently get the right response, the right parameters, and all of that?Stan [00:33:15]: So that's a tricky product question. Because if the instructions are good and precise, then you don't have any issue, because it's scripted for you. And the model will just look at the script and follow it and say, oh, he's probably talking about that action, and I'm going to use it. And the parameters are kind of deduced from the state of the conversation. I'll just go with it. If you provide a very high-level, kind of AutoGPT-esque level in the instructions and provide 16 different tools to your model, yes, we're seeing the models in that state making mistakes. And there is obviously some progress that can be made on the capabilities. But the interesting part is that there is already so much work that you can assist, augment, accelerate by just going with pretty simply scripted agents. What I'm excited about, by pushing our users to create rather simple agents, is that once you have those working really well, you can create meta agents that use the agents as actions. And all of a sudden, you can kind of have a hierarchy of responsibility that will probably get you almost to the point of the AutoGPT value. It requires the construction of intermediary artifacts, but you're probably going to be able to achieve something great. I'll give you an example. Our incidents are shared in a specific Slack channel, and the stuff we ship is shared in Slack as well. We have a weekly meeting where we have a table about incidents and shipped stuff. We're not writing that weekly meeting table anymore. We have an assistant that just goes and finds the right data on Slack and creates the table for us. And that assistant works perfectly. It's trivially simple, right? Take one week of data from that channel and just create the table. And then we have in that weekly meeting, obviously, some graphs and reporting about our financials and our progress and our ARR. And we've created assistants to generate those graphs directly. And those assistants work great. By creating those assistants that cover those small parts of that weekly meeting, slowly we're getting to a world where we'll have a weekly meeting assistant. We'll just call it; you don't need to prompt it, you don't need to say anything. It's going to run those different assistants and get that Notion page ready. And by doing that, if you get there, and that's an objective for us, using Dust, to get there, you're saving an hour of company time every time you run it. Yeah.
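Stan's meta-agent idea in miniature: once the small agents work reliably, a higher-level agent can expose them as its own actions. The sub-agents below are stubs and the names are hypothetical, not Dust's API; the point is the hierarchy.

```python
# Hypothetical sketch: small agents as callables, composed by a meta agent.

from typing import Callable

def incident_table_agent() -> str:
    # Small agent: reads one week of the incidents Slack channel (faked here)
    return "| incident | status |\n| api outage | resolved |"

def arr_graph_agent() -> str:
    # Small agent: renders the weekly ARR graph (faked here)
    return "(arr graph)"

def weekly_meeting_agent(sub_agents: dict[str, Callable[[], str]]) -> str:
    # Meta agent: no prompt needed; it runs its sub-agents in a fixed order
    # and assembles the weekly-meeting page from their outputs.
    sections = [f"## {name}\n{run()}" for name, run in sub_agents.items()]
    return "\n\n".join(sections)

page = weekly_meeting_agent({
    "Incidents": incident_table_agent,
    "ARR": arr_graph_agent,
})
print(page)
```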
Alessio [00:35:28]: That's my pet topic of NPM for agents. How do you build dependency graphs of agents? And how do you share them? Because why do I have to rebuild some of the smaller levels of what you built already?Swyx [00:35:40]: I have a quick follow-up question on agents managing other agents. It's a topic of a lot of research, both from Microsoft and even in startups. What have you discovered as best practice for, let's say, a manager agent controlling a bunch of small agents? Is it two-way communication? I don't know if there should be a protocol format.Stan [00:35:59]: To be completely honest, the state we are at right now is creating the simple agents. So we haven't even explored yet the meta agents. We know it's there. We know it's going to be valuable. We know it's going to be awesome. But we're starting there because it's the simplest place to start. And it's also what the market understands. If you go to a company, a random SaaS B2B company, not necessarily specialized in AI, and you take an operational team and you tell them, build some tooling for yourself, they'll understand the small agents. If you tell them, build AutoGPT, they'll be like, Auto what?Swyx [00:36:31]: And I noticed that in your language, you're very much focused on non-technical users. You don't really mention API here. You mention instruction instead of system prompt, right? That's very conscious.Stan [00:36:41]: Yeah, it's very conscious. It's a mark of our designer, Ed, who kind of pushed us to create a friendly product. I was knee-deep into AI when I started, obviously. And my co-founder, Gabriel, was at Stripe as well. We started a company together that got acquired by Stripe 15 years ago. After that, he was at Alan, a healthcare company in Paris. So he was a little bit less knee-deep in AI, but really focused on product. And I didn't realize how important it is to make that technology not scary to end users. It didn't feel scary to me, but it was really seen by Ed, our designer, that it was feeling scary to the users. And so we were very proactive and very deliberate about creating a brand that feels not too scary, and creating a wording and a language, as you say, that really tried to communicate the fact that it's going to be fine. It's going to be easy. You're going to make it.Alessio [00:37:34]: And another big point that David had about Adept is, we need to build an environment for the agents to act. And then if you have the environment, you can simulate what they do. How's that different when you're interacting with APIs and you're kind of touching systems that you cannot really simulate? If you call the Salesforce API, you're just calling it.Stan [00:37:52]: So I think that goes back to the DNA of the companies, which are very different. Adept, I think, was a product company with a very strong research DNA, and they were still doing research. One of their goals was building a model. And that's why they raised a large amount of money, et cetera. We are 100% deliberately a product company. We don't do research. We don't train models. We don't even run GPUs. We're using the models that exist, and we try to push the product boundary as far as possible with the existing models. So that creates an issue. Indeed, so to answer your question, when you're interacting in the real world, well, you cannot simulate, so you cannot improve the models. Even improving your instructions is complicated for a builder. The hope is that you can use models to evaluate the conversations, so that you can get at least feedback, and you could get contradictory information about the performance of the assistants. But if you take actual traces of interactions of humans with those agents, it is, even for us humans, extremely hard to decide whether it was a productive interaction or a really bad interaction. You don't know why the person left. You don't know if they left happy or not. So being extremely, extremely, extremely pragmatic here, it becomes a product issue.
We have to build a product that incentivizes the end users to provide feedback, so that, as a first step, the person that is building the agent can iterate on it. As a second step, maybe later when we start training models and post-training, et cetera, we can optimize around that for each of those companies. Yeah.Alessio [00:39:17]: Do you see in the future products offering kind of like a simulation environment, the same way all SaaS now kind of offers APIs to build programmatically? Like in cybersecurity, there are a lot of companies working on building simulated environments so that you can then use agents to red team, but I haven't really seen that.Stan [00:39:34]: Yeah, no, me neither. That's a super interesting question. I think it's really going to depend on how much, because you need to simulate to generate data, and you need data to train models. And the question at the end is, are we going to be training models, or are we just going to be using frontier models as they are? On that question, I don't have a strong opinion. It might be the case that we'll be training models, because in all of those AI-first products, the model is so close to the product surface that as you get big and you want to really own your product, you're going to have to own the model as well. Owning the model doesn't mean doing the pre-training, that would be crazy. But at least having an internal post-training realignment loop, it makes a lot of sense. And so if we see many companies going towards that over time, then there might be incentives for the SaaS's of the world to provide assistance in getting there. But at the same time, there's a tension, because those SaaS, they don't want to be interacted with by agents, they want the human to click on the button. Yeah, they've got to sell seats. Exactly.Swyx [00:40:41]: Just a quick question on models. I'm sure you've used many, probably not just OpenAI. Would you characterize some models as better than others? Do you use any open source models? What have been the trends in models over the last two years?Stan [00:40:53]: We've seen over the past two years kind of a bit of a race between models. And at times, it's the OpenAI model that is the best. At times, it's the Anthropic models that are the best. Our take on that is that we are agnostic and we let our users pick their model. Oh, they choose? Yeah, so when you create an assistant or an agent, you can just say, oh, I'm going to run it on GPT-4, GPT-4 Turbo, or...Swyx [00:41:16]: Don't you think for the non-technical user, that is actually an abstraction that you should take away from them?Stan [00:41:20]: We have a sane default. So we move the default to the latest model that is cool, and it's actually not very visible. In our flow to create an agent, you would have to go into advanced settings and pick your model. So this is something that the technical person will care about. But that's something that obviously is a bit too complicated for the...Swyx [00:41:40]: And do you care most about function calling, or instruction following, or something else?Stan [00:41:44]: I think we care most for function calling, because you want to... There's nothing worse than a function call including incorrect parameters or being a bit off, because it just drives the whole interaction off.Swyx [00:41:56]: Yeah, so you've got the Berkeley Function Calling Leaderboard.Stan [00:42:00]: These days, it's funny how the comparison between GPT-4o and GPT-4 Turbo is still up in the air on function calling.
I personally don't have proof, but I know many people, and I'm probably part of them, who think that GPT-4 Turbo is still better than GPT-4o on function calling. Wow. We'll see what comes out of the o1 class if it ever gets function calling. And Claude 3.5 Sonnet is great as well. They kind of innovated in an interesting way, which was never quite publicized. It's that they have that kind of chain-of-thought step whenever you use a Claude model, or a Sonnet model, with function calling. That chain-of-thought step doesn't exist when you just interact with it for answering questions. But when you use function calling, you get that step, and it really helps getting better function calling.Swyx [00:42:43]: Yeah, we actually just recorded a podcast with the Berkeley team that runs that leaderboard this week. So they just released V3.Stan [00:42:49]: Yeah.Swyx [00:42:49]: It was V1 like two months ago, and then they did V2, V3. Turbo is on top.Stan [00:42:53]: Turbo is on top. Turbo is over 4o.Swyx [00:42:54]: And then the third place is xLAM from Salesforce, which is a large action model they've been trying to popularize.Stan [00:43:01]: Yep.Swyx [00:43:01]: o1-mini is actually on here, I think. o1-mini is number 11.Stan [00:43:05]: But arguably, o1-mini has been in line for that. Yeah.Alessio [00:43:09]: Do you use leaderboards? Do you have your own evals? I mean, this is kind of intuitive, right? Like, using the newer model is better. I think most people just upgrade. Yeah. What's the eval process like?Stan [00:43:19]: It's funny because I've been doing research for three years, and we have bigger stuff to cook. When you're deploying in a company, one thing where we really spike is that when we manage to activate the company, we have a crazy penetration. The highest penetration we have is 88% daily active users within the entire employee base of the company. The kind of average penetration and activation we have in our current enterprise customers is something more like 60% to 70% weekly active. So we basically have the entire company interacting with us. And when you're there, there is so much stuff that matters more than getting evals, getting the best model. Because there are so many places where you can create product or do stuff that will give you the 80% improvement with the work you do, whereas deciding if it's GPT-4 or GPT-4 Turbo, et cetera, you know, will just give you the 5% improvement. But the reality is that you want to focus on the places where you can really change the direction or change the interaction more drastically. But that's something that we'll have to do eventually, because we still want to be serious people.Swyx [00:44:24]: It's funny because in some ways, the model labs are competing for you, right? You don't have to make any effort. You just switch models and then it'll grow. What are you really limited by? Is it additional sources?Stan [00:44:36]: It's not models, right?Swyx [00:44:37]: You're not really limited by quality of model.Stan [00:44:40]: Right now, we are limited by the infrastructure part, which is the ability for users to connect easily to all the data they need to do the job they want to do.Swyx [00:44:51]: Because you maintain all your own stuff. You know, there are companies out there that are starting to provide integrations as a service, right? I used to work in an integrations company.
Yeah, I know.Stan [00:44:59]: It's just that there are some intricacies about how you chunk stuff and how you process information from one platform to the other. If you look at the end of the spectrum, you could think of, you could say, oh, I'm going to support Airbyte, and Airbyte has- I used to work at Airbyte.Swyx [00:45:12]: Oh, really?Stan [00:45:13]: That makes sense.Swyx [00:45:14]: They're the French founders as well.Stan [00:45:15]: I know Jean very well. I'm seeing him today. And the reality is that if you look at Notion, Airbyte does the job of taking Notion and putting it in a structured way. But the way it is, it's not really usable to actually make it available to models in a useful way. Because you get all the blocks, details, et cetera, which is useful for many use cases.Swyx [00:45:35]: It's also for data scientists and not for AI.Stan [00:45:38]: The reality of Notion is that sometimes you have a- so when you have a page, there's a lot of structure in it, and you want to capture the structure and chunk the information in a way that respects that structure. In Notion, you have databases. Sometimes those databases are real tabular data. Sometimes those databases are full of text. You want to get the distinction and understand that this database should be considered like text information, whereas this other one is actually quantitative information. And to really get a very high quality interaction with that piece of information, I haven't found a solution that will work without us owning the connection end-to-end.Swyx [00:46:15]: That's why I don't invest in, there's Composio, there's All Hands from Graham Neubig. There are all these other companies that are like, we will do the integrations for you. You just... we have the open source community, we'll do off-the-shelf. But then you are so specific in your needs that you want to own it.Swyx [00:46:28]: Yeah, exactly.Stan [00:46:29]: You can talk to Michel about that.Swyx [00:46:30]: You know, he wants to put the AI in there, but you know. Yeah, I will. I will.Stan [00:46:35]: Cool. What are we missing?Alessio [00:46:36]: You know, what are the things that are sneakily hard that you're tackling, that maybe people don't even realize are really hard?Stan [00:46:43]: The real part, as we kind of touched on throughout the conversation, is really building the infra that works for those agents, because it's tedious work. It's an evergreen piece of work, because you always have an extra integration that will be useful to a non-negligible set of your users. What I'm super excited about is that there are so many interactions that shouldn't be conversational interactions and that could be very useful. Basically, we have the firehose of information of those companies, and there are not going to be that many companies that capture the firehose of information. When you have the firehose of information, you can do a ton of stuff with models that are not just accelerating people, but giving them superhuman capability, even with the current model capability, because you can just sift through much more information. An example is documentation repair. If I have the firehose of Slack messages and new Notion pages, if somebody says, I own that page, I want to be updated when there is a piece of information that should update that page, this is now possible. You get an email saying, oh, look at that Slack message. It says the opposite of what you have in that paragraph. Maybe you want to update, or just ping that person.
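A minimal sketch of that "documentation repair" loop, assuming a hypothetical contradicts() check that would be an LLM call in practice:

```python
# Toy documentation-repair watcher: scan the message firehose, and when a
# message contradicts a page someone owns, ping the owner. The contradiction
# check is a stub standing in for an LLM judgment.

from dataclasses import dataclass

@dataclass
class Page:
    title: str
    owner: str
    content: str

def contradicts(message: str, page: Page) -> bool:
    # Stand-in for an LLM call: "does this message say the opposite of
    # what the page says?"
    return "deprecated" in message and page.title.lower() in message.lower()

def watch(messages: list[str], pages: list[Page]) -> list[str]:
    alerts = []
    for msg in messages:
        for page in pages:
            if contradicts(msg, page):
                alerts.append(f"@{page.owner}: '{msg}' may invalidate '{page.title}'")
    return alerts

pages = [Page("Billing API", "stan", "POST /v1/charge is the supported endpoint.")]
messages = ["heads up, billing api POST /v1/charge is deprecated as of today"]
print(watch(messages, pages))
```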
I think there is a lot to be explored on the product layer in terms of what it means to interact productively with those models. And that's a problem that's extremely hard and extremely exciting.Swyx [00:48:00]: One thing you keep mentioning about infra work: obviously, Dust is building that infra and serving that in a very consumer-friendly way. You always talk about infra being additional sources, additional connectors. That is very important. But I'm also interested in the vertical infra. There is an orchestrator underlying all these things where you're doing asynchronous work. For example, the simplest one is a cron job. You just schedule things. But also, for if-this-then-that, you have to wait for something to be executed and proceed to the next task. I used to work at an orchestrator as well, Temporal.Stan [00:48:31]: We use Temporal.Swyx [00:48:34]: Oh, you use Temporal? How was the experience? I need the NPS.Stan [00:48:36]: We're doing a customer discovery call now.Swyx [00:48:39]: But you can also complain to me, because I don't work there anymore.Stan [00:48:42]: No, we love Temporal. There are some edges that are a bit rough, surprisingly rough. And you would say, why is it so complicated?Swyx [00:48:49]: It's always versioning.Stan [00:48:50]: Yeah, stuff like that. But we really love it. And we use it for exactly what you said, like managing the entire set of stuff that needs to happen so that, in semi-real time, we get all the updates from Slack or Notion or GitHub into the system. And whenever we see a piece of information go through, we maybe trigger workflows to run agents, because they need to provide alerts to users and stuff like that. And Temporal is great. Love it.
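For illustration, a rough sketch of that sync-and-trigger pattern, written against Temporal's Python SDK (Dust is a TypeScript shop, and the activity names plus the 30-second poll interval here are invented):

```python
# Hypothetical Temporal workflow: durably poll a connector (Slack here) and
# trigger agent runs on new data. Sketch only, not Dust's implementation.

import asyncio
from datetime import timedelta
from temporalio import activity, workflow

@activity.defn
async def fetch_slack_updates(channel: str) -> list[str]:
    return []  # call the Slack API here and return new messages

@activity.defn
async def run_alert_agents(updates: list[str]) -> None:
    pass  # run the agents that alert users about relevant updates

@workflow.defn
class SlackSyncWorkflow:
    @workflow.run
    async def run(self, channel: str) -> None:
        # Temporal gives retries and durability; a production version would
        # also continue-as-new instead of looping forever.
        while True:
            updates = await workflow.execute_activity(
                fetch_slack_updates, channel,
                start_to_close_timeout=timedelta(minutes=5),
            )
            if updates:
                await workflow.execute_activity(
                    run_alert_agents, updates,
                    start_to_close_timeout=timedelta(minutes=5),
                )
            await asyncio.sleep(30)  # durable timer inside a workflow
```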
Swyx [00:49:17]: You haven't evaluated others. You don't want to build your own. You're happy with...Stan [00:49:21]: Oh, no, we're not in the business of replacing Temporal. And Temporal is so... I mean, it, or any other competitive product, they're very general. If it's there, there's an interesting theory about buy versus build. I think in that case, when you're a high-growth company, your buy-build trade-off is very much on the side of buy. Because if you have the capability, you're just going to be saving time, you can focus on your core competency, etc. And it's funny because we're starting to see the post-high-growth companies, post-SKF companies, going back on that trade-off, interestingly. So that's the Klarna news about removing Zendesk and Salesforce. Do you believe that, by the way?Alessio [00:49:56]: Yeah, I did a podcast with them.Stan [00:49:58]: Oh, yeah?Alessio [00:49:58]: It's true.Swyx [00:49:59]: No, no, I know.Stan [00:50:00]: Of course they say it's true,Swyx [00:50:00]: but also how well is it going to go?Stan [00:50:02]: So I'm not talking about deflecting the customer traffic. I'm talking about building AI on top of Salesforce and Zendesk, basically, if I understand correctly. And all of a sudden, your product surface becomes much smaller because you're interacting with an AI system that will take some actions. And so all of a sudden, you don't need the product layer anymore. And you realize that, oh, those things are just databases that I pay a hundred times the price for, right? Because you're a post-SKF company and you have tech capabilities, you are incentivized to reduce your costs, and you have the capability to do so. And then it makes sense to just scratch the SaaS away. So it's interesting that we might see kind of a bad time for SaaS in post-hyper-growth tech companies. So it's still a big market, but it's not that big, because if you're not a tech company, you don't have the capabilities to reduce that cost. If you're a high-growth company, you're always going to be buying, because you go faster with that. But that's an interesting new space, a new category of companies that might remove some SaaS.Swyx [00:51:02]: Yeah, Alessio's firm has an interesting thesis on the future of SaaS in AI.Alessio [00:51:05]: Service as a software, we call it. It's basically like, well, the most extreme is like, why is there any software at all? You know, ideally, it's all a labor interface where you're asking somebody to do something for you, whether that's a person, an AI agent, or whatnot.Stan [00:51:17]: Yeah, yeah, that's interesting.Swyx [00:51:19]: I have to ask. Are you paying for Temporal Cloud, or are you self-hosting?Stan [00:51:22]: Oh, no, no, we're paying, we're paying.Swyx [00:51:24]: Oh, okay, interesting.Stan [00:51:26]: We're paying way too much. It's crazy expensive, but it makes us-Swyx [00:51:28]: That's why as a shareholder, I like to hear that.Stan [00:51:31]: It makes us go faster, so we're happy to pay.Swyx [00:51:33]: Other things in the infra stack, I just want a list for other founders to think about. Ops, API gateway, evals, you know, anything interesting there that you build or buy?Stan [00:51:41]: I mean, there's always an interesting question. We've been building a lot around the interface between models, because Dust, the original version, was an orchestration platform, and we basically provide a unified interface to all the model providers.Swyx [00:51:56]: That's what I call gateway.Stan [00:51:57]: We had that because Dust was that, and so we continued building upon it, and we own it. But that's an interesting question: do you want to build that or buy it?Swyx [00:52:06]: Yeah, I always say LiteLLM is the current open source consensus.Stan [00:52:09]: Exactly, yeah. There's an interesting question there.Swyx [00:52:12]: Ops, Datadog, just tracking.Stan [00:52:14]: Oh yeah, so Datadog is an obvious one... What are the mistakes that I regret? I started as pure JavaScript, not TypeScript, and I think if you're wondering, oh, I want to go fast, I'll do a little bit of JavaScript: no, don't, just start with TypeScript. I see, okay.Swyx [00:52:30]: So interesting, you are a research engineer that came out of OpenAI that bet on TypeScript.Stan [00:52:36]: Well, the reality is that if you're building a product, you're going to be doing a lot of JavaScript, right? And Next, we're using Next as an example. It's
It's the Red Team vs the Blue Team in the US's greatest spectator sport: The 2024 Presidential Election! Tom and Chuck spent the first part of the evening over on the Two Cops One Donut podcast and then asked the hosts of that show, Erik and Banning, to join us on Locker Room LIVE for the election results and to talk about what they do on their show. Remember to like, subscribe, and leave a review to help us grow the podcast. Go to www.warstoriesofficial.com to listen to older episodes or to support us by buying our merch. You can also support us at https://patron.podbean.com/warstories... and follow us on Instagram @war_stories_official and Facebook at www.facebook.com/WarStoriesOfficialPodcast
Here are the 3 Big things you need to know this hour— Number One— The markets are fired up for Donald Trump to be President—now he has to make good on the promises he made to America—and that work is already underway I can tell you that— Number Two— For the first time—the voter turnout model had more independents than Democrats—maybe that's why so many of the pollsters really blew it this time around—or maybe they are just dishonest— Number Three— To be honest—I cannot get enough of the liberal meltdowns following the collapse of Kamala Harris and the Democrats all over America—that appear to have made way for a trifecta in Washington for the Red Team—
Alethe Denis, Senior Security Consultant at Bishop Fox, is a red team hacker, physical pen tester, and social engineer. In this episode, she joins host Heather Engel to discuss her work, including how she prepares for social engineering engagements, common vulnerabilities encountered, and how the cybersecurity threat landscape continues to evolve. For more information about Alethe, visit https://linktr.ee/alethedenis. • For more on cybersecurity, visit https://cybersecurityventures.com/
Apologies for lower audio quality; we lost recordings and had to use backup tracks. Our guests today are Anastasios Angelopoulos and Wei-Lin Chiang, leads of Chatbot Arena, fka LMSYS, the crowdsourced AI evaluation platform developed by the LMSys student club at Berkeley, which became the de facto standard for comparing language models. To many folks, Arena Elo is more often cited than MMLU scores, and they have attracted >1,000,000 people to cast votes since its launch, leading top model trainers to cite them over their own formal academic benchmarks:
The Limits of Static Benchmarks
We've done two benchmarks episodes: Benchmarks 101 and Benchmarks 201. One issue we've always brought up with static benchmarks is that 1) many are getting saturated, with models scoring almost perfectly on them, and 2) they often don't reflect production use cases, making it hard for developers and users to use them as guidance. The fundamental challenge in AI evaluation isn't technical - it's philosophical. How do you measure something that increasingly resembles human intelligence? Rather than trying to define intelligence upfront, Arena lets users interact naturally with models and collects comparative feedback. It's messy and subjective, but that's precisely the point - it captures the full spectrum of what people actually care about when using AI.
The Pareto Frontier of Cost vs Intelligence
Because the Elo scores are remarkably stable over time, we can put all the chat models on a map against their respective cost to gain a view of at least 3 orders of magnitude of model sizes/costs and observe the remarkable shift in intelligence per dollar over the past year:
This frontier stood remarkably firm through the recent releases of o1-preview and price cuts of Gemini 1.5:
The Statistics of Subjectivity
In our Benchmarks 201 episode, Clémentine Fourrier from HuggingFace thought this design choice was one of the shortcomings of arenas: they aren't reproducible. You don't know who ranked what and what exactly the outcome was at the time of ranking. That same person might rank the same pair of outputs differently on a different day, or might ask harder questions of better models compared to smaller ones, making it imbalanced. Another argument that people have brought up is confirmation bias. We know humans prefer longer responses and are swayed by formatting; Rob Mulla from Dreadnode had found some interesting data on this in May:
The approach LMArena is taking is to use logistic regression to decompose human preferences into constituent factors. As Anastasios explains: "We can say what components of style contribute to human preference and how they contribute." By adding these style components as parameters, they can mathematically "suck out" their influence and isolate the core model capabilities. This extends beyond just style - they can control for any measurable factor: "What if I want to look at the cost adjusted performance? Parameter count? We can ex post facto measure that." This is one of the most interesting things about Arena: you have a data generation engine which you can clean and turn into leaderboards later. If you wanted to create a leaderboard for poetry writing, you could get existing data from Arena and normalize it by identifying these style components. Whether or not it's possible to really understand WHAT bias the voters have, that's a different question.
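A compact illustration of that decomposition: fit a Bradley-Terry-style logistic regression where each battle contributes model-indicator differences plus a style feature (here, response-length difference), so the style coefficient absorbs the formatting effect and the model coefficients come closer to core capability. Toy data and code, not LMArena's actual pipeline:

```python
# Toy Bradley-Terry fit with style control. Each battle becomes one feature
# row: +1 for the left model, -1 for the right model, plus a normalized
# length-difference feature; the label is whether the left side won.

import numpy as np
from sklearn.linear_model import LogisticRegression

models = ["model-a", "model-b", "model-c"]
idx = {m: i for i, m in enumerate(models)}

# (left model, right model, left response length, right length, left won?)
battles = [
    ("model-a", "model-b", 900, 300, 1),
    ("model-b", "model-c", 400, 380, 1),
    ("model-a", "model-c", 200, 800, 0),
    ("model-c", "model-b", 500, 100, 1),
]

X, y = [], []
for left, right, len_l, len_r, left_won in battles:
    row = np.zeros(len(models) + 1)
    row[idx[left]] += 1.0                # left model indicator
    row[idx[right]] -= 1.0               # right model indicator
    row[-1] = (len_l - len_r) / 1000.0   # style feature: length difference
    X.append(row)
    y.append(left_won)

clf = LogisticRegression(fit_intercept=False).fit(np.array(X), np.array(y))
strengths = clf.coef_[0][: len(models)]  # style-adjusted model strengths
print(dict(zip(models, strengths.round(2))))
print("length effect:", round(clf.coef_[0][-1], 2))
```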
Private Evals
One of the most delicate challenges LMSYS faces is maintaining trust while collaborating with AI labs. The concern is that labs could game the system by testing multiple variants privately and only releasing the best performer. This was brought up when 4o-mini was released and it ranked as the second best model on the leaderboard:
But this fear misunderstands how Arena works. Unlike static benchmarks where selection bias is a major issue, Arena's live nature means any initial bias gets washed out by ongoing evaluation. As Anastasios explains: "In the long run, there's way more fresh data than there is data that was used to compare these five models." The other big question is WHAT model is actually being tested; as people often talk about on X / Discord, the same endpoint will randomly feel "nerfed", like it happened for "Claude European summer", with corresponding conspiracy theories:
It's hard to keep track of these performance changes in Arena as these changes (if real…?) are not observable.
The Future of Evaluation
The team's latest work on RouteLLM points to an interesting future where evaluation becomes more granular and task-specific. But they maintain that even simple routing strategies can be powerful - like directing complex queries to larger models while handling simple tasks with smaller ones. Arena is now going to expand beyond text into multimodal evaluation and specialized domains like code execution and red teaming. But their core insight remains: the best way to evaluate intelligence isn't to simplify it into metrics, but to embrace its complexity and find rigorous ways to analyze it. To go after this vision, they are spinning out Arena from LMSys, which will stay as an academia-driven group at Berkeley.
Full Video Podcast
Chapters
* 00:00:00 - Introductions
* 00:01:16 - Origin and development of Chatbot Arena
* 00:05:41 - Static benchmarks vs. Arenas
* 00:09:03 - Community building
* 00:13:32 - Biases in human preference evaluation
* 00:18:27 - Style Control and Model Categories
* 00:26:06 - Impact of o1
* 00:29:15 - Collaborating with AI labs
* 00:34:51 - RouteLLM and router models
* 00:38:09 - Future of LMSys / Arena
Show Notes
* Anastasios Angelopoulos
* Anastasios' NeurIPS Paper Conformal Risk Control
* Wei-Lin Chiang
* Chatbot Arena
* LMSys
* MTBench
* ShareGPT dataset
* Stanford's Alpaca project
* LLMRouter
* E2B
* Dreadnode
Transcript
Alessio [00:00:00]: Hey everyone, welcome to the Latent Space podcast. This is Alessio, Partner and CTO in Residence at Decibel Partners, and I'm joined by my co-host Swyx, founder of Smol.ai.Swyx [00:00:14]: Hey, and today we're very happy and excited to welcome Anastasios and Wei-Lin from LMSys. Welcome guys.Wei Lin [00:00:21]: Hey, how's it going? Nice to see you.Anastasios [00:00:23]: Thanks for having us.Swyx [00:00:24]: Anastasios, I actually saw you, I think at last year's NeurIPS. You were presenting a paper, which I don't really super understand, but it was some theory paper about how your method was very dominating over other sorts of search methods. I don't remember what it was, but I remember that you were a very confident speaker.Anastasios [00:00:40]: Oh, I totally remember you. Didn't ever connect that, but yes, that's definitely true. Yeah. Nice to see you again.Swyx [00:00:46]: Yeah. I was frantically looking for the name of your paper and I couldn't find it. Basically I had to cut it because I didn't understand it.Anastasios [00:00:51]: Is this conformal PID control or was this the online control?Wei Lin [00:00:55]: Blast from the past, man.Swyx [00:00:57]: Blast from the past.
It's always interesting how NeurIPS and all these academic conferences are sort of six months behind what people are actually doing, but conformal risk control, I would recommend people check it out. I have the recording. I just never published it, just because I was like, I don't understand this enough to explain it.Anastasios [00:01:14]: People won't be interested.Wei Lin [00:01:15]: It's all good.Swyx [00:01:16]: But Elo scores, Elo scores are very easy to understand. You guys are responsible for the biggest revolution in language model benchmarking in the last few years. Maybe you guys want to introduce yourselves and maybe tell a little bit of the brief history of LMSysWei Lin [00:01:32]: Hey, I'm Wei Lin. I'm a fifth-year PhD student at UC Berkeley, working on Chatbot Arena these days, doing crowdsourced AI benchmarking.Anastasios [00:01:43]: I'm Anastasios. I'm a sixth-year PhD student here at Berkeley. I did most of my PhD on, like, theoretical statistics and sort of foundations of model evaluation and testing. And now I'm working 150% on this Chatbot Arena stuff. It's great.Alessio [00:02:00]: And what was the origin of it? How did you come up with the idea? How did you get people to buy in? And then maybe what were one or two of the pivotal moments early on that kind of made it the standard for these things?Wei Lin [00:02:12]: Yeah, yeah. The Chatbot Arena project was started last year in April, May, around then. Before that, we were basically experimenting in the lab with how to fine-tune an open-source chatbot based on the Llama 1 model that Meta released. At that time, Llama 1 was like a base model, and people didn't really know how to fine-tune it. So we were doing some explorations. We were inspired by Stanford's Alpaca project. So we basically, yeah, grew a data set from the internet, which is called the ShareGPT data set, which is like a dialogue data set of conversations between users and ChatGPT. It turns out to be pretty high quality dialogue data. So we fine-tuned on it, and then we trained it and released the model called Vicuna. And people were very excited about it, because it kind of demonstrated that an open-weight model can reach conversation capability similar to ChatGPT. And then we basically released the model weights and also built a demo website for the model. People were very excited about it. But during the development, the biggest challenge for us at the time was, how do we even evaluate it? How do we even argue that this model we trained is better than others? And then, what's the gap between this open-source model and the proprietary offerings? At that time, GPT-4 was just announced, and there was Claude 1. What's the difference between them? And then after that, like every week, there's a new model being fine-tuned and released. So even until now, right? And then we had that demo website for Vicuna. And then we thought, okay, maybe we can add a few more models as well, like API models as well. And then we quickly realized that people need a tool to compare between different models. So we had a side-by-side UI implemented on the website so that people could, you know, compare. And we quickly realized that maybe we can do something like a battle on top of these LLMs, like just anonymize the identity, and people vote on which one is better. So the community decides which one is better, not us, not us arguing, you know, our model is better or what. And that turned out to be, like, people were very excited about this idea.
And then we tweet, we launch, and that's, yeah, that's April, May. And then in like the first two, three weeks, there were just a few hundred thousand views on our launch tweet. And then we had regular weekly updates at the beginning, adding new models, GPT-4 as well. So it was like, that was the, you know, the initial.Anastasios [00:04:58]: Another pivotal moment, just to jump in, would be private models, like the GPT, I'm a little,Wei Lin [00:05:04]: I'm a little chatty. That was this year. That was this year.Anastasios [00:05:07]: Huge.Wei Lin [00:05:08]: That was also huge.Alessio [00:05:09]: In the beginning, I saw the initial release was May 3rd of the beta board. On April 6, we did a benchmarks 101 episode for a podcast, just kind of talking about, you know, how so much of the data is like in the pre-training corpus and blah, blah, blah. And like the benchmarks are really not what we need to evaluate whether or not a model is good. Why did you not make a benchmark? Maybe at the time, you know, it was just like, Hey, let's just put together a whole bunch of data again, run a, make a score that seems much easier than coming out with a whole website where like users need to vote. Any thoughts behind that?Wei Lin [00:05:41]: I think it's more like fundamentally, we don't know how to automate this kind of benchmarks when it's more like, you know, conversational, multi-turn, and more open-ended tasks that may not come with a ground truth. So let's say if you ask a model to help you write an email for you for whatever purpose, there's no ground truth. How do you score them? Or write a story or a creative story or many other things like how we use ChatGPT these days. It's more open-ended. You know, we need human in the loop to give us feedback, which one is better. And I think the nuance here is like, sometimes it's also hard for a human to give the absolute rating. So that's why we have this kind of pairwise comparison, easier for people to choose which one is better. So from that, we use these pairwise comparisons to calculate the leaderboard. Yeah. You can add more about this methodology.Anastasios [00:06:40]: Yeah. I think the point is that, and you guys probably also talked about this at some point, but static benchmarks are intrinsically, to some extent, unable to measure generative model performance. And the reason is because you cannot pre-annotate all the outputs of a generative model. You change the model, it's like the distribution of your data is changing. You need new labels to deal with that, or great automated labeling, right? Which is why people are pursuing both. And yeah, static benchmarks, they allow you to zoom in to particular types of information like factuality, historical facts. We can build the best benchmark of historical facts, and we will then know that the model is great at historical facts. But ultimately, that's not the only axis, right? And we can build 50 of them, and we can evaluate 50 axes. But it's just so, the problem of generative model evaluation is just so expansive, and it's so subjective, that it's maybe not intrinsically impossible, but at least we don't see a way. We didn't see a way of encoding that into a fixed benchmark.Wei Lin [00:07:47]: But on the other hand, I think there's a challenge where this kind of online dynamic benchmark is more expensive than a static benchmark, an offline benchmark, which people still need.
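To make the pairwise-comparison methodology concrete, here is a minimal, illustrative sketch of how win/loss votes can be turned into Bradley-Terry scores via logistic regression, which is the general approach behind Arena-style leaderboards. The vote list and model names below are invented placeholders, not Arena data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy pairwise votes: (model_a, model_b, winner), winner in {"a", "b"}.
# These are invented placeholders, not real Arena data.
votes = [
    ("model-x", "model-y", "a"),
    ("model-x", "model-z", "a"),
    ("model-y", "model-z", "a"),
    ("model-y", "model-x", "b"),
    ("model-z", "model-x", "b"),
    ("model-z", "model-y", "a"),
]

models = sorted({m for a, b, _ in votes for m in (a, b)})
idx = {m: i for i, m in enumerate(models)}

# Each battle becomes one row: +1 for model A, -1 for model B, and the
# target is whether model A won. Bradley-Terry fitting is then just a
# logistic regression on these indicator differences.
X = np.zeros((len(votes), len(models)))
y = np.zeros(len(votes))
for r, (a, b, w) in enumerate(votes):
    X[r, idx[a]], X[r, idx[b]] = 1.0, -1.0
    y[r] = 1.0 if w == "a" else 0.0

bt = LogisticRegression(fit_intercept=False)  # default L2 penalty keeps coefficients finite
bt.fit(X, y)

# Rescale the coefficients onto an Elo-like scale purely for readability.
scale = 400 / np.log(10)
for name in sorted(models, key=lambda m: -bt.coef_[0][idx[m]]):
    print(f"{name:10s} {1000 + scale * bt.coef_[0][idx[name]]:8.1f}")
```

Because the fit is re-run as fresh votes arrive, any one model's rating keeps updating, which is the "live benchmark" property discussed throughout this conversation.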
Like when they build models, they need static benchmark to track where they are.Anastasios [00:08:03]: It's not like our benchmark is uniformly better than all other benchmarks, right? It just measures a different kind of performance that has proved to be useful.Swyx [00:08:14]: You guys also published MTBench as well, which is a static version, let's say, of Chatbot Arena, right? That people can actually use in their development of models.Wei Lin [00:08:25]: Right. I think one of the reasons we still do this static benchmark, we still wanted to explore, experiment whether we can automate this, because people, eventually, model developers need it to fast iterate their model. So that's why we explored LLM-as-a-judge, and Arena-Hard, trying to filter, select high-quality data we collected from Chatbot Arena, the high-quality subset, and use that as the questions and then automate the judge pipeline, so that people can quickly get high-quality signal, benchmark signals, using this online benchmark.Swyx [00:09:03]: As a community builder, I'm curious about just the initial early days. Obviously when you offer effectively free A-B testing inference for people, people will come and use your arena. What do you think were the key unlocks for you? Was it funding for this arena? Was it marketing? When people came in, do you see a noticeable skew in the data? Which obviously now you have enough data sets, you can separate things out, like coding and hard prompts, but in the early days, it was just all sorts of things.Anastasios [00:09:31]: Yeah, maybe one thing to establish at first is that our philosophy has always been to maximize organic use. I think that really does speak to your point, which is, yeah, why do people come? They came to use free LLM inference, right? And also, a lot of users just come to the website to use direct chat, because you can chat with the model for free. And then you could think about it like, hey, let's just be kind of like more on the selfish or conservative or protectionist side and say, no, we're only giving credits for people that battle or so on and so forth. That strategy wouldn't work, right? Because what we're trying to build is like a big funnel, a big funnel that can direct people. And some people are passionate and interested and they battle. And yes, the distribution of the people that do that is different. It's like, as you're pointing out, it's like, that's because they're enthusiastic.Wei Lin [00:10:24]: They're early adopters of this technology.Anastasios [00:10:27]: Or they like games, you know, people like this. And we've run a couple of surveys that indicate this as well, of our user base.Wei Lin [00:10:36]: We do see a lot of developers come to the site asking polling questions, 20-30%. Yeah, 20-30%.Anastasios [00:10:42]: It's obviously not reflective of the general population, but it's reflective of some corner of the world of people that really care. And to some extent, maybe that's all right, because those are like the power users. And you know, we're not trying to claim that we represent the world, right? We represent the people that come and vote.Swyx [00:11:02]: Did you have to do anything marketing-wise? Was anything effective? Did you struggle at all? Was it success from day one?Wei Lin [00:11:09]: At some point, almost done. Okay. Because as you can imagine, this leaderboard depends on community engagement participation.
If no one comes to vote tomorrow, then no leaderboard.Anastasios [00:11:23]: So we had some period of time when the number of users was just, after the initial launch, it went lower. Yeah. And, you know, at some point, it did not look promising. Actually, I joined the project a couple months in to do the statistical aspects, right? As you can imagine, that's how it kind of hooked into my previous work. At that time, it wasn't like, you know, it definitely wasn't clear that this was like going to be the eval or something. It was just like, oh, this is a cool project. Like Wei-Lin seems awesome, you know, and that's it.Wei Lin [00:11:56]: Definitely. There's, in the beginning, because people don't know us, people don't know what this is for. So we had a hard time. But I think we were lucky enough that we have some initial momentum. And as well as the competition between model providers just becoming, you know, became very intense. Intense. And then that makes the eval onto us, right? Because always number one is number one.Anastasios [00:12:23]: There's also an element of trust. Our main priority in everything we do is trust. We want to make sure we're doing everything like all the I's are dotted and the T's are crossed and nobody gets unfair treatment and people can see from our profiles and from our previous work and from whatever, you know, we're trustworthy people. We're not like trying to make a buck and we're not trying to become famous off of this or that. It's just, we're trying to provide a great public leaderboard community venture project.Wei Lin [00:12:51]: Yeah.Swyx [00:12:52]: Yes. I mean, you are kind of famous now, you know, that's fine. Just to dive in more into biases and, you know, some of this is like statistical control. The classic one for human preference evaluation is humans demonstrably prefer longer contexts or longer outputs, which is actually something that we don't necessarily want. You guys, I think maybe two months ago put out some length control studies. Apart from that, there are just other documented biases. Like, I'd just be interested in your review of what you've learned about biases and maybe a little bit about how you've controlled for them.Anastasios [00:13:32]: At a very high level, yeah. Humans are biased. Totally agree. Like in various ways. It's not clear whether that's good or bad, you know, we try not to make value judgments about these things. We just try to describe them as they are. And our approach is always as follows. We collect organic data and then we take that data and we mine it to get whatever insights we can get. And, you know, we have many millions of data points that we can now use to extract insights from. Now, one of those insights is to ask the question, what is the effect of style, right? You have a bunch of data, you have votes, people are voting either which way. We have all the conversations. We can say what components of style contribute to human preference and how do they contribute? Now, that's an important question. Why is that an important question? It's important because some people want to see which model would be better if the lengths of the responses were the same, were to be the same, right? People want to see the causal effect of the model's identity controlled for length or controlled for markdown, number of headers, bulleted lists, is the text bold? Some people don't, they just don't care about that.
The idea is not to impose the judgment that this is not important, but rather to say ex post facto, can we analyze our data in a way that decouples all the different factors that go into human preference? Now, the way we do this is via statistical regression. That is to say the arena score that we show on our leaderboard is a particular type of linear model, right? It's a linear model that takes, it's a logistic regression that takes model identities and fits them against human preference, right? So it regresses human preference against model identity. What you get at the end of that logistic regression is a parameter vector of coefficients. And when the coefficient is large, it tells you that GPT-4o or whatever, very large coefficient, that means it's strong. And that's exactly what we report in the table. It's just the predictive effect of the model identity on the vote. The other thing that you can do is you can take that vector, let's say we have M models, that is an M dimensional vector of coefficients. What you can do is you say, hey, I also want to understand what the effect of length is. So I'll add another entry to that vector, which is trying to predict the vote, right? That tells me the difference in length between two model responses. So we have that for all of our data. We can compute it ex post facto. We added it into the regression and we look at that predictive effect. And then the idea, and this is formally true under certain conditions, not always verifiable ones, but the idea is that adding that extra coefficient to this vector will kind of suck out the predictive power of length and put it into that M plus first coefficient and quote, unquote, de-bias the rest so that the effect of length is not included. And that's what we do in style control. Now we don't just do it for M plus one. We have, you know, five, six different style components that have to do with markdown headers and bulleted lists and so on that we add here. Now, where is this going? You guys see the idea. It's a general methodology. If you have something that's sort of like a nuisance parameter, something that exists and provides predictive value, but you really don't want to estimate that. You want to remove its effect. In causal inference, these things are called like confounders often. What you can do is you can model the effect. You can put them into your model and try to adjust for them. So another one of those things might be cost. You know, what if I want to look at the cost adjusted performance of my model, which models are punching above their weight, parameter count, which models are punching above their weight in terms of parameter count, we can ex post facto measure that. We can do it without introducing anything that compromises the organic nature of the Wei Lin [00:17:17]: data that we collect.Anastasios [00:17:18]: Hopefully that answers the question.Wei Lin [00:17:20]: It does.Swyx [00:17:21]: So I guess with a background in econometrics, this is super familiar.Anastasios [00:17:25]: You're probably better at this than me for sure.Swyx [00:17:27]: Well, I mean, so I used to be, you know, a quantitative trader and so, you know, controlling for multiple effects on stock price is effectively the job. So it's interesting. Obviously the problem is proving causation, which is hard, but you don't have to do that.Anastasios [00:17:45]: Yes. Yes, that's right. And causal inference is a hard problem and it goes beyond statistics, right?
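Here is a minimal sketch of the style-control regression Anastasios describes, run on synthetic data: two models are equally strong, but one writes longer answers and voters reward length; appending the length difference as an extra nuisance coefficient pulls that effect out of the model-identity coefficients. All numbers below are simulated for illustration, not Arena's actual pipeline:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, m = 20000, 3                        # number of votes, number of models
strength = np.array([0.0, 0.6, 0.6])   # models 1 and 2 are equally strong...
verbosity = np.array([0.0, 0.0, 1.0])  # ...but model 2 writes longer answers

a = rng.integers(0, m, n)
b = (a + rng.integers(1, m, n)) % m    # opponent, guaranteed different from a
len_diff = verbosity[a] - verbosity[b] + rng.normal(0, 0.5, n)
logit = strength[a] - strength[b] + 0.8 * len_diff   # voters also reward length
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(float)

def fit_bt(with_style):
    X = np.zeros((n, m + (1 if with_style else 0)))
    X[np.arange(n), a] += 1.0
    X[np.arange(n), b] -= 1.0
    if with_style:
        X[:, -1] = len_diff            # the "M plus first", nuisance coefficient
    coef = LogisticRegression(fit_intercept=False).fit(X, y).coef_[0][:m]
    return np.round(coef - coef.mean(), 2)  # center the scores for comparability

print("raw Bradley-Terry:", fit_bt(False))  # model 2 looks stronger than model 1
print("style-controlled: ", fit_bt(True))   # models 1 and 2 now look comparable
```

The same trick generalizes to any nuisance feature that can be computed per battle, such as markdown density or, as suggested in the conversation, cost or parameter count.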
It's like you have to build the right causal model and so on and so forth. But we think that this is a good first step and we're sort of looking forward to learning from more people. You know, there's some good people at Berkeley that work on causal inference, and we look forward to learning from them on, like, what are the really most contemporary techniques that we can use in order to estimate true causal effects, if possible.Swyx [00:18:10]: Maybe we could take a step through the other categories. So style control is a category. It is not a default. I have thought that when you wrote that blog post, actually, I thought it would be the new default because it seems like the most obvious thing to control for. But you also have other categories, you have coding, you have hard prompts. We consider that.Anastasios [00:18:27]: We're still actively considering it. It's just, you know, once you make that step, once you take that step, you're introducing your opinion and I'm not, you know, why should our opinion be the one? That's kind of a community choice. We could put it to a vote.Wei Lin [00:18:39]: We could pass.Anastasios [00:18:40]: Yeah, maybe do a poll. Maybe do a poll.Swyx [00:18:42]: I don't know. No opinion is an opinion.Wei Lin [00:18:44]: You know what I mean?Swyx [00:18:45]: Yeah.Wei Lin [00:18:46]: There's no neutral choice here.Swyx [00:18:47]: Yeah. You have all these others. You have instruction following too. What are your favorite categories that you like to talk about? Maybe you tell a little bit of the stories, tell a little bit of like the hard choices that you had to make.Wei Lin [00:18:57]: Yeah. Yeah. Yeah. I think the, uh, initially the reason why we want to add these new categories is essentially to answer some of the questions from our community, which is we won't have a single leaderboard for everything. So these models behave very differently in different domains. Let's say this model is trained for coding, this model trained for more technical questions and so on. On the other hand, to answer people's question about like, okay, what if all these low quality, you know, because we crowdsource data from the internet, there will be noise. So how do we de-noise? How do we filter out these low quality data effectively? So that was like, you know, some questions we want to answer. So basically we spent a few months, like really diving into these questions to understand how do we filter all these data because these are like millions of data points. And then if you want to re-label yourself, it's possible, but we need to kind of like automate this kind of data classification pipeline for us to effectively categorize them to different categories, say coding, math, structure, and also harder problems. So that was like, the hope is when we slice the data into these meaningful categories to give people more like better signals, more direct signals, and that's also to clarify what we are actually measuring for, because I think that's the core part of the benchmark. That was the initial motivation. Does that make sense?Anastasios [00:20:27]: Yeah. Also, I'll just say, this does like get back to the point that the philosophy is to like mine organic, to take organic data and then mine it ex post facto.Alessio [00:20:35]: Is the data cage-free too, or just organic?Anastasios [00:20:39]: It's cage-free.Wei Lin [00:20:40]: No GMO. Yeah. And all of these efforts are like open source, like we open source all of the data cleaning pipeline, filtering pipeline.
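One plausible shape for the automated classification pass Wei Lin describes is an LLM-as-a-judge labeling step. The sketch below is only an assumption about the general approach, not LMSys's actual pipeline; the category list, prompt wording, and model name are placeholders, and it assumes an OpenAI API key is configured in the environment:

```python
# Sketch of an automated prompt-classification pass. The category list,
# prompt wording, and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()
CATEGORIES = ["coding", "math", "creative writing", "hard/technical", "other"]

def classify(user_prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Label the user's prompt with exactly one category from: "
                        + ", ".join(CATEGORIES) + ". Reply with the category only."},
            {"role": "user", "content": user_prompt},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in CATEGORIES else "other"

print(classify("Write a Python function that merges two sorted lists."))  # expect: coding
```

Run over the full vote log, a pass like this yields the per-category slices (coding, math, hard prompts) that the leaderboard exposes.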
Yeah.Swyx [00:20:50]: I love the notebooks you guys publish. Actually really good just for learning statistics.Wei Lin [00:20:54]: Yeah. I'll share these insights with everyone.Alessio [00:20:59]: I agree on the initial premise of, Hey, writing an email, writing a story, there's like no ground truth. But I think as you move into like coding and like red teaming, some of these things, there's like kind of like skill levels. So I'm curious how you think about the distribution of skill of the users. Like maybe the top 1% of red teamers is just not participating in the arena. So how do you guys think about adjusting for it? And like feels like this where there's kind of like big differences between the average and the top. Yeah.Anastasios [00:21:29]: Red teaming, of course, red teaming is quite challenging. So, okay. Moving back. There's definitely like some tasks that are not as subjective that like pairwise human preference feedback is not the only signal that you would want to measure. And to some extent, maybe it's useful, but it may be more useful if you give people better tools. For example, it'd be great if we could execute code within Arena, that would be fantastic.Wei Lin [00:21:52]: We want to do it.Anastasios [00:21:53]: There's also this idea of constructing a user leaderboard. What does that mean? That means some users are better than others. And how do we measure that? How do we quantify that? Hard in Chatbot Arena, but where it is easier is in red teaming, because in red teaming, there's an explicit game. You're trying to break the model, you either win or you lose. So what you can do is you can say, Hey, what's really happening here is that the models and humans are playing a game against one another. And then you can use the same sort of Bradley Terry methodology with some extensions that we came up with; you can read one of our recent blog posts for the sort of theoretical extensions. You can attribute like strength back to individual players and jointly attribute strength to like the models that are in this jailbreaking game, along with the target tasks, like what types of jailbreaks you want.Wei Lin [00:22:44]: So yeah.Anastasios [00:22:45]: And I think that this is, this is a hugely important and interesting avenue that we want to continue researching. We have some initial ideas, but you know, all thoughts are welcome.Wei Lin [00:22:54]: Yeah.Alessio [00:22:55]: So first of all, on the code execution, the E2B guys, I'm sure they'll be happy to help Wei Lin [00:22:59]: you.Alessio [00:23:00]: I'll set that up. They're big fans. We're investors in a company called Dreadnode, which does a lot in AI red teaming. I think to me, the most interesting thing has been, how do you do this? Like the model jailbreak is one side. We also had Nicholas Carlini from DeepMind on the podcast, and he was talking about, for example, like, you know, context stealing and like weights stealing. So there's kind of like a lot more that goes around it. I'm curious just how you think about the model and then maybe like the broader system, even with Red Team Arena, you're just focused on like jailbreaking of the model, right?
You're not doing kind of like any testing on the more system level thing of the model where like, maybe you can get the training data back, you're going to exfiltrate some of the layers and the weights and things like that.Wei Lin [00:23:43]: So right now, as you can see, the Red Team Arena is at a very early stage and we are still exploring what could be the potential new games we can introduce to the platform. So the idea is still the same, right? And we build a community driven project platform for people. They can have fun with this website, for sure. That's one thing, and then help everyone to test these models. So one of the aspects you mentioned is stealing secrets, stealing training sets. That could be one, you know, it could be designed as a game. Say, can you steal the credential, you know, we hide, maybe we can hide the credential in the system prompts and so on. So there are like a few potential ideas we want to explore for sure. Do you want to add more?Anastasios [00:24:28]: I think that this is great. This idea is a great one. There's a lot of great ideas in the Red Teaming space. You know, I'm not personally like a Red Teamer. I don't like go around and Red Team models, but there are people that do that and they're awesome. They're super skilled. When I think about the Red Team arena, I think those are really the people that we're building it for. Like, we want to make them excited and happy, build tools that they like. And just like Chatbot Arena, we'll trust that this will end up being useful for the world. And all these people are, you know, I would say all these people in this community are actually good hearted, right? They're not doing it because they want to like see the world burn. They're doing it because they like, think it's fun and cool. And yeah. Okay. Maybe they want to see, maybe they want a little bit.Wei Lin [00:25:13]: I don't know. Majority.Anastasios [00:25:15]: Yeah.Wei Lin [00:25:16]: You know what I'm saying.Anastasios [00:25:17]: So, you know, trying to figure out how to serve them best, I think, I don't know where that fits. I just, I'm not expressing. And give them credits, right?Wei Lin [00:25:24]: And give them credit.Anastasios [00:25:25]: Yeah. Yeah. So I'm not trying to express any particular value judgment here as to whether that's the right next step. It's just, that's sort of the way that I think we would think about it.Swyx [00:25:35]: Yeah. We also talked to Sander Schulhoff of the HackAPrompt competition, and he's pretty interested in Red Teaming at scale. Let's just call it that. You guys maybe want to talk with him.Wei Lin [00:25:45]: Oh, nice.Swyx [00:25:46]: We wanted to cover a little, a few topical things and then go into the other stuff that your group is doing. You know, you're not just running Chatbot Arena. We can also talk about the new website and your future plans, but I just wanted to briefly focus on o1. It is the hottest, latest model. Obviously, you guys already have it on the leaderboard. What is the impact of o1 on your evals?Wei Lin [00:26:06]: Made our interface slower.Anastasios [00:26:07]: It made it slower.Swyx [00:26:08]: Yeah.Wei Lin [00:26:10]: Because it needs like 30, 60 seconds, sometimes even more, the latency is like higher. So that's one. Sure. But I think we observe very interesting things from this model as well. Like we observe like significant improvement in certain categories, like more technical or math.
Yeah.Anastasios [00:26:32]: I think actually like one takeaway that was encouraging is that I think a lot of people before the o1 release were thinking, oh, like this benchmark is saturated. And why were they thinking that? They were thinking that because there was a bunch of models that were kind of at the same level. They were just kind of like incrementally competing and it sort of wasn't immediately obvious that any of them were any better. Nobody, including any individual person, it's hard to tell. But what o1 did is it was, it's clearly a better model for certain tasks. I mean, I used it for like proving some theorems and you know, there's some theorems that like only I know because I still do a little bit of theory. Right. So it's like, I can go in there and ask like, oh, how would you prove this exact thing? Which I can tell you has never been in the public domain. It'll do it. It's like, what?Wei Lin [00:27:19]: Okay.Anastasios [00:27:20]: So there's this model and it crushed the benchmark. You know, it's just like really like a big gap. And what that's telling us is that it's not saturated yet. It's still measuring some signal. That was encouraging. The point, the takeaway is that the benchmark is comparative. There's no absolute number. There's no maximum ELO. It's just like, if you're better than the rest, then you win. I think that was actually quite helpful to us.Swyx [00:27:46]: I think people were criticizing, I saw some of the academics criticizing it as not apples to apples. Right. Like, because it can take more time to reason, it's basically doing some search, doing some chain of thought that if you actually let the other models do that same thing, they might do better.Wei Lin [00:28:03]: Absolutely.Anastasios [00:28:04]: To be clear, none of the leaderboard currently is apples to apples because you have like Gemini Flash, you have, you know, all sorts of tiny models like Llama 8B, like 8B and 405B are not apples to apples.Wei Lin [00:28:19]: Totally agree. They have different latencies.Anastasios [00:28:21]: Different latencies.Wei Lin [00:28:22]: Control for latency. Yeah.Anastasios [00:28:24]: Latency control. That's another thing. We can do style control, but latency control. You know, things like this are important if you want to understand the trade-offs involved in using AI.Swyx [00:28:34]: o1 is a developing story. We still haven't seen the full model yet, but it's definitely a very exciting new paradigm. I think one community controversy I just wanted to give you guys space to address is the collaboration between you and the large model labs. People have been suspicious, let's just say, about how they choose to A-B test on you. I'll state the argument and let you respond, which is basically they run like five anonymous models and basically argmax their Elo on LMSys or Chatbot Arena, and they release the best one. Right? What has been your end of the controversy? How have you decided to clarify your policy going forward?Wei Lin [00:29:15]: On a high level, I think our goal here is to build a fast eval for everyone, and including everyone in the community can see the leaderboard and understand, compare the models. More importantly, I think we want to build the best eval also for model builders, like all these frontier labs building models. They're also internally facing a challenge, which is how do they eval the model? That's the reason why we want to partner with all the frontier lab people, and then to help them testing. That's one of the...
We want to solve this technical challenge, which is eval. Yeah.Anastasios [00:29:54]: I mean, ideally, it benefits everyone, right?Wei Lin [00:29:56]: Yeah.Anastasios [00:29:57]: And people also are interested in seeing the leading edge of the models. People in the community seem to like that. Oh, there's a new model up. Is this strawberry? People are excited. People are interested. Yeah. And then there's this question that you bring up of, is it actually causing harm?Wei Lin [00:30:15]: Right?Anastasios [00:30:16]: Is it causing harm to the benchmark that we are allowing this private testing to happen? Maybe stepping back, why do you have that instinct? The reason why you and others in the community have that instinct is because when you look at something like a benchmark, like ImageNet, a static benchmark, what happens is that if I give you a million different models that are all slightly different, and I pick the best one, there's something called selection bias that plays in, which is that the performance of the winning model is overstated. This is also sometimes called the winner's curse. And that's because statistical fluctuations in the evaluation, they're driving which model gets selected as the top. So this selection bias can be a problem. Now there's a couple of things that make this benchmark slightly different. So first of all, the selection bias that you incur when you're only testing five models is normally empirically small.Wei Lin [00:31:12]: And that's why we have these confidence intervals constructed.Anastasios [00:31:16]: That's right. Yeah. Our confidence intervals are actually not multiplicity adjusted. One thing that we could do immediately tomorrow in order to address this concern is if a model provider is testing five models and they want to release one, and we're constructing the intervals at level one minus alpha, we can just construct the intervals instead at level one minus alpha divided by five. That's called Bonferroni correction. What that'll tell you is that the final performance of the model, the interval that gets constructed, is actually formally correct. We don't do that right now, partially because we know from simulations that the amount of selection bias you incur with these five things is just not huge. It's not huge in comparison to the variability that you get from just regular human voters. So that's one thing. But then the second thing is the benchmark is live, right? So what ends up happening is it'll be a small magnitude, but even if you suffer from the winner's curse after testing these five models, what'll happen is that over time, because we're getting new data, it'll get adjusted down. So if there's any bias that gets introduced at that stage, in the long run, it actually doesn't matter. Because asymptotically, basically in the long run, there's way more fresh data than there is data that was used to compare these five models against these private models.Swyx [00:32:35]: The announcement effect is only just the first phase and it has a long tail.Anastasios [00:32:39]: Yeah, that's right. And it sort of like automatically corrects itself for this selection adjustment.Swyx [00:32:45]: Every month, I do a little chart of LLM ELO versus cost, just to track the price per ELO point, the amount of like, how much money do I have to pay for one incremental point in ELO? And so I actually observe an interesting stability in most of the ELO numbers, except for some of them. For example, GPT-4o August has fallen from 12.90
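To make the Bonferroni correction Anastasios describes concrete: if a provider privately tests five variants and you want overall 95% confidence, each interval is constructed at level 1 - 0.05/5, i.e. 99%, which simply widens it. A small worked sketch, with a made-up score and standard error for illustration:

```python
from scipy import stats

alpha, k = 0.05, 5                            # overall error rate, variants tested
z_plain = stats.norm.ppf(1 - alpha / 2)       # ~1.96 for a single 95% interval
z_bonf = stats.norm.ppf(1 - alpha / (2 * k))  # ~2.58 after Bonferroni correction

score, se = 1280.0, 3.0                       # hypothetical Arena score and standard error
for name, z in [("unadjusted", z_plain), ("Bonferroni", z_bonf)]:
    print(f"{name}: [{score - z * se:.1f}, {score + z * se:.1f}]")
```

The wider Bonferroni interval is what makes the reported score "formally correct" even after picking the best of the five variants; as noted in the conversation, the live stream of fresh votes then shrinks any residual winner's-curse bias over time.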
"On a piégé plus de machines que de gens à qui on avait envoyé le fichier piégé" Le D.E.V. de la semaine est Alain Mowat, Head of Research and Development @ Orange Cyberdefense. Alain nous raconte comment il est entré dans le monde de la cyberdéfense via Insomni'hack (qui existe toujours avec Orange Cyberdefense), ainsi que son activité au sein d'Orange Cyberdefense. Il nous parle de RedTeam, et d'histoires croustillantes, mais surtout de navigateurs. Nous creusons avec lui les heures sombres des navigateurs et leurs multiples failles, en bonne partie aujourd'hui corrigées. Et cela permet de voir les attaques qui peuvent encore subsister dans nos environnements web. Un épisode indispensable à écouter si vous voulez en savoir plus sur les points de vigilance à avoir ! Liens évoqués pendant l'émission site en ligne pour pratiquer les vul W3Challs Root Me Hack The Box"This is how they tell me the world ends" **Lancez votre boutique en ligne avec Shopify** Vous cherchez à lancer un site e-commerce rapidement et sans tracas techniques ? Shopify est la solution SaaS idéale pour créer une boutique en ligne professionnelle en quelques clics. Plus besoin de partir de zéro ou de gérer des frameworks complexes ! Shopify s'occupe de tout, de l'hébergement aux mises à jour, tout en offrant une flexibilité totale pour adapter votre boutique à vos besoins.
“Send us a Hey Now!” Well we finally went racing again and wow, what a return it was! COTA usually serves up some great racing and it did not disappoint this time around either! It was action-packed as we had the sprint to contend with as well as an epic finish to the race itself. We won't even get to draw breath before we're off to Mexico as this was the first in a triple header that will end in Brazil. Episode running order as always is... 1) News & Social All the best bits from both the sports news out there as well as what caught our eye on the various social channels 2) Brian's Video Vault https://www.youtube.com/watch?v=ctQO8e6jJ1o. Neil deGrasse Tyson Explains the Physics of Formula One Racing. StarTalk channel. Nearly 17 mins. This was good - if you're a huge F1 nerd, you probably know most of this. But I enjoyed how a plane's wing and a rear wing in F1 are basically doing the same thing, but in reverse… https://www.youtube.com/watch?v=IwAVrkHs5EQ. Lake Como Unfiltered: Filming the Peroni Nastro Azzurro 0.0% Campaign. Charles Leclerc channel. Nearly 6 mins. If you like Chuck, this is for you. Otherwise pass. And one on Instagram: https://www.instagram.com/reel/DBOs0mmuBOy/?igsh=MTE5YjV2c2ZiZDdsOA== Valtteri's insta and HylandGram's insta. Not too long, shows how they “made music” with his bio data… uh ok. The point of this is as he crosses the line towards the end, a radio message telling him how well he did comes across, and I'm 106% sure it's Pit Lane Paul. The video found by Adam, the PLP notice by Sap. ESPN clip asking them college mascots - so GOOD. Ohio State Humpty Dumpties… omg 3) United States GP Review Practice, qualy, sprint, and race review King of the pits vs the pits Pitlane Paul 4) Mexican GP Preview A look ahead to next week's race in Mexico We would love you to join our Discord server so use this invite link to join us https://discord.gg/XCyemDdzGB If you would like to sign up for the 100 Seconds of DRS then drop us an email stating your time zone to dirtysideofthetrack@gmail.com Also please like, follow, and share our content on Threads, X, Facebook, & Instagram, links to which can be found on our website. One last call to arms is that if you do listen along and like us then first of all thanks, but secondly could we ask that you leave a review and a 5 star rating - please & thanks! If you would like to help the Dirty Side promote the show then we are now on Buy Me a Coffee where 100% of anything we get will get pumped into advertising the show https://www.buymeacoffee.com/dirtysideofthetrack Dirty Side of the Track is hosted on Buzzsprout https://www.buzzsprout.com/ Support the show
Let's Talk Automated Red TeamingExplore automated red teaming and red-blue team synergy with Ryan Hays, Global Head of Red Team at Citi, tackling misconceptions and fostering cross-team collaboration.+ + +Find more episodes on YouTube or wherever you listen to podcasts, as well as at netspi.com/agentofinfluence.
In this episode, Spencer and Brad discuss a recent Trend Micro research project and associated white paper "Red Team Tools in the Hands of Cybercriminals and Nation States". Spencer and Brad dig into what red teaming is, what red team tools (often referred to as offensive security tools) are and why they are used. They also cover the abuse of red team tools, the speed of exploitation after public release and supply chain attacks against red team tools. From Defense to Offense: The Misuse of Red Teaming Tools by Cybercriminals | Trend Micro (US) Blog: https://offsec.blog/ Youtube: https://www.youtube.com/@cyberthreatpov Twitter: https://twitter.com/cyberthreatpov Work with Us: https://securit360.com
In today's podcast we cover four crucial cyber and technology topics, including: 1. Hong Kong arrests 27 individuals associated with romance scams 2. Calgary Public Library dealing with apparent ransomware 3. Red Team security tool adopted by criminals to evade detection 4. North Korean criminals abuse patched Microsoft flaw to deliver malware I'd love feedback, feel free to send your comments and feedback to | cyberandtechwithmike@gmail.com
Cyberpunk Edgerunners take to the stars once again! Join our team of Highriders as they navigate Crystal Palace, Mars, and more during the time of Cyberpunk Red! We're live every Friday at 5pm PT on the Cybernation Uncensored twitch channel! Join us! Calling all game masters, players, edgerunners, choombas, wastelanders, vault dwellers, spice traders & space folders! We have a very active community for Cyberpunk, Fallout, Dune & more! If you're looking to join a game, run a game, network, learn something new, contribute an idea, chat or just hang out, we have the home for you! Check out the ttrpg related options below and be sure to say hello! https://discord.gg/VJv4FPC https://www.twitch.tv/cybernationuncensored https://twitter.com/CNUncensored https://www.patreon.com/CybernationUncensored https://www.youtube.com/cybernationuncensored/join https://www.instagram.com/cybernationuncensored/ https://www.facebook.com/CyberNationUncensored https://www.facebook.com/groups/2951164338265802 Explore our website! https://www.CybernationUncensored.com/ We're a brand dedicated to everything and anything Cyberpunk, dystopian and scifi! We stream live Cyberpunk RED, 2020, Fallout 2D20 & Dune 2D20 gameplay, a Game Master Tips series, Deep Dive series, Night City Live series and a GM Round Table series on the Cybernation Uncensored YouTube and Twitch channels! We discuss everything and anything Cyberpunk, including but not limited to 2020, RED, 2077, Fallout & Dune on our Cybernation Uncensored podcast! We also have a Cybernation Uncensored community blog, discord and group! Join us and let's network and have fun! We have a passion for creating Cyberpunk genre content and would really appreciate your support! Sound & music by Syrinscape https://syrinscape.com/ Because Epic Games Need Epic Sound Complete list of credits here: https://syrinscape.com/attributions/ #cyberpunk #cyberpunkred #cyberpunk2020 --- Support this podcast: https://podcasters.spotify.com/pod/show/cybernationuncensored/support
Prof. Koonin is an American theoretical physicist and former director of the Center for Urban Science and Progress at NYU, as well as a professor in the Department of Civil and Urban Engineering at NYU School of Engineering. In the past he was the Chief Scientist of BP's oil and gas division, served as Under Secretary for Science in the Department of Energy, in the Obama administration, and was the vice-president of Caltech, one of the most prestigious scientific institutes in the world. Steven is the author of the book “Unsettled: What Climate Science Tells Us, What It Doesn't, and Why It Matters”, where he argues that while there are some basic facts about climate change that experts agree upon, the meaning of those facts is not so settled, and mainstream scientific studies do not support the notion that there is any kind of climate crisis at all. From Efrat Fenigson: “I cover politics, health, climate, money, economics & bitcoin, propaganda, and more. From time to time I cover the state of affairs in Israel, the broader picture of global events, and our role as sovereign citizens.” This conversation discusses Climate Realism - the sane approach to the “Climate Change” alarmism, and the role of media in shaping public perception. We touched on topics such as the use of the term 'climate denier,' bias in the energy industry, the challenges faced by young scientists who question the climate narrative, the role of journalists in spreading misinformation, and the influence of organizations like the UN and Covering Climate Now. We talked about the viral documentary 'Climate the Movie' and censorship attempts. Lastly we touched on the funding dynamic in climate research, and geoengineering / chemtrails. Steven emphasizes the need for open scientific discussion and the importance of prudence in considering these interventions. We end with the challenges & optimism in maintaining integrity and truth-telling in a corrupted world. Tom's Twitter: https://x.com/TomANelson Tom's Substack: https://tomn.substack.com/ Tom's links: https://linktr.ee/tomanelson1 Efrat's Twitter: https://x.com/efenigson Efrat's Telegram: https://t.me/efenigson Watch/listen on all platforms: https://linktr.ee/yourethevoice Support Efrat's work: https://www.buymeacoffee.com/efenigson Support Efrat with Bitcoin: https://geyser.fund/project/efenigson -- CHAPTERS – 00:00 Coming Up 01:14 Introductions 03:54 Challenging the Term 'Climate Denier' 06:43 The Climate Discussion "Silence" 09:03 Impacts on Those Speaking Out 10:53 Steven's Evolution to Climate Realism 16:33 Misrepresentation of Facts 21:37 Organized Online Propaganda 27:34 Climate - The Movie 32:03 Geoengineering & Chemtrails 41:10 Red Team, Blue Team 44:33 Dating CO2 in Deep Ice 45:55 Playing Bongos with Richard Feynman 49:06 Message of Hope
My guest today is Prof. Steven Koonin, co-hosted with Tom Nelson - host of The Tom Nelson Podcast. Prof. Koonin is an American theoretical physicist and former director of the Center for Urban Science and Progress at NYU, as well as a professor in the Department of Civil and Urban Engineering at NYU School of Engineering. In the past he was the Chief Scientist of BP's oil and gas division, served as Under Secretary for Science in the Department of Energy, in the Obama administration, and was the vice-president of Caltech, one of the most prestigious scientific institutes in the world. Steven is the author of the book “Unsettled: What Climate Science Tells Us, What It Doesn't, and Why It Matters”, where he argues that while there are some basic facts about climate change that experts agree upon, the meaning of those facts is not so settled, and mainstream scientific studies do not support the notion that there is any kind of climate crisis at all. This conversation discusses Climate Realism - the sane approach to the “Climate Change” alarmism, and the role of media in shaping public perception. We touched on topics such as the use of the term 'climate denier,' bias in the energy industry, the challenges faced by young scientists who question the climate narrative, the role of journalists in spreading misinformation, and the influence of organizations like the UN and Covering Climate Now. We talked about the viral documentary 'Climate the Movie' and censorship attempts. Lastly we touched on the funding dynamic in climate research, and geoengineering / chemtrails. Steven emphasizes the need for open scientific discussion and the importance of prudence in considering these interventions. We end with the challenges & optimism in maintaining integrity and truth-telling in a corrupted world. ► If you got value, please like, comment, share, subscribe and support my work. Thank you! -- SPONSORS – ►► Get your TREZOR wallet & accessories, with a 5% discount, using my code at checkout (get my discount code from the episode - yep, you'll have to watch it): https://affil.trezor.io/SHUn -- LINKS – Prof. Koonin's book - Unsettled: What Climate Science Tells Us, What It Doesn't, and Why It Matters: https://www.amazon.com.au/Unsettled-Climate-Science-Doesnt-Matters/dp/1950665798 Climate: The Movie: https://rumble.com/v4klh96-climate-the-movie-the-cold-truth.html Tom's Twitter: https://x.com/TomANelson Tom's Substack: https://tomn.substack.com/ Tom's links: https://linktr.ee/tomanelson1 Efrat's Twitter: https://twitter.com/efenigson Efrat's Telegram: https://t.me/efenigson Watch/listen on all platforms: https://linktr.ee/yourethevoice Support Efrat's work: https://www.buymeacoffee.com/efenigson Support Efrat with Bitcoin: https://geyser.fund/project/efenigson -- CHAPTERS – 00:00 Coming Up 01:14 Introductions 03:54 Challenging the Term 'Climate Denier' 06:43 The Climate Discussion "Silence" 09:03 Impacts on Those Speaking Out 10:53 Steven's Evolution to Climate Realism 16:33 Misrepresentation of Facts 21:37 Organized Online Propaganda 27:34 Climate - The Movie 32:03 Geoengineering & Chemtrails 41:10 Red Team, Blue Team 44:33 Dating CO2 in Deep Ice 45:55 Playing Bongos with Richard Feynman 49:06 Message of Hope
Episode #472 Red Team 2024 With Gregory Draperi The post Red Team 2024 appeared first on NoLimitSecu.
In this Emergency Pod of The Cognitive Revolution, Nathan provides crucial insights into OpenAI's new o1 and o1-mini reasoning models. Featuring exclusive interviews with members of the o1 Red Team from Apollo Research and Haize Labs, we explore the models' capabilities, safety profile, and OpenAI's pre-release testing approach. Dive into the implications of these advanced AI systems, including their potential to match or exceed expert performance in many areas. Join us for an urgent and informative discussion on the latest developments in AI technology and their impact on the future. o1 Safety Card Haize Labs Endless Jailbreaks with Bijection Learning: a Powerful, Scale-Agnostic Attack Method Haize Labs Job board Papers mentioned: https://arxiv.org/pdf/2407.21792 https://far.ai/post/2024-07-robust-llm/paper.pdf Apply to join over 400 Founders and Execs in the Turpentine Network: https://www.turpentinenetwork.co/ SPONSORS: Oracle: Oracle Cloud Infrastructure (OCI) is a single platform for your infrastructure, database, application development, and AI needs. OCI has four to eight times the bandwidth of other clouds; offers one consistent price, and nobody does data better than Oracle. If you want to do more and spend less, take a free test drive of OCI at https://oracle.com/cognitive Brave: The Brave search API can be used to assemble a data set to train your AI models and help with retrieval augmentation at the time of inference. All while remaining affordable with developer first pricing, integrating the Brave search API into your workflow translates to more ethical data sourcing and more human representative data sets. Try the Brave search API for free for up to 2000 queries per month at https://bit.ly/BraveTCR Omneky: Omneky is an omnichannel creative generation platform that lets you launch hundreds of thousands of ad iterations that actually work customized across all platforms, with a click of a button. Omneky combines generative AI and real-time advertising data. Mention "Cog Rev" for 10% off https://www.omneky.com/ Squad: Head to Squad to access global engineering without the headache and at a fraction of the cost: head to https://choosesquad.com/ and mention “Turpentine” to skip the waitlist. RECOMMENDED PODCAST: This Won't Last. Eavesdrop on Keith Rabois, Kevin Ryan, Logan Bartlett, and Zach Weinberg's monthly backchannel. They unpack their hottest takes on the future of tech, business, venture, investing, and politics. Apple Podcasts: https://podcasts.apple.com/us/podcast/id1765665937 Spotify: https://open.spotify.com/show/2HwSNeVLL1MXy0RjFPyOSz YouTube: https://www.youtube.com/@ThisWontLastpodcast CHAPTERS: (00:00:00) About the Show (00:00:22) About the Episode (00:05:03) Introduction and Haize Labs Overview (00:07:36) Universal Jailbreak Technique and Attacks (00:13:47) Automated vs Manual Red Teaming (00:17:15) Qualitative Assessment of Model Jailbreaking (Part 1) (00:19:38) Sponsors: Oracle | Brave (00:21:42) Qualitative Assessment of Model Jailbreaking (Part 2) (00:26:21) Context-Specific Safety Considerations (00:32:26) Model Capabilities and Safety Correlation (Part 1) (00:36:22) Sponsors: Omneky | Squad (00:37:48) Model Capabilities and Safety Correlation (Part 2) (00:44:42) Model Behavior and Defense Mechanisms (00:52:47) Challenges in Preventing Jailbreaks (00:56:24) Safety, Capabilities, and Model Scale (01:00:56) Model Classification and Preparedness (01:04:40) Concluding Thoughts on o1 and Future Work (01:05:54) Outro
In this episode, we sit down with Santiago, a Senior Security Engineer at Canva, to talk about the complexities of building and managing an incident response team, especially in high-growth companies. Santiago shares his experience transitioning from penetration testing to incident response and highlights the unique challenges that come with protecting a rapidly expanding organization. We explore the differences between incident response in high-growth versus established companies, the importance of having the right personnel, and the critical skills needed for effective incident response. Guest Socials: Santiago's Linkedin Podcast Twitter - @CloudSecPod If you want to watch videos of this LIVE STREAMED episode and past episodes - Check out our other Cloud Security Social Channels: - Cloud Security Podcast- Youtube - Cloud Security Newsletter - Cloud Security BootCamp Questions asked: (00:00) Introduction (01:58) A word from our sponsor - SentinelOne (02:48) A bit about Santiago (03:18) What is Incident Response? (04:06) How IR differs in different organisations? (04:48) Red Team vs Incident Response Team (06:17) Challenges for Incident Response in Cloud (07:16) Incident Response in a High Growth Company (07:56) Skillsets required for high growth (09:14) Cloud vs On Prem Incident Response (10:03) Building Incident Response in High Growth Company (11:39) Responding to incidents that are not high risk (14:41) Transition from pentesting to incident responder (17:20) Endpoint vulnerability management at scale (25:32) The Fun Section Resources from the episode: Endpoint Vulnerability Management at Scale
Join Inspired by Ms Amber Red from Las Vegas for this fun and interesting podcast where we break down the art of investing in yourself and your business with a sprinkle of sass! Find out why there has never been a better time to invest in yourself than NOW! If you're ready to level up your life and stack those coins, this is the podcast for you. We're diving into smart money moves, self-care that pays off, and business strategies that work as hard as you do. Rachel Santiago, our go-to finance bestie, makes finance fun, relatable, and totally doable. She even tells you…how much you should have invested RIGHT NOW (based on your age) in order to retire and live comfortably! Did you know there's actually a math equation for it? Tune in, get inspired, and watch your world—and wallet—grow!
If you're part of our email list and were able to join us LIVE on TikTok this last week for a full ombre brow procedure, boy were you lucky! We had so much fun and I tattooed one of our very own talented in-house artists, Tania D Artisty. But don't stress if you were unable to make the live, that's what we are discussing in today's episode! We discuss the struggles with going LIVE on TikTok and how we battle the trolls, how we have grown our account to what it is today, and you guessed it, we discuss ALL THINGS OMBRE BROWS! For supporting our podcast we'd like to offer you a very special offer on the presale of my newest course: PIXELATED & BOLD! For a very limited time you can get this course for only $199. Follow the link here to get your savings: https://www.inspiredbymsamberred.com/offers/z3GKsno9?coupon_code=YT199PIXELATED Want to watch the TikTok live recap? Click here: https://www.inspiredbymsamberred.com/blog/tiktok-live-recap-and-q-a As always don't forget to like, comment, share, and subscribe. Follow Ms Amber Red on all social media platforms: IG: inspiredbymsamberred TikTok: msamberred FB: msamberred permanent makeup Check out our previous episode: Interviewing Ms Amber Red - https://youtu.be/V_nYSQe7noc Support the Show.
We'd like to invite you to join us for Inspired by Ms Amber Red's 50th podcast episode in Las Vegas‼️ Join us for a LIVE episode as the Ms Amber Red team & our private FB Group interview Amber! Yup, you heard that right! Get all of your questions answered! Some questions that the team asks Amber: How do you feel about being called boss? What was your biggest OH SHIT moment? Top 5 things to get eyeliner to heal black? How important is it to have a supportive partner? What is your biggest fear? Is social media important? Any new found interests since opening a business? What would you like to level up on? And so much more! If you've been wanting to really get to know Amber and what the future holds THIS IS THE EPISODE to listen to‼️ One important topic I'd like to elaborate on is: What is Inspired by Ms Amber Red? Read more here: https://www.inspiredbymsamberred.com/blog/what-is-inspired-by-ms-amber-red As always don't forget to like, comment, share, and subscribe! Did you know we have a private PMU FB community to join? Join us here: https://www.facebook.com/share/GHZPfanLqX8NTVVq/?mibextid=K35XfP Follow Ms Amber Red: IG - msamberred TikTok - msamberred FB - msamberred permanent makeup Check out our previous episode here: Better together - https://youtu.be/e99pvCUyrns Support the Show.
Guest: Allyn Stott, Senior Staff Engineer, meoward.coOn LinkedIn | https://www.linkedin.com/in/whyallynOn Twitter | https://x.com/whyallyn____________________________Host: Sean Martin, Co-Founder at ITSPmagazine [@ITSPmagazine] and Host of Redefining CyberSecurity Podcast [@RedefiningCyber]On ITSPmagazine | https://www.itspmagazine.com/sean-martinView This Show's Sponsors___________________________Episode NotesIn this episode of The Redefining CyberSecurity Podcast, host Sean Martin converses with Allyn Stott, who shares his insights on rethinking how we measure detection and response in cybersecurity. The episode explores the nuances of cybersecurity metrics, emphasizing that it's not just about having metrics, but having the right metrics that truly reflect the effectiveness and efficiency of a security program.Stott discusses his journey from red team operations to blue team roles, where he has focused on detection and response. His dual perspective provides a nuanced understanding of both offensive and defensive security strategies. Stott highlights a common issue in cybersecurity: the misalignment of metrics with organizational goals. He points out that many teams inherit metrics that may not accurately reflect their current state or objectives. Instead, metrics should be strategically chosen to guide decision-making and improve security posture. One of his key messages is the importance of understanding what specific metrics are meant to convey and ensuring they are directly actionable.In his framework, aptly named SAVER (Streamlined, Awareness, Vigilance, Exploration, Readiness), Stott outlines a holistic approach to security metrics. Streamlined focuses on operational efficiencies achieved through better tools and processes. Awareness pertains to the dissemination of threat intelligence and ensuring that the most critical information is shared across the organization. Vigilance involves preparing for and understanding top threats through informed threat hunting. Exploration encourages the proactive discovery of vulnerabilities and security gaps through threat hunts and incident analysis. Finally, Readiness measures the preparedness and efficacy of incident response plans, emphasizing the coverage and completeness of playbooks over mere response times.Martin and Stott also discuss the challenge of metrics in smaller organizations, where resources may be limited. Stott suggests that simplicity can be powerful, advocating for a focus on key risks and leveraging publicly available threat intelligence. His advice to smaller teams is to prioritize understanding the most significant threats and tailoring responses accordingly.The conversation underscores a critical point: metrics should not just quantify performance but also drive strategic improvements. 
By asking the right questions and focusing on actionable insights, cybersecurity teams can better align their efforts with their organization's broader goals.For those interested in further insights, Stott mentions his upcoming talks at B-Sides Las Vegas and Blue Team Con in Chicago, where he will expand on these concepts and share more about his Threat Detection and Response Maturity Model.In conclusion, this episode serves as a valuable guide for cybersecurity professionals looking to refine their approach to metrics, making them more meaningful and aligned with their organization's strategic objectives.___________________________Watch this and other videos on ITSPmagazine's YouTube ChannelRedefining CyberSecurity Podcast with Sean Martin, CISSP playlist:
Today on Inspired we have the whole team! Amber, Sasha, Tania, Brisa, Victor, and our visiting artist Maddie from Massachusetts. Listen as we discuss topics such as PMU conference Dos and Don'ts, the importance of legit training and apprenticeships, and more! Amber also discusses how it feels to be a mentor to so many different personalities at once and how she balances it all. No matter the differences, it truly is what makes us Better Together! Interested in learning more about our 6 month apprenticeship program? Click the following link: https://www.inspiredbymsamberred.com/6-month-apprenticeship As always don't forget to like, comment, share, and subscribe! Looking for a judgment-free, no-gatekeeping community to join? Join us on our private FB Group: https://www.facebook.com/share/GHZPfanLqX8NTVVq/?mibextid=K35XfP Follow Ms Amber Red: IG - msamberred TikTok - msamberred FB - msamberred permanent makeup “The views, thoughts, and opinions expressed are the speaker's own and do not represent the views, thoughts, and opinions of the University of California. The material and information presented here is for general information purposes only. The "University of California" name and all forms and abbreviations are the property of its owner and its use does not imply endorsement of or opposition to any specific organization, product, or service.” Check out our previous episode: "Stop Gatekeeping" - https://youtu.be/mmdYI6SrMwY Support the Show.
Support the D.A.W.G.Z. @ patreon.com/MSsecretpod Support Nate and Lemeez @ patreon.com/pitm Go See Lemaire at Cap City in Austin @ https://www.capcitycomedy.com/shows/252000 Go See Shawn at Helium Philly @ https://philadelphia.heliumcomedy.com/shows/247133 Go See Matt Live @ mattmccusker.com/dates Go See Shane Live @ shanemgillis.com Get Merch @ mssecretpodcast.com/merch Hello everyone. Big SG's away on bidness so the mice had to play without him. We miss you Shane - see you soon!!!!!! In the kahuna's absence Matt assembled the bros to chill in Josh's stu - god bless josh. Also Matt was feelin kinda sick but he worked anyways - nbd. Please enjoy. God Bless you all. Upgrade your wardrobe and get up to 25% OFF @trueclassic at https://trueclassictees.com/DRENCHED! #trueclassicpod Support the show & get Lucy Breakers for 20% off & free shipping at https://www.lucy.co promo code DRENCHED