Today, I'm talking with Kolton Andrus, CEO and Founder of Gremlin. You may remember our earlier conversation with Matt, the company's prior CTO. A lot has changed at the company since then: it has gone through several arcs and foundational changes that lead not only to assessing weaknesses in your infrastructure, but to walking you through how to fix them (and, eventually, fixing them for you).

Questions:
Tell us a little bit of an overview about you.
Your time at Amazon and Netflix was a big influence on the importance of chaos engineering and reliability testing. Can you tell me what was so foundational about your time there?
What is the next iteration of Gremlin? What has changed in the platform, primarily? Tell me about the arcs of the company here.
In 2022, there was a leadership transition and you increased your focus on the product. What are some of the most exciting developments from the last three years?
Where does AI fit into chaos engineering? And where does it not fit? Can you unpack your viewpoint here?
What are you most excited about in the next chapter for Gremlin, and for the broader SRE space?
What advice would you give a founder just getting started?

I couldn't be more excited about the future of Gremlin. Given the arcs the company has gone through, it's evident that Kolton has built foundational layers into the platform and is steering the ship toward responsible chaos engineering, reliability, automation, and much more.

Thank you for listening to today's episode. If you'd like to learn more about Gremlin, please visit gremlin.com.

Sponsors:
Paddle.com
Sema Software
PropelAuth
Postman
Meilisearch
Mailtrap
.TECH Domains (https://get.tech/codestory)

Links:
https://www.gremlin.com/
https://www.linkedin.com/in/kolton-andrus-77315a2/
https://codestory.co/podcast/e9-matt-fornaciari-gremlin/

Support this podcast at — https://redcircle.com/code-story/donations
Advertising Inquiries: https://redcircle.com/brands
Privacy & Opt-Out: https://redcircle.com/privacy
Robby is joined by Sara Jackson, Senior Developer at thoughtbot, to explore the practical ways teams can foster resilience—not just in their infrastructure, but in their everyday habits. They talk about why documentation is more than a chore, how to build trust in test suites, and how Chaos Engineering at the application layer can help make the case for long-term investment in maintainability.

Sara shares why she advocates for writing documentation on day one, how "WET" test practices have helped her avoid brittle test suites, and why she sees ports as a powerful alternative to full rewrites. They also dive into why so many teams overlook failure scenarios that matter deeply to end users—and how being proactive about those situations can shape better products and stronger teams.

Episode Highlights
[00:01:28] What Well-Maintained Software Looks Like: Sara champions documentation that's trusted, updated, and valued by the team.
[00:07:23] Invisible Work and Team Culture: Robby and Sara discuss how small documentation improvements often go unrecognized—and why leadership buy-in matters.
[00:10:34] Why Documentation Should Start on Day One: Sara offers a "hot take" about writing things down early to reduce cognitive load.
[00:16:00] What Chaos Engineering Really Is: Sara explains the scientific roots of the practice and its DevOps origins.
[00:20:00] Application-Layer Chaos Engineering: How fault injection can reveal blind spots in the user experience.
[00:24:36] Observability First: Why you need the right visibility before meaningful chaos experiments can begin.
[00:28:32] Pitching Resilience to Stakeholders: Robby and Sara explore how chaos experiments can justify broader investments in system quality.
[00:33:24] WET Tests vs. DRY Tests: Sara explains why test clarity and context matter more than clever abstractions.
[00:40:43] Working on Client Refactors: How Sara approaches improving test coverage before diving into major changes.
[00:42:11] Rewrite vs. Refactor vs. Port: Sara introduces "porting" as a more intentional middle path for teams looking to evolve their systems.
[00:50:45] Delete More Code: Why letting go of unused features can create forward momentum.
[00:51:13] Recommended Reading: Being Wrong by Kathryn Schulz.

Resources & Links
Sara on Mastodon
thoughtbot
RubyConf 2024 Talk – Chaos Engineering on the Death Star
Book: Being Wrong by Kathryn Schulz
Flu Shot on GitHub
ChaosRB on GitHub
Semian from Shopify — a chaos engineering toolkit for Ruby

Thanks to Our Sponsor!
Turn hours of debugging into just minutes! AppSignal is a performance monitoring and error-tracking tool designed for Ruby, Elixir, Python, Node.js, JavaScript, and other frameworks. It offers six powerful features with one simple interface, providing developers with real-time insights into the performance and health of web applications. Keep your coding cool and error-free, one line at a time! Use the code maintainable to get a 10% discount for your first year. Check them out!

Subscribe to Maintainable on Apple Podcasts or Spotify, or search "Maintainable" wherever you stream your podcasts. Keep up to date with the Maintainable Podcast by joining the newsletter.
In this episode of Building Better Developers with AI, Rob Broadhead and Michael Meloche revisit a popular question: what happens when software fails? Originally titled When Coffee Hits the Fan: Developer Disaster Recovery, this AI-enhanced breakdown explores real-world developer mistakes, recovery strategies, and the tools that help turn chaos into control. Whether you're managing your first deployment or juggling enterprise infrastructure, you'll leave this episode better equipped for the moment when software fails.

When Software Fails and Everything Goes Down
The podcast kicks off with a dramatic (but realistic) scenario: CI passes, coffee is in hand, and then production crashes. While that might sound extreme, it's a situation many developers recognize. Rob and Michael cover some familiar culprits:
Dropping a production database
Misconfigured cloud infrastructure costing hundreds overnight
Accidentally publishing secret keys
Over-provisioned "default" environments meant for enterprise use
Takeaway: software will fail. Being prepared is the difference between a disaster and a quick fix.

Why Software Fails: Avoiding Costly Dev Mistakes
Michael shares an all-too-common situation: connecting to the wrong environment and running production-breaking SQL. The issue wasn't the code—it was the context. Here are some best practices to avoid accidental failure (a short sketch of the preview-and-backup habit appears after these notes):
Color-code terminal environments (green for dev, red for prod)
Disable auto-commit in production databases
Always preview changes with a SELECT before running DELETE or UPDATE
Back up databases or individual tables before making changes
These simple habits can save hours—or days—of cleanup.

How to Recover When Software Fails
Rob and Michael outline a reliable recovery framework that works in any team or tech stack:
Monitoring and alerts: tools like Datadog, Prometheus, and Sentry help detect issues early
Rollback plans: scripts, snapshots, and container rebuilds should be ready to go
Runbooks: documented recovery steps prevent chaos during outages
Postmortems: blameless reviews help teams learn and improve
Clear communication: everyone on the team should know who's doing what during a crisis
Pro tip: practice disaster scenarios ahead of time. Simulations help ensure you're truly ready.

Essential Tools for Recovery
Tools can make or break your ability to respond quickly when software fails. Rob and Michael recommend:
Docker and Docker Compose for replicable environments
Terraform and Ansible for consistent infrastructure
GitHub Actions, GitLab CI, and Jenkins for automated testing and deployment
Chaos engineering tools like Gremlin and Chaos Monkey
Snapshot and backup automation to enable fast data restoration
Michael emphasizes that containers are the fastest way to spin up clean environments, test recovery steps, and isolate issues safely.

Mindset Matters: Staying Calm When Software Fails
Technical preparation is critical—but so is mindset. Rob notes that no one makes smart decisions in panic mode. Having a calm, repeatable process in place reduces pressure when systems go down. Cultural and team-based practices:
Use blameless postmortems to normalize failure
Avoid root access in production whenever possible
Share mistakes in standups so others can learn
Make local environments mirror production using containers
Reminder: recovery is a skill—one you should build just like any feature. Think you're ready for a failure scenario? Prove it.
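To make the preview-and-backup habit concrete, here is a minimal sketch in Python using the standard-library sqlite3 module. It illustrates the practice described in the notes above rather than any tooling from the episode; the orders table, the stale status value, and the row-count threshold are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect("app.db")
conn.isolation_level = None  # manage BEGIN/COMMIT explicitly instead of relying on implicit transactions
cur = conn.cursor()

cur.execute("BEGIN")
try:
    # 1. Preview: run a SELECT with the exact WHERE clause you intend to DELETE with.
    cur.execute("SELECT COUNT(*) FROM orders WHERE status = ?", ("stale",))
    count = cur.fetchone()[0]
    print(f"This DELETE would remove {count} rows")
    if count > 1000:  # hypothetical sanity threshold; tune for your data
        raise RuntimeError("unexpectedly large delete, aborting")

    # 2. Back up the affected table before changing it.
    cur.execute("DROP TABLE IF EXISTS orders_backup")
    cur.execute("CREATE TABLE orders_backup AS SELECT * FROM orders")

    # 3. Only now run the destructive statement, still inside the open transaction.
    cur.execute("DELETE FROM orders WHERE status = ?", ("stale",))
    cur.execute("COMMIT")
except Exception:
    cur.execute("ROLLBACK")  # any doubt: roll back, and nothing is lost
    raise
finally:
    conn.close()
```

The same pattern (preview, back up, then mutate inside an explicit transaction) carries over to the client libraries for production databases.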
This week, simulate a software failure in your development environment:
Turn off a service your app depends on
Delete (then restore) a local database from backup
Use Docker to rebuild your environment from scratch
Trigger a mock alert in your monitoring tool

Then answer these questions: How fast can you recover? What broke that you didn't expect? What would you do differently in production? (A minimal sketch for timing one of these drills follows the resource list below.) Recovery isn't just theory—it's a skill you build through practice. Start now, while the stakes are low.

Final Thought
Software fails. That's a reality of modern development. But with the right tools, smart workflows, and a calm, prepared team, you can recover quickly—and even improve your system in the process. Learn from failure. Build with resilience. And next time something breaks, you'll know exactly what to do.

Stay Connected: Join the Developreneur Community
We invite you to join our community and share your coding journey with us. Whether you're a seasoned developer or just starting out, there's always room to learn and grow together. Contact us at info@develpreneur.com with your questions, feedback, or suggestions for future episodes. Together, let's continue exploring the exciting world of software development.

Additional Resources
System Backups – Prepare for the Worst
Using Dropbox To Provide A File Store and Reliable Backup
Testing Your Backups – Disaster Recovery Requires Verification
Virtual Systems On A Budget – Realistic Cloud Pricing
Building Better Developers With AI Podcast Videos – With Bonus Content
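As promised above, here is one way to time the recovery drill. This is a hedged sketch rather than the hosts' tooling: it assumes Docker is installed, that a dependency runs in a container (the name drill-redis is made up), and that your app exposes a health endpoint at a URL you would substitute for your own.

```python
import subprocess
import time
import urllib.request

CONTAINER = "drill-redis"                    # hypothetical dependency container
HEALTH_URL = "http://localhost:8000/health"  # hypothetical app health endpoint

def healthy() -> bool:
    """Return True if the app currently reports healthy."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

# Inject the failure: stop the dependency, then immediately begin recovery.
subprocess.run(["docker", "stop", CONTAINER], check=True)
started = time.monotonic()
subprocess.run(["docker", "start", CONTAINER], check=True)

# Poll until the app recovers, with a hard cap so the drill itself cannot hang.
deadline = started + 300
while not healthy():
    if time.monotonic() > deadline:
        raise TimeoutError("no recovery within 5 minutes; that is a finding too")
    time.sleep(1)

print(f"Recovered in {time.monotonic() - started:.1f} seconds")
```

Run it while watching your monitoring tool: if the measured recovery time surprises you, that answers the first question above.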
Join Jenn Bergstrom and the vBrownBag crew for a deep dive into chaos engineering for machine learning and AI models. Discover how deliberate failure injection can improve system resilience, explore real-world experiments on AI model vulnerabilities, and learn why testing for failure is critical in today's fast-moving AI landscape. Whether you're an engineer, data scientist, or tech leader, this conversation is packed with practical insights, cautionary tales, and a touch of humor.

#ChaosEngineering #MachineLearning #AI #vBrownBag #AIOps #ModelResilience #TechTalk

Chapters:
00:00 – Introduction & vBrownBag Welcome
02:45 – What Is Chaos Engineering?
10:45 – Netflix, Chaos Monkey, and the Origins
16:40 – Chaos Engineering for AI & ML Models
27:00 – Non-Determinism in LLMs and Testing Challenges
46:00 – Organizational Adoption & Q&A

Resources:
https://www.linkedin.com/in/jenn-bergstrom/
https://amzn.to/44JBw5D - "Security Chaos Engineering: Sustaining Resilience in Software and Systems"
https://amzn.to/4l3fFMi - "Chaos Engineering: Site reliability through controlled disruption"
https://amzn.to/4kiqY1W - "Chaos Engineering: System Resiliency in Practice"
https://netflix.github.io/chaosmonkey/
Christophe Rochefolle has had an impressive career as a CTO, is a recognized authority on chaos engineering, and is the author of one of the reference books on DevOps.

In this episode, we discuss the many facets of an appropriate response to crises: chaos engineering, robustness, and mental health.

Quick access:
03:00: Chaos engineering
09:50: The production incident that becomes a trauma
12:15: The "Days of Chaos"
16:40: Incident management rather than problem management
21:20: The concept of robustness (versus the cult of performance)
29:30: The "truck factor," from I-shaped to comb-shaped skills, and mental load
35:20: Platform engineering
42:40: "Mettre en oeuvre DevOps" turns 10: what has happened since 2015?
52:00: Polycrises
55:20: Mental health and robustness; mental health first aid
1:03:03: Managing with a big heart
1:07:40: Recommended management practices

Recommendations:
"Mettre en oeuvre DevOps," Alain Sacquet and Christophe Rochefolle
"Never Split the Difference," Chris Voss
"Klara and the Sun," Kazuo Ishiguro
"Friends"

Hosted by Acast. Visit acast.com/privacy for more information.
Part two of our chaos engineering series is here! Join Andrey, Mattias, and Paulina as they talk through practical strategies for chaos engineering. Who should do it? How can you start? And what are the essential prerequisites? Connect with us on LinkedIn or Twitter (see info at https://devsecops.fm/about/). We are happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
In this episode, we dive deep into the world of artificial intelligence (AI) and look at it from a systemic perspective. Inspired by the classic cartoon Spy vs. Spy, the question is how AI systems interact with one another, both in the sense of cooperation and of competition, and what challenges this poses for us humans. I examine the evolutionary nature of technological progress, the risks of complexity and loss of control, and the possible consequences of a future in which AI shifts from a passive tool to an active actor. Join me on a journey through emergent phenomena and fulgurations, systemic dependencies, and the question of how we can retain control over ever more complex systems.

Numerous questions arise:
How might the interaction between AI systems develop in the future: more cooperative, or more competitive?
What role do humans still play when AI systems make decisions ever faster and become autonomous actors?
What experiences have you had with technical systems that lost control through complexity or automation?
What does it mean for our society when automation hands know-how and control to external actors?
How can we shape AI's transition from a toy to a systemically necessary element without becoming dependent on it?
Do you believe AI could ever develop its own motivation or intentionality, and if so, how should we respond?
What strategies could help us retain control over complex, non-deterministic systems such as AI agents?
How do you see the geopolitical challenges of AI development, particularly with regard to energy and innovation?
What positive scenarios can you imagine for a future with AI and autonomous agents?
How are you preparing, personally or in your own environment, for the coming developments in AI?

I look forward to hearing your thoughts and views! Write to me, and let's think about the future together. Until next time, when we think about the future together!

References

Other episodes:
Episode 109: Was ist Komplexität? Ein Gespräch mit Dr. Marco Wehr
Episode 107: How to Organise Complex Societies? A Conversation with Johan Norberg
Episode 104: Aus Quantität wird Qualität
Episode 103: Schwarze Schwäne in Extremistan; die Welt des Nassim Taleb, ein Gespräch mit Ralph Zlabinger
Episode 99: Entkopplung, Kopplung, Rückkopplung
Episode 94: Systemisches Denken und gesellschaftliche Verwundbarkeit, ein Gespräch mit Herbert Saurugg
Episode 90: Unintended Consequences (Unerwartete Folgen)
Episode 69: Complexity in Software
Episode 40: Software Nachhaltigkeit, ein Gespräch mit Philipp Reisinger
Episode 31: Software in der modernen Gesellschaft – Gespräch mit Tom Konrad

Technical references:
Spy vs. Spy, The Complete Casebook
Rupert Riedl, Strukturen der Komplexität: Eine Morphologie des Erkennens und Erklärens, Springer (2000)
Doug Meil, The U.K. Post Office Scandal: Software Malpractice At Scale – Communications of the ACM (2024)
Casey Rosenthal, Chaos Engineering, O'Reilly (2017)
Chaos engineering—is it really chaos, or something more structured? Andrey, Paulina, and Mattias talk about what chaos engineering means, how it started, and why you might already be using it unintentionally. Connect with us on LinkedIn or Twitter (see info at https://devsecops.fm/about/). We are happy to answer any questions, hear suggestions for new episodes, or hear from you, our listeners.
November 6, 2024: George Pappas, CEO at Intraprise Health, joins Drex for the news. The conversation delves into the challenges of maintaining robust cybersecurity practices amidst resource constraints and evolving regulatory burdens. Can a split CISO role better navigate the increasing demands of compliance and governance? The show also highlights Main Line Health's strategic use of chaos engineering to test vulnerabilities and improve resilience—how does such a proactive approach redefine preparedness in an era where digital and analog systems must coexist seamlessly?

Key Points:
02:00 Unmanaged Cloud Credentials and Monitoring
09:01 CISO Role and Regulatory Burden
14:09 Cybersecurity Risk Management
18:26 Chaos Engineering at Main Line Health

News articles:
Nearly Half of Organizations Face Data Breach Risks from Long-Lived Credentials
CISOs Push for Role Split Amid Rising Regulatory Pressures
Main Line Health deploys chaos engineering to bolster healthcare resilience

This Week Health: Subscribe
This Week Health: Twitter
This Week Health: LinkedIn
Alex's Lemonade Stand: Foundation for Childhood Cancer - Donate
On this Replay, we're revisiting our conversation with Jason Yee, Staff Technical Advocate at Datadog. At the time of this recording, he was the Director of Advocacy at Gremlin, an enterprise-grade chaos engineering platform. Join Corey and Jason as they talk about what Gremlin is and what a director of advocacy does, making chaos engineering more accessible for the masses, how it's hard to calculate ROI for developer advocates, how developer advocacy and DevRel change from one company to the next, why developer advocates need to focus on meaningful connections, why you should start chaos engineering as a mental game, qualities to look for in good developer advocates, the Break Things On Purpose podcast, and more.

Show Highlights
(0:00) Intro
(0:31) Backblaze sponsor read
(0:58) The role of a Director of Advocacy
(3:34) DevRel and twisting job definitions
(5:50) How DevRel confusion manifests into marketing
(11:37) Being able to measure and define a team's success
(13:42) Building respect and a community in tech
(15:22) Effectively courting a community
(18:02) The challenges of Jason's job
(21:06) Planning for failure modes
(22:30) Determining your value in tech
(25:41) The growth of Gremlin
(30:16) Where you can find more from Jason

About Jason Yee
Jason Yee is Staff Technical Advocate at Datadog, where he works to inspire developers and ops engineers with the power of metrics and monitoring. Previously, he was the community manager for DevOps & Performance at O'Reilly Media and a software engineer at MongoDB.

Links
Break Things On Purpose podcast: https://www.gremlin.com/podcast/
Twitter: https://twitter.com/gitbisect
Original episode: https://www.lastweekinaws.com/podcast/screaming-in-the-cloud/chaos-engineering-for-gremlins-with-jason-yee/

Sponsor
Backblaze: https://www.backblaze.com/
In today's episode of Category Visionaries, we speak with Casey Rosenthal, CEO of Verica.

Topics Discussed:
Chaos engineering - what does it mean? Why is it so important, and what impact can it have on an enterprise tech setup?
Why more and more companies are seeking to add chaos engineering to their approach, and why it's already popular in fintech
Why diversity is key for modern tech companies, driving better performance and ultimately more success
Continuous verification and the DevOps revolution - where does this new category stand in the story?
How harmonizing schedules between different departments can be a real challenge for a company looking to expand

Favorite book: What Works: Gender Equality by Design
In this episode, we spoke to Karthik Satchitanand. Karthik is a principal software engineer at Harness and co-founder and maintainer of LitmusChaos, a CNCF-incubated project. We talked about chaos engineering, the Litmus project, and more.

Do you have something cool to share? Some questions? Let us know:
web: kubernetespodcast.com
mail: kubernetespodcast@google.com
twitter: @kubernetespod

News of the week
Kubernetes 1.31 release blog
Kubernetes 1.31 release episode of the Kubernetes Podcast from Google
KubeCon NA 2024 Schedule
Score accepted as a CNCF Sandbox Project

Links from the interview
LitmusChaos
principlesofchaos.org
Okteto
LitmusChaosCon
community.cncf.io

Links from the post-interview chat
Chaos Monkey
Chapter 5 of "Chaos Engineering" by Casey Rosenthal and Nora Jones, published by O'Reilly, covers DiRT
LitmusChaos ChaosHub
Klustered on YouTube
Rawkode Academy
Join us at our first in-person conference on June 25, all about AI quality: https://www.aiqualityconference.com/

Benjamin Wilms is a developer and software architect at heart, with 20 years of experience. He fell in love with chaos engineering. Benjamin now spreads his enthusiasm and new knowledge as a speaker and author – especially in the field of chaos and resilience engineering.

MLOps podcast #237 with Benjamin Wilms, CEO & Co-Founder of Steadybit. Huge thank you to Amazon Web Services for sponsoring this episode. AWS - https://aws.amazon.com/

// Abstract
How to build reliable systems under unpredictable conditions with chaos engineering.

// Bio
Benjamin has over 20 years of experience as a developer and software architect. He fell in love with chaos engineering 7 years ago and shares his knowledge as a speaker and author. In October 2019, he founded the startup Steadybit with two friends, focusing on developers and teams embracing chaos engineering. He relaxes by mountain biking when he's not knee-deep in complex and distributed code.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links
Website: https://steadybit.com/

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Benjamin on LinkedIn: https://www.linkedin.com/in/benjamin-wilms/

Timestamps:
[00:00] Benjamin's preferred coffee
[00:28] Takeaways
[02:10] Please like, share, leave a review, and subscribe to our MLOps channels!
[02:53] Chaos engineering tl;dr
[06:13] Complex systems for smaller startups
[07:21] Chaos engineering benefits
[10:39] Data chaos engineering trend
[15:29] Chaos engineering vs. ML resilience
[17:57] AWS Trainium and AWS Inferentia ad
[19:00] Chaos engineering tests system vulnerabilities and solutions
[23:24] Data distribution issues across different time zones
[27:07] Expertise is essential in fixing systems
[31:01] Chaos engineering integrated into machine learning systems
[32:25] Pre-CI/CD steps and automating experiments for deployments
[36:53] Chaos engineering emphasizes tool over value
[38:58] Strong integration into observability tools for repeatable experiments
[45:30] Invaluable insights on chaos engineering
[46:42] Wrap up
Chaos engineering is no longer a nice-to-have, as Ananth Movva explains in this episode of the SREpath podcast. His experience with it drove down the number and severity of serious incidents and outages.

He has been at the helm of reliability-focused decision-making at BMO, one of Canada's largest banks, since 2020. Having completed 12 years at the bank, Ananth has seen banking technology evolve from archaic to user-centric, where incidents are taken seriously.

Ananth highlighted the use of chaos principles and tooling to identify future points of failure well ahead of time. He also talked about the challenges of getting developers to integrate chaos into the SDLC. You will not want to miss this conversation!

You can connect with Ananth via LinkedIn.

This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com
For some strange reason, "maintenance" has been in the news quite a bit lately. Is there ever a time when maintenance is enjoyable, or appreciated?

SHOW: 814
SHOW TRANSCRIPT: The Cloudcast #814
SHOW VIDEO: https://youtube.com/@TheCloudcastNET
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW NOTES:
AWS increased the price of longer-running EKS clusters by 6x
Broadcom changes VMware licensing from perpetual to subscription
Broadcom offers security patches to perpetual license customers
Increasing the Kubernetes support window to 1 year
Discovering the XZ backdoor (Oxide and Friends, podcast)

IS MAINTENANCE EVER APPRECIATED OR ENJOYABLE?
Spent the day surrounded by maintenance activities (oil, AC, power-wash)
The costs of maintenance are real, both hard costs and opportunity costs
Maintenance often goes unappreciated and unseen
Naming: Release Notes, Technical Debt, Chaos Engineering

TECHNICAL DEBT VS. MAINTENANCE
Should we encourage a lack of maintenance vs. innovation as a priority?
Should we encourage active maintenance with lower hard costs?
Is there a way to put respect on maintenance? (e.g., OSS maintainers)
Do we undervalue maintenance (e.g., Backup/Recovery, Disaster Recovery, etc.)?
What maintenance best practices do you use? What are the good and bad of them?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @cloudcastpod
Instagram: @cloudcastpod
TikTok: @cloudcastpod
Jelle Niemantsverdriet joins us in this episode to discuss how the mindset around security is evolving, both among organisations and among professionals. My favourite takeaway is that security is on the same path as testing: becoming part of quality in software development.

Connect with Jelle Niemantsverdriet:
https://www.linkedin.com/in/jelleniemantsverdriet
https://twitter.com/jelle_n

References:
Digital Defense Report - https://www.microsoft.com/nl-nl/security/security-insider/microsoft-digital-defense-report-2023
Data Breach Investigations Report (DBIR) - https://www.verizon.com/business/resources/reports/dbir/?CMP=OOH_SMB_OTH_22222_MC_20200501_NA_NM20200079_00001
Sidney Dekker - https://sidneydekker.com
Kelly Shortridge - https://kellyshortridge.com/blog/
Chaos Engineering - https://www.securitychaoseng.com

OUTLINE
00:00:00 - Intro
00:00:25 - Security is a matter of software quality
00:02:19 - Security way of working
00:04:37 - Professional pride
00:06:53 - Layers of defense, or excuse?
00:09:05 - The industrial revolution in IT
00:10:48 - Security as a speciality
00:13:18 - Collaborating with the security department
00:14:29 - Building bridges
00:16:22 - Willingness to listen
00:19:29 - Scenario analysis workshops
00:21:01 - Unpredictable human behaviour
00:23:21 - Seamlessness and friction in security solutions
00:25:28 - Instant cake
00:26:38 - Red, blue and purple teaming
00:28:34 - Exploring the boundaries in AI
00:31:38 - Gamified security
00:32:46 - With risk comes reward
00:36:17 - Security costs vs. benefit
00:38:49 - Frequent password changes
00:41:20 - Verizon Data Breach Investigations Report
00:43:55 - Sidney Dekker - Human error doesn't exist
00:46:23 - Kelly Shortridge - Sensemaking
00:47:14 - Sharing knowledge around security
Chaos engineering is all about resilience and reliability… it just takes the harder path to get there. By injecting random and unpredictable behavior to the point of failure, chaos engineers observe systems' weak points, apply preventative maintenance, and develop a failover plan. Matt Schillerstrom from Harness introduces Ned and Ethan to this wild corner of...
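To make "injecting random and unpredictable behavior" less abstract, here is a minimal, purely illustrative Python sketch of application-level fault injection. Platforms like the one discussed in the episode work at the infrastructure level, but the principle of deliberately adding latency and errors to observe a system's weak points is the same; the injection rates and the fetch_profile function are invented for the example.

```python
import random
import time
from functools import wraps

def inject_faults(error_rate: float = 0.1, max_delay_s: float = 2.0):
    """Decorator that randomly delays or fails the wrapped call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            time.sleep(random.uniform(0, max_delay_s))  # inject random latency
            if random.random() < error_rate:
                raise ConnectionError("injected fault")  # inject a random failure
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults(error_rate=0.2)
def fetch_profile(user_id: int) -> dict:
    # Stand-in for a real downstream call (database, API, cache).
    return {"id": user_id, "name": "example"}

# Exercising the wrapped call shows whether callers retry, time out, or fall over.
for attempt in range(5):
    try:
        print(fetch_profile(42))
    except ConnectionError as exc:
        print(f"attempt {attempt}: {exc}")
```

Whether callers handle the injected failures gracefully is exactly the kind of weak point these experiments are meant to surface before a real outage does.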
On this week's episode, host Conor Bronsdon sits down with Guilherme Sesterheim, SAP DevOps SRE Engineer at AWS. Guilherme delves into applying chaos engineering and DevOps principles to SAP, a domain traditionally seen as risk-averse and resistant to rapid innovation.

With expertise in both open-source technologies and SAP, Guilherme shares how he's bringing modern practices to SAP environments at AWS. He explores how chaos engineering can be used to test and improve the resilience of SAP systems, focusing on HANA, SAP's in-memory database. The discussion also touches on the challenges of integrating these practices within the SAP framework and the broader implications for SAP users and the tech industry.

Episode Highlights:
00:20 What it means to apply chaos engineering to testing an SAP installation
04:05 What it means to have DevOps around SAP
05:58 Guilherme's approach to DevOps practices around SAP
10:01 The challenge of handling installation and migration
11:50 How to start applying chaos engineering to your SAP instance
16:57 The 12 scenarios for injecting failures on SAP
19:24 How Guilherme ended up at AWS working on SAP
23:14 What's next in DevOps that Guilherme is excited about

Show Notes:
Wiring the Winning Organization - IT Revolution

Support the show:
Subscribe to our Substack
Leave us a review
Subscribe on YouTube
Follow us on Twitter or LinkedIn

Offers:
Learn about Continuous Merge with gitStream
Get your DORA Metrics free forever
The resilience discipline of controlled stress-test experimentation in continuous integration/continuous delivery (CI/CD) environments to uncover systemic weaknesses.

CyberWire Glossary link: https://thecyberwire.com/glossary/chaos-engineering
Audio reference link: Farnam Street, 2009. Richard Feynman Teaches You the Scientific Method [Website]. Farnam Street. URL: https://fs.blog/mental-model-scientific-method/
We've reached an inflection point in security. There are a handful of organizations regularly and successfully stopping cyber attacks. Most companies haven't gotten there, however. What separates these two groups? Why does it seem like we're still failing as an industry, despite seeming to collectively have all the tools, intel, and budget we've asked for? Kelly Shortridge has studied this problem in depth. She has created tools (https://www.deciduous.app/), and written books (https://www.securitychaoseng.com/) to help the community approach security challenges in a more logical and structured way. We'll discuss what hasn't worked for infosec in the past, and what Kelly thinks might work as we go into the future. Show Notes: https://securityweekly.com/esw-339
Kelly joins Dale to discuss her new book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly points out that the second part of the title is the most descriptive, and she is not a big fan of the Chaos term that has taken hold. They discuss:
A quick description of Security Chaos Engineering
Is there similarity or overlap with the CCE or CIE approach?
The value of decision trees
Her view of checklists of security controls like CISA's CPG
Lesson 1 - "Start in Nonproduction environments"
The experiment / scientific method approach and how it can start small
The Danger Zone: tight coupling and complex interactions
How should ICS use Chaos Engineering
Ready to inject a little chaos into your systems? Richard talks to Kelly Shortridge about her book Security Chaos Engineering. Kelly discusses the challenges of modern cybersecurity - how do you find weaknesses in your infrastructure and security systems? This leads to a discussion about challenging assumptions by exploring the workflows that exist in your infrastructure today. Exploring the workflows shows where assumptions exist, and that opens the door to testing them. There's sure to be some low-hanging fruit you can deal with, but eventually, you're left with tests that have to be set loose on your system - and you'll find out how resilient you really are!

Links:
Fastly
Security Chaos Engineering
Deciduous

Recorded August 22, 2023
Welcome to another episode of the DevOps Toolchain podcast! In today's episode, we have the pleasure of talking with Daryl Dunn from Gremlin about the exciting topics of Gremlin certification, reliability testing, and chaos engineering. Daryl is a senior solution tech at Gremlin and has extensive experience in the field of testing and automation.

We dive into the world of chaos engineering and how it plays a crucial role in ensuring system resilience and observability in today's cloud-native environments. Daryl shares his insights on how chaos engineering is a controlled process that allows organizations to proactively test their systems' resilience and ability to recover from failures. We discuss the importance of incorporating chaos engineering into the DevOps toolchain to identify and address vulnerabilities before they become critical issues. Daryl also highlights the significance of observability and the crucial role it plays in optimizing system performance and response time.

Take advantage of this episode to upskill in one of the hottest testing topics. Join us as we explore the world of Gremlin certification, reliability testing, and chaos engineering with Daryl Dunn. Let's dive in!
In today's Kubernetes Unpacked, Michael and Kristina catch up with Prithvi Raj and Sayan Mondal to talk about all things Chaos Engineering in the Kubernetes space! We chat about the open source and CNCF incubating project, Litmus, and various other topics including why Chaos Engineering is important, how it can help all organizations, how every engineer can use it, and more.
Speaking at Black Hat 2023, Kelly Shortridge is bringing cybersecurity out of the dark ages by infusing security by design to create secure patterns and practices. It's the subject of her new book on Security Chaos Engineering, and it's a topic that's long overdue for discussion in the field.
Guest: Kelly Shortridge, Senior Principal Engineer in the Office of the CTO at Fastly

Topics:
So what is Security Chaos Engineering?
"Chapter 5. Operating and Observing" is Anton's favorite. One thing that mystifies me, however, is that you outline how to fail with alerts (send too many), but it is not entirely clear how to practically succeed with them. How does chaos engineering help security alerting/detection?
How does chaos engineering (or is it really about software resilience?) intersect with cloud security - is this peanut butter and chocolate, or more like peanut butter and pickles?
How can organizations get started with chaos engineering for software resilience and security?
What is your favorite chaos engineering experiment that you have ever done?
We often talk about using the SRE lessons for security, and yet many organizations do security the 1990s way. Are there ways to use chaos engineering as a forcing function to break people out of their 1990s thinking and time-warp them to 2023?

Resources:
Video (LinkedIn, YouTube)
"Security Chaos Engineering: Sustaining Resilience in Software and Systems" by Kelly Shortridge, Aaron Rinehart
"Cybersecurity Myths and Misconceptions" book
"Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems" book
"Normal Accidents: Living with High-Risk Technologies" book
"Deploy Security Capabilities at Scale: SRE Explains How" (ep85)
"The Good, the Bad, and the Epic of Threat Detection at Scale with Panther" (ep123)
"Can a Small Team Adopt an Engineering-Centric Approach to Cybersecurity?" (ep117)
IKEA Effect
"Modernizing SOC ... Introducing Autonomic Security Operations" blog
In today's episode of Category Visionaries, we speak with Benjamin Wilms, CEO and Co-Founder of Steadybit, a chaos engineering platform that's raised $7.8 million in funding, about why resilience is everything in the modern software economy, and why so many companies struggle to build it into their more complex systems. Working with a small team of dedicated engineers, Steadybit wants to move beyond chaos engineering to create holistic resilience engineering solutions and empower everybody who builds and runs software. We also speak about Benjamin's background as a software developer and what it was like transitioning to the role of CEO, the original inspiration for a resilience-focused tech platform like Steadybit, having a real passion for problem solving and sharing knowledge, and why chaos engineering is key to making sure everything runs the way it should in the real world.

Topics Discussed:
Benjamin's background in software development, and his history of developing resilience-first software solutions
The transition from developer to CEO, and confronting the challenges of time management and decision making
The central role of chaos engineering in building long-term resilience for systems operating in an unpredictable digital space
Why the term "chaos engineering" itself has great marketing potential, and how attracting initial interest is a big part of business growth
How learning from failure is such an important aspect of developing a business, but why that doesn't mean it's always easy
Steadybit is a chaos and resilience engineering platform that helps proactively reduce downtime and provides visibility into systems to detect issues. Your business deserves the best possible preventive action, no matter how complex your system landscape is. Steadybit adds that little extra certainty to your development and testing workflow.

Connect with Benjamin
Chris - For those not familiar with Security Chaos Engineering, how would you summarize it, and what made you decide to author the new book on it?

Nikki - In one of the sections of Security Chaos Engineering, you talk about what a modern security program looks like. Can you talk about what this means compared to security programs maybe 5 to 10 years ago?

Chris - When approaching leadership, it can be tough to sell the concept of being disruptive. What advice do you have for security professionals looking to get buy-in from their leadership to introduce security chaos engineering?

Nikki - One of the hallmarks of chaos engineering is actually building resilience into development and application environments, but people hear "chaos engineering" and don't quite know what to make of it. Can you talk about how security chaos engineering can build resiliency into infrastructure?

Chris - I've cited several of your articles, such as Markets DGAF Security and others. You often take a counter-culture perspective to some of the groupthink in our industry. Why do you think we tend to rally around concepts even when the data doesn't prove them out, and have your views been met with defensiveness among some who hold those views?

Nikki - One of my favorite parts of chaos engineering is the hypothesis-based approach and framework for building a security chaos engineering program. It may seem counter-intuitive to the 'chaos' in 'chaos engineering'. What do you think about the scientific method approach?

Chris - Another topic I've been seeing you write and talk about is increasing the burden/cost on malicious actors to drive down their ROI. Can you touch on this topic with us?
When we peel back the layers of the stack, there's one human characteristic we're sure to find: errors. Mistakes, mishaps, and miscalculations are fundamental to being human, and as such, error is built into every piece of infrastructure and code we create. Of course, learning from our errors is critical in our effort to create functional, reliable tech. But could our mistakes be as important to technological development as our ideas? And what happens when we try to change our attitude towards errors…or remove them entirely?

In this fascinating episode of Traceroute, we start back in 1968, when "The Mother of All Demos" was supposed to change the face of personal computing…before the errors started. We're then joined by Andrew Clay Shafer, a DevOps pioneer who has seen the evolution of "errors" to "incidents" through practices like Scrum, Agile, and Chaos Engineering. We also speak with Courtney Nash, a cognitive neuroscientist and researcher whose Verica Open Incident Database (VOID) has changed the way we look at incident reporting.

Additional Resources
Connect with Amy Tobey: LinkedIn or Twitter
Connect with Fen Aldrich: LinkedIn or Twitter
Connect with John Taylor on LinkedIn
Connect with Courtney Nash on Twitter
Connect with Andrew Clay Shafer on Twitter
Visit Origins.dev for more information

Enjoyed This Episode?
If you did, be sure to follow and share it with your friends! We'd also appreciate a five-star review on Apple Podcasts - it really helps people find the show!

Traceroute is a podcast from Equinix and is a production of Stories Bureau. This episode was produced by John Taylor with help from Tim Balint and Cat Bagsic. It was edited by Joshua Ramsey and mixed by Jeremy Tuttle, with additional editing and sound design by Mathr de Leon. Our theme song was composed by Ty Gibbons.
Kelly Shortridge, Senior Principal Engineer at Fastly, joins Corey on Screaming in the Cloud to discuss their recently released book, Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly explains why a resilient strategy is far preferable to a bubble-wrapped approach to cybersecurity, and how developer teams can use evidence to mitigate security threats. Corey and Kelly discuss how the risks of working with complex systems are perfectly illustrated by Jurassic Park, and Kelly also highlights why it's critical to address both system vulnerabilities and human vulnerabilities in your development environment rather than pointing fingers when something goes wrong.

About Kelly
Kelly Shortridge is a senior principal engineer at Fastly in the office of the CTO and lead author of Security Chaos Engineering: Sustaining Resilience in Software and Systems (O'Reilly Media). Shortridge is best known for their work on resilience in complex software systems, the application of behavioral economics to cybersecurity, and bringing security out of the dark ages. Shortridge has been a successful enterprise product leader as well as a startup founder (with an exit to CrowdStrike) and investment banker. Shortridge frequently advises Fortune 500s, investors, startups, and federal agencies and has spoken at major technology conferences internationally, including Black Hat USA, O'Reilly Velocity Conference, and SREcon. Shortridge's research has been featured in ACM, IEEE, and USENIX, spanning behavioral science in cybersecurity, deception strategies, and the ROI of software resilience. They also serve on the editorial board of ACM Queue.

Links Referenced:
Fastly: https://www.fastly.com/
Personal website: https://kellyshortridge.com
Book website: https://securitychaoseng.com
LinkedIn: https://www.linkedin.com/in/kellyshortridge/
Twitter: https://twitter.com/swagitda_
Bluesky: https://shortridge.bsky.social

Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Have you listened to the new season of Traceroute yet? Traceroute is a tech podcast that peels back the layers of the stack to tell the real, human stories about how the inner workings of our digital world affect our lives in ways you may have never thought of before. Listen and follow Traceroute on your favorite platform, or learn more about Traceroute at origins.dev. My thanks to them for sponsoring this ridiculous podcast.

Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. My guest today is Kelly Shortridge, who is a Senior Principal Engineer over at Fastly, as well as the lead author of the recently released Security Chaos Engineering: Sustaining Resilience in Software and Systems. Kelly, welcome to the show.

Kelly: Thank you so much for having me.

Corey: So, I want to start with the honest truth that in that title, I think I know what some of the words mean, but when you put them together in that particular order, I want to make sure we're talking about the same thing. Can you explain that like I'm five, as far as what your book is about?

Kelly: Yes. I'll actually start with an analogy I make in the book, which is, imagine you were trying to rollerblade to some destination.
Now, one thing you could do is wrap yourself in a bunch of bubble wrap and become the bubble person, and you can waddle down the street trying to make it to your destination on the rollerblades, but if there's a gust of wind or a dog barks or something, you're going to flop over, you're not going to recover. However, if you instead do what everybody does, which is, you know, kneepads and other things that keep you flexible and nimble, then if there's a gust of wind, you can kind of be agile, navigate around it; if a dog barks, you just roller-skate around it; you can reach your destination. The former, the bubble person, that's a lot of our cybersecurity today. It's just keeping us very rigid, right? And then the alternative is resilience, which is the ability to recover from failure and adapt to evolving conditions.

Corey: I feel like I am about to torture your analogy to death because back when I was in school in 2000, there was an annual tradition at the school I was attending before failing out, where a bunch of us would paint ourselves green every year and then bike around the campus naked. It was the green bike ride. So, one year I did this on rollerblades. So, if you wind up looking—there's the bubble wrap, there's the safety gear, and then there's wearing absolutely nothing, which feels—

Kelly: [laugh]. Yes.

Corey: —kind of like the startup approach to InfoSec. It's like, "It'll be fine. What's the worst that happens?" And you're super nimble, super flexible, until suddenly, oops, now I really wish I'd done things differently.

Kelly: Well, there's a reason why I don't say rollerblade naked, which, other than it being rather visceral, what you described is what I've called YOLOSec before, which is not what you want to do. Because the problem when you think about it from a resilience perspective, again, is you want to be able to recover from failure and adapt. Sure, you can oftentimes move quickly, but you're probably going to erode software quality over time, so at a certain point, there's going to be some big incident, and suddenly, you aren't fast anymore, you're actually pretty slow. So, there's this kind of happy medium where you have enough, I would say, security by design—we can talk about that a bit if you want—where you have enough of the security by design baked in, and you can think of it as guardrails, that you're able to withstand and recover from any failure. But yeah, going naked, that's a recipe for not being able to rollerblade, like, ever again, potentially [laugh].

Corey: I think, on some level, that the correct dialing in of security posture is going to come down to context, in almost every case. "I'm building something in my spare time in the off hours" does not need the same security posture—mostly—as "we are a bank." It feels like there's a very wide gulf between those two extremes. Unfortunately, I find that there's a certain tone-deafness coming from a lot of the security industry around oh, everyone must have security as their number one thing, ever. I mean, with my clients, whose AWS bills I fix, I have to care about security contractually, but the secrets that I hold are boring: how much money certain companies pay another very large company. Yes, I'll get sued into oblivion if that leaks, but nobody dies. Nobody is having their money stolen as a result. It's slightly embarrassing in the tech press for a cycle and then it's over and done with.
That's not the same thing as a brief stint I did running tech ops at Grindr ten years ago where, leak that database and people will die. There's a strong difference between those threat models, and on some level, being able to act accordingly has been one of the more eye-opening approaches to increasing velocity in my experience. Does that align with the thesis of your book, since my copy has not yet arrived for this recording?

Kelly: Yes. The book, I am not afraid to say "it depends" in the book, and you're right, it depends on context. I actually talk about this resilience potion recipe that you can check out if you want, these ingredients so we can sustain resilience. A key one is defining your critical functions, just what is your system's reason for existence, and that is what you want to make sure it can recover and still operate under adverse conditions, like you said. Another example I give all the time is most SaaS apps have some sort of reporting functionality. Guess what? That's not mission-critical. You don't need the utmost security on that, for the most part. But if it's processing transactions, yeah, probably you want to invest more security there. So yes, I couldn't agree more that it's context-dependent, and oh my God, does the security industry ignore that so much of the time; it's been my gripe for, I feel like, as long as I've been in the industry.

Corey: I mean, there was a great talk that Netflix gave years ago where they mentioned in passing that all developers have root in production. And that's awesome, and the person next to him was super excited, and I looked at their badge, and holy hell, they worked at an actual bank. That seems like a bad plan. But talking to the Netflix speaker after the fact, Dave Hahn, something that I found extraordinarily insightful was that, yeah, we just isolate off the PCI environment so the rest and sensitive data lives in its own compartmentalized area. So, at that point, yeah, you're not going to be able to break much in that scenario. It's like, that would have been helpful context to put in the talk. Which I'm sure he did, but my attention span had tripped out and I missed that. But that's, on some level, constraining blast radius, and not having compliance and regulatory issues extending to every corner of your environment really frees you up to do things appropriately. But there are some things where you do need to care about this stuff, regardless of how small the surface area is.

Kelly: Agreed. And I introduce the concept of the effort investment portfolio in the book, which is basically, where does it matter to invest effort and where can you, kind of, like, maybe save some resources up. I think one thing you touched on, though, is we're really talking about isolation, and I actually think people don't think about isolation in as detailed or maybe as expansive a way as they could. Because we want both temporal and logical and spatial isolation. What you talked about is, yeah, there are some cases where you want to isolate data, you want to isolate certain subsystems, and that could be containers, it could also be AWS security groups. It could take a bunch of different forms, it could be something like RLBox in WebAssembly land.
But something that I really try to highlight in the book is that there's actually a huge opportunity for security engineers, starting from the design of a system, to really think about how we can infuse different forms of isolation to sustain resilience.Corey: It's interesting that you use the word investment. When fixing AWS bills for a living, I've learned over the last almost seven years now of doing this that cost and architecture and cloud are fundamentally the same thing. And resilience is something that comes with a very real cost, particularly when you start looking at what the architectural choices are. And one of the big reasons that I only ever work on a fixed-fee basis is because if I'm charging for a percentage of savings or something, it inspires me to say really uncomfortable things like, "Backups are for cowards." And, "When was the last time you saw an entire AWS availability zone go down for so long that it mattered? You don't need to worry about that." And it does cut off an awful lot of cost issues, at the price of making the environment more fragile.That's where the context thing starts to come in. I mean, in many cases, if AWS is having a bad day in a given region, well, does your business need that workload to be functional? For my newsletter, I have a publication system that's single-homed out of the Oregon region. If that whole thing goes down for multiple days, I'm writing that week's issue by hand because I'm going to have something different to talk about anyway. For me, there is no value in making that investment. But for companies, there absolutely is, but there also seems to be a lack of awareness around how much a reasonable investment in that area is, when you start making that investment, and most critically, when you stop.Kelly: I think that's a good point, and luckily, what's on my side is the fact that there's a lot of just profligate spending in cybersecurity, and [laugh] what I'm really focused on is how we can spend those investments better. And I actually think there's an opportunity in many cases to ditch a ton of cybersecurity tools and focus more on some of the stuff you talked about. I agree, by the way. I've seen some threat models where it's like, well, AWS, all regions go down. I'm like, at that point, we have, like, a severe, bigger-than-whatever-you're-thinking-about problem, right?
Like, if you do have backups, I can totally see your argument that backups are for cowards, but if you do have them, like, maybe you conduct experiments to make sure that they're available when you need them, and the same thing, even on the [unintelligible 00:10:21] side—Corey: No one cares about backups, but everyone really cares about restores, suddenly, right after—Kelly: Yeah.Corey: —they really should have cared about backups.Kelly: Exactly. So, I think it's looking at those experiments where it's like, okay, you have these basic assumptions in place that you assume to be invariants, or assume that they're going to bail you out if something goes wrong. Let's just verify. That's a great place to start because I can tell you—I know you've been to the RSA hall floor—how many cybersecurity teams are actually assessing the efficacy and experimenting to see if those tools really help them during incidents. It's pretty few.Corey: Oh, vendors do not want to do those analyses. They don't want you to do those analyses, either, and if you do, for God's sakes, shut up about it. They're trying to sell things here, mostly firewalls.Kelly: Yeah, cybersecurity vendors aren't necessarily happy about my book and what I talk about because I have almost this ruthless focus on evidence, and [unintelligible 00:11:08] cybersecurity vendors kind of thrive on a lack of evidence. So.Corey: There's so much fear, uncertainty, and doubt in that space, and I do feel for them. It's a hard market to sell in without having to talk about here's the thing that you're defending against. In my case, it's easy to sell "the AWS bill is high" because if I have to explain why more or less setting money on fire is a bad thing, I don't really know what to tell you. I'm going to go look for a slightly different customer profile. That's not really how it works in security. I'm sure there are better go-to-market approaches, but they're hard to find, at least ones that work holistically.Kelly: There are. And one of my priorities with the book was to really enumerate how many opportunities there are to take software engineering practices that people already know, let's say something like type systems even, and how those can actually help sustain resilience. Even things like integration testing or infrastructure as code, there are a lot of opportunities just to extend what we already do for systems reliability to sustain resilience against things that aren't attacks and just make sure that, you know, we cover a few of those cases as well. A lot of it should be really natural to software engineering teams. Again, security vendors don't like that because it turns out software engineering teams don't particularly like security vendors.Corey: I hadn't noticed that. I do wonder, though, for those who are unaware: chaos engineering started off as breaking things on purpose, which I feel like came from one person who had a really good story and thought of it super quickly when they were about to get fired. Like, "No, no, it's called Chaos Engineering." Good for them. It's now a well-regarded discipline. But I've always heard of it in the context of reliability: "Oh, you think your site is going to work if the database falls over? Let's push it over and see what happens." How does that manifest in a security context?Kelly: So, I will clarify, I think that's a slight misconception. It's really about fixing things in production, and that's the end goal. I think we should not break things just to break them, right?
But I'll give a simple example, which I know is based on what Aaron Rinehart conducted at UnitedHealth Group, which is, okay, let's inject a misconfigured port as an experiment and see what happens, end-to-end. In their case, the firewall only detected the misconfigured port 60% of the time, so 60% of the time, it works every time.But it was actually the, like, very common cloud configuration management tool that caught the change and alerted responders. So, it's that kind of thing where we're still trying to verify those assumptions that we have about our systems and how they behave, again, end-to-end. In a lot of cases, again, with security tools, they are not behaving as we expect. But I still argue security is just a subset of software quality, so if we're experimenting to verify, again, our assumptions and observe system behavior, we're benefiting software quality, and security is just a subset of that. Think about C code, right? It's not like there's, like, a healthy memory corruption, so it's bad for both quality and security reasons.Corey: One problem that I've had in the security space for a while is—let's [unintelligible 00:14:05] on this to AWS for a second because that is the area in which I spend most of my time, which probably explains a lot about my personality challenges. But the problem that I keep smacking into is if I go ahead and configure everything the way that I should according to best practices and the rest, I wind up with a firehose torrent of information in terms of CloudTrail logs, et cetera. And it's expensive in its own right. But then to sort through it or to do a lot of things in security, there are basically two options. I can either buy a vendor's product, which generally tends to start around $12,000 a year and goes up rapidly from there. On my current $6,000-a-year bill, that's twice as much for security monitoring as for the infrastructure itself. Okay.Or alternately, find a bunch of different random scripts and tools on GitHub of wildly diverging quality and sort of hope for the best on that. It feels like there's nothing in between. And the reason I care about this is not because I'm cheap but because when you have an individual learner who is either a student or a career switcher or someone just trying to experiment with this, you want them to begin as you want them to go on, and things that are no money for an enterprise are all the money to them. They're going to learn to work with the tools that they can afford. That feels like it's a big security swing and a miss. Do you agree or disagree? What's the nuance I'm missing here?Kelly: No, I don't think there's nuance you're missing. I think security observability, for one, isn't a buzzword that particularly exists. I've been trying to make it a thing, but I'm solely one individual screaming into the void. But observability just hasn't been a thing. We haven't really focused on: okay, so we get data, and what do we do with it?And I think, again, from a software engineering perspective, there's a lot we can do. One, we can just avoid duplicating efforts. We can treat observability of any sort of issue as similar, whether that's an attack or a performance issue.
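To make the kind of end-to-end experiment described above concrete, here is a minimal sketch in Python, assuming boto3 credentials against a sandbox AWS account. The security group ID and the alert_fired() probe are hypothetical placeholders, and this is not the tooling Rinehart's team actually used; it only shows the inject-observe-rollback shape of such an experiment.

```python
import time

import boto3

# Hypothetical sandbox security group; never point this at production.
GROUP_ID = "sg-0123456789abcdef0"

ec2 = boto3.client("ec2")


def alert_fired(group_id: str) -> bool:
    # Hypothetical probe: replace with a query against your SIEM,
    # config-management alerts, or ticketing system.
    return False


def run_experiment() -> None:
    rule = {
        "IpProtocol": "tcp",
        "FromPort": 23,  # telnet: a port nothing legitimate should open
        "ToPort": 23,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }
    # Inject the fault: open the misconfigured port.
    ec2.authorize_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[rule])
    try:
        time.sleep(300)  # give the detection tooling time to react
        if alert_fired(GROUP_ID):
            print("assumption held: the misconfiguration was detected")
        else:
            print("assumption falsified: nothing noticed the open port")
    finally:
        # Always roll back the injected fault, even if the probe raises.
        ec2.revoke_security_group_ingress(GroupId=GROUP_ID, IpPermissions=[rule])


if __name__ == "__main__":
    run_experiment()
```

Either outcome is useful evidence: a detection confirms the assumption end-to-end, and a miss tells you exactly which signal your observability is lacking.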
I think this is another place where security, or any sort of chaos experiment, shines, though, because if you have an idea of an adverse scenario you care about, you can actually see how it manifests in the logs and you can start to figure out, like, what signals we actually need to be looking for, what signals matter, to be able to narrow it down. Which, again, involves time and effort, but also, I can attest, when you're buying the security vendor tool and, in theory, absolving yourself of some of that time and effort, it's maybe, maybe not, because it can be hard to understand what the outcomes are or what the outputs are from the tool, and it can also be very difficult to tune it and to be able to explain some of the outputs. It's kind of like trading upfront effort versus long-term overall overhead, if that makes sense.Corey: It does. On that note, the title of your book includes the magic key phrase 'sustaining resilience.' I have found that security effort and investment tend to resemble a fire drill in—Kelly: [laugh].Corey: —an awful lot of places, where, "We care very much about security," says the company, right after they very clearly failed to care about security, and I know this because I'm reading an email about a breach that they've just sent me. And then there's a whole bunch of running around and hair-on-fire moments. But then there's a new shiny that always comes up, a new strategic priority, and it falls by the wayside again. What do you see that drives that sustained effort and focus on resilience in a security context?Kelly: I think it's really making sure you have a learning culture, which sounds very [unintelligible 00:17:30], but, again, things like experiments can help just because when you do simulate those adverse scenarios and you see how your system behaves, it's almost like running an incident, and you can use that as very fresh, kind of, like, collective memory. And I even strongly recommend starting off with prior incidents and simulating those, just to see, like, hey, did the improvements we made actually help? If they didn't, that can be kind of another fire under the butt, so to speak, to continue investing. So, definitely in practice—and there are some case studies in the book—it can be really helpful just to, kind of, sustain that memory and sustain that learning and keep things feeling a bit fresh. It's almost like prodding the nervous system a little, just so it doesn't go back to that complacent and convenient feeling.Corey: It's one of the hard problems because—I'm sure I'm going to get castigated for this by some of the listeners—but computers are easy, particularly compared to the people. There are deterministic ways to solve almost any computer problem, but people are always going to be a little bit different, and getting them to perform the same way today that they did yesterday is an exercise in frustration. Changing the culture, changing the approach and the attitude that people take toward a lot of these things feels, from my perspective, like, something of an impossible job. Cultural transformations are things that everyone talks about, but it's rare to see them succeed.Kelly: Yes, and something that I very strongly wove throughout the book is that if your security solutions rely on human behavior, they're going to fail. We want to either reduce hazards or eliminate hazards by design as much as possible. So, my view is very much, again, like, can you make processes more repeatable? That's going to help security.
I definitely do not want anyone to take away from my book that they need to have, like, a thousand hours of training to change hearts and minds; if they do, then they have completely misunderstood most of the book.The idea is very much: what are the practices that we want for other outcomes anyway—again, reliability or faster time to market—and how can we harness those to also improve resilience or security at the same time? It's very much trying to think about those opportunities rather than, you know, trying to drill into people's heads, like, "Thou shalt not," or, "Thou shall."Corey: Way back in 2018, you gave a keynote at some conference or another and you built the entire thing on the story of Jurassic Park, specifically Ian Malcolm as one of your favorite fictional heroes, and you tied it into security in a bunch of different ways. You hadn't written this book then, unless the authorship process is way longer than I think it is. So, I'm curious to get your take on what Jurassic Park can teach us about software security.Kelly: Yes, so I talk about Jurassic Park as a reference throughout the book, frequently. I've loved that book since I was a very young child. Jurassic Park is a great example of a complex system gone wrong because you can't point to any one thing. Like, there's Dennis Nedry, you know, messing up the power system, but then there's also the software, which was looking for a very specific count of dinosaurs; they didn't anticipate the count could go higher. Like, there are so many different factors that influenced it, you can't actually blame just, like, human error or point fingers at one thing.That's a beautiful example of how things go wrong in our software systems because, like you said, there's this human element, and then there's also how the humans interact and how the software components interact. But with Jurassic Park, too, I think the great thing is dinosaurs are going to do dinosaur things, like eating people, and there are also equivalents in software, like C code. C code is going to do C code things, right? It's not a memory-safe language, so we shouldn't be surprised when something goes wrong. We need to prepare accordingly.Corey: "How could this happen? Again?" Yeah.Kelly: Right. At a certain point, it's like, there's probably no way to sufficiently introduce isolation for dinosaurs unless you put them in a bunker where no one can see them, and it's the same thing sometimes with things like C code. There's just no amount of effort you can invest, and you're just kind of investing for a really unclear and generally not fortuitous outcome. So, I like it as kind of this analogy to think about, okay, where do our effort investments make sense, and where is it sometimes like, we really just do need to refactor because we're dealing with dinosaurs here.Corey: When I was a kid, that was one of my favorite books, too. The problem is, I didn't realize I was getting a glimpse of my future at a number of crappy startups that I worked at. Because you have John Hammond, who was the owner of the park, talking constantly about how, "We spared no expense," but then you look at what actually happened, and he spared every frickin' expense. You have one IT person who is so criminally underpaid that smuggling dinosaur embryos off the island becomes a viable strategy. He wound up going, "Oh, we couldn't find the right DNA, so we're just going to, like, splice some other random stuff in there.
It'll be fine."Then you have the massive overconfidence because it sounds very much like he had this almost Muskian desire to fire anyone who disagreed with him, and yeah, there was a certain lack of investment that could have been made, despite loud protestations to the contrary. I'd say that he is the root cause; he is the proximate reason for the entire failure of the park. But I'm willing to entertain disagreement on that point.Kelly: I think there are other individuals, like Dr. Wu, if you recall, like, deciding to do the frog DNA and not thinking that maybe something could go wrong. I think there was a lot of overconfidence, which, you're right, we do see a lot in software. So, I think another very important lesson is that incentives matter and incentives are very hard to change, kind of like what you talked about earlier. It doesn't mean that we shouldn't include incentives in our threat model.So, like, in the book I talk about how our threat models should include things like, maybe, yeah, people are underpaid, or there is a ton of pressure to deliver things quickly or, you know, do things as cheaply as possible. That should be just as much a part of our threat models as all of the technical stuff, too.Corey: I think that there's a lot that was in that movie that was flat-out wrong. For example, one of the kids—I forget her name; it's been a long time—was logging in and said, "Oh, this is Unix. I know Unix." And having learned Unix as my first basically professional operating system, "No, you don't. No one knows Unix. They get very confused at some point; the question is just how far down which rabbit hole it is."I feel so sorry for that kid. I hope she wound up seeking therapy when she was older to realize that, no, you don't actually know Unix. It's not that you're bad at computers, it's that Unix is user-hostile, actively so. Like the raptors: that's the better metaphor when everything winds up shaking out.Kelly: Yeah. I don't disagree with that. The movie definitely takes many liberties. I think what's interesting, though, is that Michael Crichton, specifically, when he talks about writing the book—I don't know how many people know this—dinosaurs were just a mechanism. He knew people would want to read it in airports.What he cared about was really communicating the danger of complex systems, and how, if you don't respect them and respect that interactivity and that it can baffle and surprise us, like, things will go wrong. So, I actually find it kind of beautiful in a way that the dinosaurs were almost like an afterthought. What he really cared about was exactly what we deal with all the time in software: when things go wrong with complexity.Corey: Like one of his other books, Airframe, talked about an air disaster. There's a bunch of contributing factors and the rest, and for some reason, that did not receive the wild acclaim that Jurassic Park did to become a cultural phenomenon that we're still talking about, what, 30 years later.Kelly: Right. Dinosaurs are very compelling.Corey: They really are. I have to ask, though—this is the joy of having a kid who is almost six—what is your favorite dinosaur? Not a question most people get asked very often, but I am going to trot that one out.Kelly: No. Oh, that is such a good question. Maybe a Deinonychus.Corey: Oh, because they get so angry they spit and kill people? That's amazing.Kelly: Yeah.
And I like that kind of, like, nimble, smarter one, and also the fact that most of the smaller ones allegedly had feathers; I just love this idea of, like, feather-ful murder machines. I have the classic, like, nerd kid syndrome, though, where I read all these dinosaur names as a kid and I've never pronounced them out loud. So, I'm sure there are others—Corey: Yep.Kelly: —that I would just word salad. But honestly, it's hard to go wrong with choosing a favorite dinosaur.Corey: Oh, yeah. I'm sure some paleontologist is sitting out there in the field on a dig somewhere listening to this podcast, just getting very angry at our pronunciation and things. But for God's sake, I call the database Postgres-squeal. Get in line. There's a lot of that out there, but looking at complex system failures and different contributing factors and the rest—that's what makes things interesting.I think that the idea of a root cause is almost always incorrect. "Okay, who tripped over the buried landmine?" is not the interesting question. It's, "Who buried the thing?" What were all the things that wound up contributing to this? And you can't even frame it that way in the blaming context, just because you start doing that and people clam up, and good luck figuring out what really happened.Kelly: Exactly. So much of what the cybersecurity industry is focused on is how to assign blame. And it's, you know, the marketing person clicked on a link. And it's like, they do that, like, thousands of times a month, and the one time, suddenly, they were stupid for doing it? That doesn't sound right.So, I'm a big fan of, yes, vanquishing root cause, thinking about contributing factors, and in particular, in any sort of incident review, you have to think about: was there a design or process problem? You can't just think about the human behavior; you have to think about where the opportunities are for us to design things better, to make the secure way more of the default way.Corey: When you talk about resilience and reliability and big, notable outages, most forward-thinking companies are going to go and do a variety of incident reviews and disclosures around everything that happened, depending upon levels of trust and whether you're NDA'ed or not, and how much gets public is going to vary from place to place. But from a security perspective, that feels like the sort of thing that companies will clam up about and never say a word.Kelly: Yes.Corey: Because I can wind up pouring a couple of drinks into people and get the real story of outages, or the AWS bill, but security stuff, they start to wonder if I'm a state actor, on some level. When you were building all of this, how did you wind up getting people to talk candidly and forthrightly about issues that, if it became known they were the ones talking about this in public, would almost certainly have negative career impact for them?Kelly: Yes, so that's almost like a trade secret, I feel like. A lot of it is, yes, over the years, talking with people, generally at a conference where, you know, things are tipsy. I never want to betray confidentiality, to be clear, but certainly pattern-matching across people's stories.Corey: Yeah, we're both in positions where if even the hint of "they can't be trusted" enters the ecosystem, I think both of our careers explode and never recover. Like it's—Kelly: Exactly.Corey: —yeah. Oh, yeah. "They play fast and loose with secrets" is never the reputation you want as a professional.Kelly: No.
No, definitely not. So, it's much more pattern matching and trying to generalize. But again, a lot of what can go wrong is not that different when you think about a developer being really tired and making a bunch of mistakes versus an attacker. A lot of times they're very much the same, so luckily there's commonality there.I do wish the security industry was more forthright and less clandestine because, frankly, all of the public postmortems that are out there about performance issues are just such, such a boon for everyone else to improve what they're doing. So, that's a change I wish would happen.Corey: So, I have to ask, given that you talk about security, chaos engineering, and resilience—and of course, software and systems—all in the title of the O'Reilly book, who is the target audience for this? Is it folks who have the word security featured three times in their job title? Is it folks who are new to the space? Where does your target audience start and stop?Kelly: Yes, so I have kept it pretty broad and it's anyone who works with software, but I'll talk about the software engineering audience because that is, honestly, probably the audience I would most love to read the book, because I firmly believe that there's so much that software engineering teams can do to sustain resilience and security, and they don't have to be security experts. So, I've tried to demystify security, make it much less arcane, even down to, like, how attackers, you know, they have their own development lifecycle. I try to demystify that, too. So, it's very much for any team, especially, like, platform engineering teams, SREs, to think about, hey, what are some of the things maybe I'm already doing that I can extend to cover, you know, the security cases as well? So, I would love for every software engineer to check it out to see, like, hey, what are the opportunities for me to just do things slightly differently and have these great security outcomes?Corey: I really want to thank you for taking the time to talk with me about how you view these things. If people want to learn more, where's the best place for them to find you?Kelly: Yes, I have all of the social media, which is increasingly fragmented, [laugh] I feel like, but I also have my personal site, kellyshortridge.com. The official book site is securitychaoseng.com as well. But otherwise, find me on LinkedIn, Twitter, [Mastodon 00:30:22], Bluesky. I'm probably blanking on the others. There's probably already a new one while we've spoken.Corey: Blue-ski is how I insist on pronouncing it as well, while we're talking about—Kelly: Blue-ski?Corey: Funhouse pronunciation on things.Kelly: I like it.Corey: Excellent. And we will, of course, put links to all of those things in the [show notes 00:30:37]. Thank you so much for being so generous with your time. I really appreciate it.Kelly: Thank you for having me and being a fellow dinosaur nerd.Corey: [laugh]. Kelly Shortridge, Senior Principal Engineer at Fastly. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment about how our choice of dinosaurs is incorrect, then put the computer away and struggle to figure out how to open a door.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Chaos engineering is a discipline within the field of software engineering that focuses on testing and improving the resilience and stability of a system by intentionally introducing controlled instances of chaos and failure. The primary goal of chaos engineering is to identify and address potential weaknesses and vulnerabilities in a system, ultimately making it more resilient. From the episode Chaos Engineering with Uma Mukkara on Software Engineering Daily.
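That loop of define steady state, inject a controlled failure, observe, and restore is easy to see in miniature. Here is a self-contained toy sketch in Python; every name in it is invented for illustration, and real platforms (Chaos Monkey, LitmusChaos, Gremlin) add far more safety machinery.

```python
import random


class FlakyDependency:
    """Stands in for a downstream service we can knock over on purpose."""

    def __init__(self) -> None:
        self.up = True

    def query(self) -> str:
        if not self.up:
            raise ConnectionError("dependency down")
        return "fresh data"


def handle_request(dep: FlakyDependency) -> str:
    # The behavior under test: degrade gracefully instead of erroring out.
    try:
        return dep.query()
    except ConnectionError:
        return "cached data"  # fallback path


def chaos_experiment(runs: int = 100) -> None:
    dep = FlakyDependency()
    for _ in range(runs):
        dep.up = random.random() > 0.2  # inject failure ~20% of the time
        # Steady-state hypothesis: every request gets a usable response.
        assert handle_request(dep) in ("fresh data", "cached data")
    dep.up = True  # restore steady state
    print("hypothesis held: every request got a usable response")


if __name__ == "__main__":
    chaos_experiment()
```

If the assertion ever fails, the experiment has found a weakness worth fixing before a real outage finds it for you.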
What is security chaos engineering? You may remember Kelly Shortridge, our very first guest, who came on the show to talk about behavioral economics and cybersecurity. Well, Kelly is back to talk about her new book, "Security Chaos Engineering: Sustaining Resilience in Software and Systems". Security chaos engineering is derived from chaos engineering, a relatively new discipline in software development that seeks to test distributed computing systems to ensure that they withstand unexpected disruptions. It's all about resilience, in other words. Security chaos engineering seeks to do the same for the security of such software systems. Kelly breaks down her book during a lively conversation featuring an opinion or two from her cat, Link (yes, a Zelda reference!): who should read this book; resilience in software and systems; systems-oriented security; architecting and designing; building and delivering; operating and observing (Allan's favorite chapter, as it intersects with one of his Zero Trust tenets); responding and recovering; platform resilience engineering; security chaos experiments (a very fun chapter!); and case studies. Note that the book is peppered with references and quotes from other disciplines. We would expect no less from Kelly. Sponsored by our good friends at Dazz: Dazz takes the pain out of the cloud remediation process using automation and intelligence to discover, reduce, and fix security issues—lightning fast. Visit Dazz.io/demo and see for yourself.
Watch on YouTube About the show Sponsored by InfluxDB from Influxdata. Connect with the hosts Michael: @mkennedy@fosstodon.org Brian: @brianokken@fosstodon.org Show: @pythonbytes@fosstodon.org Join us on YouTube at pythonbytes.fm/live to be part of the audience. Usually Tuesdays at 11am PT. Older video versions available there too. Brian #1: huak - A Python package manager written in Rust. Inspired by Cargo. Suggested by Owen. Tons of workflows: activate - activate a virtual environment; add - add a dependency to a project (pip install it into your virtual environment and add it to the dependency list in pyproject.toml); test - run pytest; update - update dependencies; lint - run ruff, installing it first if necessary; fix - autofix fixable lint conflicts; build - build a wheel in an isolated virtual environment using hatchling. Honestly, I was considering building my own workflow tool, but this is darned close to what I want. Even though it's still "in an experimental state". There are rough edges (ruff edges, get it), but still, way cool. I just don't know how to pronounce it. Is it like "walk", or more like "whack"? Michael #2: PSF expresses concerns about a proposed EU law that may make it impossible to continue providing Python and PyPI to the European public. After reviewing the proposed Cyber Resilience Act and Product Liability Act, the PSF has found issues that put the mission of our organization and the health of the open-source software community at risk. As currently written, the authors of open-source components might bear legal and financial responsibility for the way their components are applied in someone else's commercial product. The risk of huge potential costs would make it impossible in practice for us to continue to provide Python and PyPI to the European public. Brian #3: ChaosToolkit. Suggested by the maintainer, Sylvain Hellegouarch. Declare and store your Chaos Engineering experiments as JSON/YAML files so you can collaborate on and orchestrate them as any other piece of code; extensible through an open API; can be automated in a CI/CD pipeline. Michael #4: PEP 711 – PyBI: a standard format for distributing Python Binaries. "Like wheels, but instead of a pre-built python package, it's a pre-built python interpreter" Joke: It's the effort that counts
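For a sense of what ChaosToolkit's declarative style can look like, here is a sketch that generates an experiment file from Python so it could be parameterized in a CI/CD pipeline. The field names follow the Chaos Toolkit experiment format as documented (title, steady-state-hypothesis, method, rollbacks), but treat the exact schema, the endpoint URL, and the kubectl action as assumptions to check against the project docs.

```python
import json

# Hypothetical experiment: kill one replica, then verify the service's
# health endpoint still answers. URL and kubectl selector are placeholders.
experiment = {
    "title": "Service still answers when one replica dies",
    "description": "Kill a replica and verify the health endpoint stays green.",
    "steady-state-hypothesis": {
        "title": "Health endpoint returns 200",
        "probes": [{
            "type": "probe",
            "name": "health-check",
            "tolerance": 200,
            "provider": {"type": "http", "url": "http://localhost:8080/health"},
        }],
    },
    "method": [{
        "type": "action",
        "name": "kill-one-replica",
        "provider": {
            "type": "process",
            "path": "kubectl",
            "arguments": "delete pod -l app=web --wait=false",
        },
    }],
    "rollbacks": [],
}

with open("experiment.json", "w") as f:
    json.dump(experiment, f, indent=2)

# The file would then be executed with the Chaos Toolkit CLI:
#   chaos run experiment.json
```

Because the experiment is just a versionable file, it can be reviewed, diffed, and scheduled like any other piece of code, which is the point the show notes make.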
All links and images for this episode can be found on CISO Series. Is chaos engineering the secret sauce to creating a resilient organization? Purposefully disrupt your architecture to allow for early discovery of weak points. Can we take it even further, to the company environment, beyond even a tabletop exercise? How far can we test our limits while still allowing the business to operate? This week's episode is hosted by me, David Spark (@dspark), producer of CISO Series and Andy Ellis (@csoandy), operating partner, YL Ventures. Our sponsored guest is Mike Wiacek, CEO, Stairwell. Thanks to our podcast sponsor, Stairwell The standard cybersecurity blueprint is a roadmap for attackers to test and engineer attacks. With Inception, organizations can operate out of sight, out of band, and out of time. Collect, search, and analyze every file in your environment – from malware and supply chain vulnerabilities to unique, low-prevalence files and beyond. Learn about Inception.
Chaos Engineering started in the mid-2000s. It was made famous by the Netflix engineering team through an internal app they developed, called Chaos Monkey, that randomly destroyed pieces of their customer-facing infrastructure, on purpose, so that their network architects could understand resilience engineering down deep in their core. But the concept is much more than simply destroying production systems to see what will happen. It elevates the idea of regression testing to the level of the scientific method, designed to uncover potential and unknown architectural weaknesses that may cause catastrophic failure. I make the case that the CSO should probably own that functionality.
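The core of the Chaos Monkey idea fits in a few lines. Below is a toy Python sketch, not Netflix's actual implementation (which worked per auto-scaling group, with schedules, opt-outs, and reporting): it terminates one random EC2 instance that has opted in through a hypothetical chaos=enabled tag, and it assumes boto3 credentials for a sandbox account.

```python
import random

import boto3


def eligible_instances(ec2) -> list[str]:
    # Only instances that explicitly opted in via a hypothetical tag.
    resp = ec2.describe_instances(
        Filters=[
            {"Name": "tag:chaos", "Values": ["enabled"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]


def unleash_the_monkey() -> None:
    ec2 = boto3.client("ec2")
    victims = eligible_instances(ec2)
    if not victims:
        print("no opted-in instances; the monkey goes hungry")
        return
    victim = random.choice(victims)
    print(f"terminating {victim}")
    ec2.terminate_instances(InstanceIds=[victim])


if __name__ == "__main__":
    unleash_the_monkey()
```

The value is not in the termination itself but in what the surviving system does next: if customers notice, the architecture, not the monkey, is the problem.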
In complex service-oriented architectures, failure can happen in individual servers and containers, then cascade through your system. Good engineering takes into account possible failures. But how do you test whether a solution actually mitigates failures without risking the ire of your customers? That's where chaos engineering comes in, injecting failures and uncertainty into complex systems so your team can see where your architecture breaks. On this sponsored episode, our fourth in the series with Intuit, Ben and Ryan chat with Deepthi Panthula, Senior Product Manager, and Shan Anwar, Principal Software Engineer, both of Intuit, about how they use self-serve chaos engineering tools to control the blast radius of failures, how game day tests and drills keep their systems resilient, and how their investment in open-source software powers their program. Episode notes: Sometimes old practices work in new environments. The Intuit team uses Failure Mode Effect Analysis (FMEA), a procedure developed by the US military in 1949, to ensure that their developers understand possible points of failure before code makes it to production. The team uses Litmus Chaos to inject failures into their Kubernetes-based system and power their chaos engineering efforts. It's open source and maintained by Intuit and others. If you've been following this series, you'd know that Intuit is a big fan of open-source software. Special shout out to Argo Workflows, which makes their compute-intensive Kubernetes jobs run much more smoothly. Connect on LinkedIn with Deepthi Panthula and Zeeshan (Shan) Anwar.If you want to see what Stack Overflow users are saying about chaos engineering, check out Chaos engineering best practice, asked by User NingLee two years ago.
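FMEA itself is simple enough to sketch: each failure mode gets a 1-to-10 rating for severity, occurrence, and detectability, and the product of the three, the risk priority number (RPN = S x O x D), ranks what to fix or chaos-test first. The failure modes below are invented for illustration, not Intuit's actual analysis.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    name: str
    severity: int    # 1 = negligible, 10 = catastrophic
    occurrence: int  # 1 = rare, 10 = near-certain
    detection: int   # 1 = caught immediately, 10 = invisible until too late

    @property
    def rpn(self) -> int:
        # Classic FMEA risk priority number.
        return self.severity * self.occurrence * self.detection


modes = [
    FailureMode("pod eviction cascades across nodes", 8, 4, 6),
    FailureMode("stale config pushed to one region", 6, 5, 7),
    FailureMode("dependency timeout without fallback", 7, 6, 3),
]

# Highest RPN first: these are the first candidates for a chaos experiment.
for m in sorted(modes, key=lambda m: m.rpn, reverse=True):
    print(f"RPN {m.rpn:4d}  {m.name}")
```

Ranking failure modes this way is one natural bridge between a 1949 paper procedure and a modern game day: the top of the list tells you which fault to inject next.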
2011 was a pivotal year for Netflix: the now hugely successful company was then in the midst of a formidable transformation, changing from a mail-based DVD rental service to the modern streaming service that it is today. It was at this crucial point in the company's history that Jason Chan, our guest in this episode, was hired by Netflix to lay the foundations for its cloud security protocols. Nate Nelson, our Sr. Producer, spoke with Jason about the decade he spent at the company, what he learned during his tenure there, and the ideas that took shape at that time, such as Chaos Engineering.Nate Nelson, our Sr. producer, spoke with Dr. Cohen about his early research into computer viruses, his work with the US army, the panicky response from the US government - and the parallels between computer viruses and mental viruses - i.e. memes.
Benjamin Wilms (@MrBWilms, co-founder/CEO of @Steadybit) talks about the importance of resilience for SREs, DevOps, and developers through chaos engineering platformsSHOW: 661CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotwCHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"SHOW SPONSORS:Datadog Synthetic Monitoring: Frontend and Backend Modern MonitoringEnsure frontend issues don't impair user experience by detecting user-facing issues with API and browser tests with a free 14 day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt. Granulate, an Intel company - Autonomous, continuous, workload optimizationgProfiler from Granulate - Production profiling, made easyCDN77 - Content Delivery Network Optimized for Video85% of users stop watching a video because of stalling and rebuffering. Rely on CDN77 to deliver a seamless online experience to your audience. Ask for a free trial with no duration or traffic limits.SHOW NOTES:Steadybit (homepage)Steadybit wants developers involved in Chaos engineering before production (TechCrunch)Topic 1 - Benjamin, give everyone a quick introduction.Topic 2 - Let's start with the concept of chaos engineering. In its simplest form, chaos engineering intentionally takes down parts of a test or production environment (typically after software has shipped) randomly, so teams, typically SREs/ops/dev, are forced to make the applications more resilient over time. It's not a matter of if systems will go down, it's a matter of when. This makes the systems better over time. Benjamin, you have a consulting background in this area that ultimately led to founding Steadybit. What were the limitations to this approach?Topic 3 - What you're talking about is a more proactive approach to downtime. I'll call this resilience engineering, and it requires a shift in mindset in an organization. How do you get developers onboard to embrace the need? Are we asking developers to share responsibility for outages with the SRE organization?Topic 4 - On the surface, the obvious benefit is reduced downtime. That can be hard to quantify in business value. Outages can be measured; a lack of outages is harder to quantify. Does this become an issue in convincing an organization to embrace this methodology?Topic 5 - When you say we are going to move chaos engineering into the CI/CD pipeline, what does that mean? Is this code that is added? Testing simulations that have to be passed? Real-time failures of databases or nodes, or simulated ones? What are the common use cases?FEEDBACK?Email: show at the cloudcast dot netTwitter: @thecloudcastnet
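One plausible answer to what "chaos engineering in the CI/CD pipeline" can mean is an ordinary test that injects a fault and fails the build if the application does not degrade gracefully. Here is a small self-contained Python sketch along those lines; it is a generic illustration, not Steadybit's product or API.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_dependency(delay_s: float) -> str:
    time.sleep(delay_s)  # injected latency fault
    return "real answer"


def call_with_timeout(delay_s: float, timeout_s: float = 0.2) -> str:
    # The resilience mechanism under test: a timeout with a fallback.
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_dependency, delay_s)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return "fallback answer"


def test_latency_fault_triggers_fallback():
    # With 2s of injected latency the caller must degrade, not hang.
    assert call_with_timeout(delay_s=2.0) == "fallback answer"


def test_healthy_path_still_works():
    assert call_with_timeout(delay_s=0.0) == "real answer"


if __name__ == "__main__":
    test_latency_fault_triggers_fallback()
    test_healthy_path_still_works()
    print("resilience checks passed")
```

Run under pytest in the pipeline, a failure here blocks the merge, which is the mindset shift discussed in Topic 5: resilience becomes a gating test, not a post-incident lesson.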
Lithuania sustains a major DDoS attack. Lessons from NotPetya. Conti's brand appears to have gone into hiding. Online extortion now tends to skip the ransomware proper. Josh Ray from Accenture on how social engineering is evolving for underground threat actors. Rick Howard looks at Chaos Engineering. US financial institutions conduct a coordinated cybersecurity exercise. For links to all of today's stories check out our CyberWire daily news briefing: https://thecyberwire.com/newsletters/daily-briefing/11/122 Selected reading. Russia's Killnet hacker group says it attacked Lithuania (Reuters) The hacker group KillNet has published an ultimatum to the Lithuanian authorities (TDPel Media) 5 years after NotPetya: Lessons learned (CSO Online) The cyber security impact of Operation Russia by Anonymous (ComputerWeekly) Conti ransomware finally shuts down data leak, negotiation sites (BleepingComputer) The Conti Enterprise: ransomware gang that published data belonging to 850 companies (Group-IB) Fake copyright infringement emails install LockBit ransomware (BleepingComputer) NCC Group Monthly Threat Pulse – May 2022 (NCC Group) We're now truly in the era of ransomware as pure extortion without the encryption (Register) Wall Street Banks Quietly Test Cyber Defenses at Treasury's Direction (Bloomberg)