Discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems
In this Google Prodcast episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics include the challenges of AI-Ops, limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing of book publication.
This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics such as the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, the use of AI for optimizing cooling plants, and more. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.
With only nine months to launch Max, Tom Leaman, VP of Site Reliability Engineering at Warner Bros. Discovery, had to move fast to keep millions of viewers streaming smoothly. Learn about their innovative approach to measuring efficiency, managing global operations, and building resilient systems at massive scale with your hosts Simon Elisha and Dr. Werner Vogels. Learn More: http://thefrugalarchitect.com/architects/tom-leaman-warner-bros-discovery.html
Jessica Theodat (Senior SRE & Security Tech Lead, Google) joins hosts Jordan Greenberg and Steve McGhee to discuss the intersection of security and site reliability engineering at Google. Jessica touches on risk management, the unique nature of security incident responses, and the shared goals between security and SRE. The crew also delves into the balance between security and SRE, acknowledging the tension and the need for collaboration between teams to achieve business goals and user trust.
In this episode of Autonomous IT, Live!, Landon Miles leads a three-part discussion focused on spring cleaning your IT systems, workflows, and personal well-being. You'll hear candid, practical insights from IT professionals tackling burnout, technical debt, and infrastructure hygiene head-on. This live show originally aired April 16, 2025.
Guests: İsmail Hakkı Tekin, Mesut Özbaş. In our 74th episode, our guest was the SRE team. We talked about the team's structure, its projects, its technology stack choices, and much more! On Trendyol Talks, we talk about our culture at Trendyol, the ways of working that grow out of that culture, and our rituals. Don't forget to follow the Trendyol Talks podcast channel!
In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.
Crusoe, the industry's first vertically integrated AI infrastructure provider, has announced its European headquarters in Dublin. With the support of the Irish government through IDA Ireland, Crusoe expects to grow its workforce in Ireland to approximately 100 people over the next three years. The company is hiring for multiple roles in Dublin across its Networking, Site Reliability Engineering, Customer Success, and Support departments. Crusoe's new European headquarters will allow the company to deepen its customer and partner relationships across the region. In December of 2023, the company announced its first data centre in Europe, located in Iceland. Powered by geothermal energy, the 100% renewable energy data centre continues to support AI workloads for customers across Europe. Chase Lochmiller, CEO and co-founder, Crusoe said: "Establishing our European Headquarters in Dublin marks another milestone in Crusoe's global strategy. Dublin's reputation as a world-class technology hub, and its exceptional talent pool, makes it the perfect location to forge stronger connections with our European customers and partners." Minister for Enterprise, Tourism and Employment Peter Burke said: "The news that Crusoe AI will establish their European HQ in Dublin, with the creation of 100 jobs, is very welcome. Cloud infrastructure plays a vital role in Ireland's digital and sustainable future by serving as both a critical enabler of AI technology and a potential catalyst for renewable energy adoption. This announcement is a testament to the skilled technology workforce and thriving AI innovation ecosystem here. "Our National AI Strategy: 'AI - Here for Good' which was refreshed in November sets out the Government's intention to harness trustworthy, person-centred AI for our collective economic and societal good. We are delighted to welcome Crusoe AI to Europe and to Ireland." 
Minister of State for Trade Promotion, Artificial Intelligence and Digital Transformation, Niamh Smyth, TD, said: "Ireland is now ranked as the fifth most advanced digital economy in the EU, and having Crusoe choose Ireland as their European Headquarters is another vote of confidence in our country. These roles will allow for exciting opportunities for many of our skilled workforce." Michael Lohan, CEO of IDA Ireland, said: "Crusoe's announcement today of plans for 100 jobs in Ireland, as part of establishing their European Headquarters in Dublin, is great news and a great vote of confidence in the skilled workforce available here. AI will be a key focus of our upcoming new strategy in IDA Ireland, so I am delighted to see companies like Crusoe join our technology ecosystem here." Open positions are listed on Crusoe's careers page.
In this insightful episode of the Jon Myer Podcast, Joe Duffy, Founder & CEO of Pulumi, discusses the evolution of infrastructure platforms and their crucial role in modern cloud computing. The conversation explores how infrastructure platforms bridge the gap between development teams and cloud infrastructure, enabling organizations to maintain security, compliance, and efficiency while empowering developers with self-service capabilities.
This episode features Javi Beltran, a Google engineering lead who created the "Telebot" theme song. With our beloved hosts, Steve McGhee and Jordan Greenberg, Beltran discusses the origins of the song, created in 2012 for Google's paging system. The song was meant to add a touch of levity to what could be a stressful situation for engineers on-call. Beltran also unveils a new, more modern remix of "Telebot" (created in collaboration with our host, Jordan Greenberg!) which will be used as the intro theme for the podcast's next season.
In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE at Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and stress the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.
This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures, highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improving overall reliability.
Small Batches will return in 2025. Until then, I recommend checking out the Complexity Lounge on YouTube, hosted by my friend Jocko Selberg.
In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human element of SRE and the importance of fostering a culture of collaboration, learning, and resilience in managing complex systems. They touch on topics such as the need for diverse perspectives and collaboration in incident response and the necessity of embracing complexity, and explore concepts such as aerodynamic stability.
In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, a highly available, distributed network load balancer (NLB) that is an integral part of the cloud architecture and manages traffic coming into a datacenter. Starting with Maglev's humble beginnings as a skunkworks effort, Cody and Trisha recount the challenges they faced and emphasize the importance of psychological safety, collaboration, and adaptability in SRE innovation.
In this episode, Adam reads haikus from The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book.
In this episode, guests Narayan Desai (Principal SRE, Google) and Pat Somaru (Senior Production Engineer, Meta) join hosts Steve McGhee and Florian Rathgeber to discuss the challenges of observability and working with profiling data. The discussion covers intriguing topics like noise reduction, workload modeling, and the need for better tools and techniques to handle high-cardinality data.
In this episode, Adam reads book four in The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book. This episode features koans from the fabled zen Master Lan-Hsi.
This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg, to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, as well as the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and advice for aspiring SREs.
Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.
In this episode, Adam reads book three in The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book. This episode features analects from the fabled zen Master Rinzai.
Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response, particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.
Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!
In this episode, Adam reads book two in The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book. This episode features folktales from the fabled zen Master Noa-Op.
Liz Fong-Jones (former Google SRE and current Field CTO at honeycomb.io) joins hosts Steve McGhee and Jordan Greenberg for a lively discussion centered around observability, its evolution from monitoring, and its role in modern software development. Tune in for more on the importance of observability as a spectrum, the evolving role of SREs, and advice to aspiring software engineers.
Ben Treynor Sloss (VP of Engineering, Google) joins hosts Steve McGhee and Dr. Jennifer Petoff (Director of Technical Infrastructure Education, Google) to share the evolution of SRE and its impact on software development, how AI and ML significantly impact SRE practices, and the future of SRE. Ben coined the term "Site Reliability Engineering" for his team of (now) 4,000 software engineers, engaged in what were traditionally operations functions. Under Ben's leadership, Google SRE wrote two best-selling books on SRE. Since then, the rest of the SaaS industry has come to adopt the SRE name, mission, and practices.
In this episode, Healfdene Goguen (Principal Engineer, Google) joins hosts Steve McGhee and Jordan Greenberg to discuss the vast amount of work to be done by SREs, and the fascinating challenges to tackle with clear real-world implications. It's a truly exciting time to be an SRE at Google!
In this episode, Adam reads book one in The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book. This episode features chronicles from the fabled zen Master Ninjei.
In this season of Google Prodcast, current and former SREs, both within and outside of Google, chat with hosts Steve McGhee and Jordan Greenberg to discuss software systems designed and built by SREs. For "episode zero", guests Amy Tobey (Live Services SRE, Netflix) and Dr. Vladyslav Ukis (Head of R&D, Siemens Healthineers, Author of "Establishing SRE Foundations") set the stage for the season with a lively discussion about what Software Engineering means to Site Reliability Engineering.
Nikolay and Michael discuss PostgreSQL emergencies: both the psychological side of incident management and some technical aspects too. Here are some links to things they mentioned: Site Reliability Engineering resources from Google (https://sre.google); GitLab Handbook SRE (https://handbook.gitlab.com/job-families/engineering/infrastructure/site-reliability-engineer); Keeping Customers Streaming: The Centralized Site Reliability Practice at Netflix (https://netflixtechblog.com/keeping-customers-streaming-the-centralized-site-reliability-practice-at-netflix-205cc37aa9fb); our monitoring checklist episode (https://postgres.fm/episodes/monitoring-checklist); Hannu Krosing's talk on Postgres TV, "Do you vacuum everyday?" (https://www.youtube.com/watch?v=JcRi8Z7rkPg); our episode on corruption (https://postgres.fm/episodes/corruption); Nikolay's episode on stopping and starting Postgres faster (https://postgres.fm/episodes/stop-and-start-postgres-faster); our episode on out of disk (https://postgres.fm/episodes/out-of-disk); the USE method by Brendan Gregg (https://www.brendangregg.com/usemethod.html); the thundering herd problem (https://en.wikipedia.org/wiki/Thundering_herd_problem). What did you like or not like? What should we discuss next time? Let us know via a YouTube comment, on social media, or by commenting on our Google doc! Postgres FM is produced by Michael Christofides, founder of pgMustard, and Nikolay Samokhvalov, founder of Postgres.ai, with special thanks to Jessie Draws for the elephant artwork.
In this episode, Adam reads the preface, foreword, and introduction to The Zen of Programming (1988) by Geoffrey James. This book is unlike any programming book you've encountered. So, let's try something new for the podcast to showcase this poignant, accurate, and funny book.
On this episode of the Six Five in the Booth, host Paul Nashawaty is joined by Kristian Dell'Orso, Flexera's Vice President, Site Reliability Engineering & Site Leader, for a conversation on becoming an SRE-driven organization, highlighting Flexera's collaboration with Nobl9. This in-depth discussion explores the transformative impact of adopting Service Level Objectives (SLOs) over traditional Service Level Agreements (SLAs), and how Flexera has shifted its approach to prioritize reliability and enhance customer experiences. Their discussion covers: the transition from SLAs to SLOs at Flexera and its impact on organizational key performance indicators (KPIs), including improvements in reliability and customer experience; the limitations of SLAs in capturing the full spectrum of service reliability and customer satisfaction, and the move towards a more proactive and accountable approach within organizations; and how adopting SLOs has led to consistency in measuring reliability across different groups in the company, fostering a culture of accountability and transparency. Learn more about how Nobl9 and Flexera articulate their strategy in the customer testimonial/webinar and the SLOConf presentation.
According to Vlad Ukis, there are a lot of enterprises around whose IT functions are organized around ITIL. What you use SRE for is something completely different. SRE is not for setting up the IT function. It is for enabling the product organization to operate online services reliably at scale.
However, the problem is that many in the industry are NOT using SRE principles but are instead handing over complex services to a more traditional IT function.
Dr. Vladyslav Ukis is well qualified to talk about reliability: at Siemens Healthineers he leads 250 people globally who deliver the company's cloud platform, which runs on Microsoft Azure.
We discussed key concepts from his book, Establishing SRE Foundations: A Step-by-Step Guide to Introducing Site Reliability Engineering in Software Delivery Organizations.
Unlike other technical books in this field, Dr. Ukis' book is aimed at technology professionals who are beginners to the reliability journey. This is different from the Site Reliability Engineering (2016) book by Google, which covers all the bells and whistles that SRE encompasses. That book requires a degree of prior knowledge and also prior experience in the field. Vlad wanted to make it more accessible:
What I did with my book is to say, "Okay, so now you've never done operations, but you now are thrown in the world of online services where you have to operate them. How do you get started?" So this is what the book is for. So for people who want to learn how to get started in the world of operating online services.
ITIL was originally developed by the UK government in the 80s to improve IT governance. It is best related to SRE through its service management and incident management components. But it is meant for managing systems that are more predictable and can be handled through strict process control.
Modern product delivery doesn't have the luxury of the bureaucratic levels of predictability that older IT services have. It requires a more engineer-oriented approach to solving problems and incidents and to providing services. So how was Vlad's experience bringing SRE into an organization that previously had run solely on the ITIL model?
Siemens Healthineers for many years operated like a traditional software development organization. In other words, they were developing on-prem software, not cloud software. The company would ship the physical software product to its hospital customers, and those hospitals would then have the software operated and supported by their IT departments. The change came about when Siemens Healthineers began to work on a new digital health platform, which would be cloud-based from the beginning. So they would no longer ship physical software on discs to customers, but would provide online services in the cloud centrally for the customers to use.
The early days were haphazard, though the software was deployed to the cloud with no major issues. Not many customers were on the cloud platform, so the team could get away with "handcrafted operating procedures".
But as traffic and service count started to rise rapidly, the Healthineers team learned that they needed a more professional approach. They began to understand that their initial approach to operations could not continue as-is.
This is when Vladyslav began to drive SRE practices in the organization. This was a sub-30-minute conversation that covered a lot of ground relevant to organizations looking to transition to product delivery of online services at scale. Have a listen. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit read.srepath.com
In this episode, Adam welcomes Dan Slimmon, an experienced Site Reliability Engineer (SRE), to discuss aspects of incident response and troubleshooting in software engineering. Dan explains his methodology for clinical troubleshooting, the importance of maintaining a common mental model, and techniques for leading effective incident response efforts. They also delve into the value of continuous ops reviews and ongoing mental model updates to prevent issues, emphasizing the need for structured processes and effective communication.
In this episode of Small Batches, host Adam Hawkins welcomes Alex Nesbitt, a strategy expert and member of the Flow Collective, to delve into the nuances of strategic thinking. The discussion covers different types of strategies, pro-tips on strategic thinking, and how strategy relates to the concept of flight levels. Nesbitt shares insights from his extensive consulting career, touching on topics like identifying leverage points, the relationship between strategy and tactics, and why being strategic is often more critical than having a strategy. The episode also stresses the importance of having a clear vision, enabling organizational constraints, and the roles of resilience and maintenance in strategic planning. Alex mentions practical examples, resources, and tips to help software and business leaders enhance their strategic approach.
Adam discusses strategy in preparation for the next episode.
Adam discusses three (new-ish) ideas from time on a new gemba.
Sebastian and I scoured Chapter 5 of the Site Reliability Engineering (2016) book to find nuggets of wisdom on how to reduce toil. We hit the jackpot with concepts like:
* what toil is, according to a 5-point set of criteria
* why even care about toil?
* where you can find toil in your software system
* Google's goal for how much work (%) should be toil
* the fact that toil isn't always all that bad
Don't have time to listen to what we learned or added to the concepts? Check out the takeaways toward the end of this email.
But first... Before we jump into the takeaways, here's a new segment I'm trying out for newsletters. I'll highlight a new reliability tool that I think could help you. Do you struggle to visualize your Kubernetes workloads? In that case, have you heard of kube-ops-view? It helps you visualize your complex K8s clusters and everything inside them. For a deeper rundown, visit the LinkedIn post I made about kube-ops-view, which shares a few more details.
Back to our original programming... Here are key takeaways from our chat:
* Define and Identify Toil: Regularly evaluate your tasks. Identify work that is manual, repetitive, and potentially automatable. Recognize it as toil and prioritize its reduction.
* Prioritize Automation: Look for repetitive tasks in your workflow and automate them using tools and scripts to reduce manual interventions and increase efficiency.
* Embrace the Role of an SRE: Realize that the role of an SRE is to improve system reliability proactively. Focus on long-term improvements rather than just responding to immediate issues.
* Address Common Sources of Toil: Identify frequent sources of toil like context switching, on-call duties, and release processes. Implement solutions to automate and streamline these areas.
* Adopt a Toil Elimination Mindset: Cultivate a mindset focused on eliminating toil. Regularly discuss and explore automation opportunities with your team to improve processes.
* Develop a Culture of Continuous Improvement: Encourage a culture that values reducing manual, repetitive work. Advocate for proactive problem-solving and continuous process enhancement within teams.
Until next time, happy toil hunting!
Adam describes using Hexagonal Architecture, also known as Ports and Adapters, for software delivery excellence.
Adam welcomes Steve Pereira and Andrew Davis to discuss their new book, Flow Engineering. They discuss the book's origin story and the use of cybernetics to drive effective action.
This episode continues our coverage of Chapter 4 of the Site Reliability Engineering book (2016). In this second part, we take a deeper dive into the mechanics of SLOs. Here are 5 takeaways from the show:
* Start Small with SLOs: Begin with a limited number of SLOs and iteratively refine them based on experience and feedback. Avoid overwhelming teams with too many objectives at once.
* Defend and Enforce SLOs: Ensure that selected SLOs have real consequences attached to them. If conversations about priorities cannot be influenced by SLOs, reconsider their relevance and enforceability.
* Continuous Improvement: Embrace the idea that SLOs are not static targets but evolve over time. Start with loose targets and refine them as you learn more about the system's behavior. Commit to ongoing maintenance and improvement of SLOs for long-term success.
* Effective Communication Skills: Recognize the importance of effective communication, especially for technology professionals. Develop the ability to translate technical concepts into plain language that stakeholders can understand and appreciate.
* Understanding User Needs: Prioritize understanding and aligning with the expectations of users/customers when defining service level objectives (SLOs) and metrics. User feedback should guide the selection of meaningful SLOs.
In this episode Bob and Randy invite Dan Salinas and Sarv Shah from Nobl9 to dive deep into the complexities of Site Reliability Engineering (SRE) and Service Level Objectives (SLOs). Discover the origins of SRE, the significance of SLOs in improving customer experience, and the impact of digital reliability on businesses today. From the challenges of maintaining microservices to the advent of cloud dependency, this episode is packed with insights on ensuring operational excellence in our digital world.
In this first part of a 2-part coverage, Sebastian Vietz and I work out how to meet SLAs through SLOs and SLIs. This episode covers Chapter 4 of the Site Reliability Engineering book (2016). Here are 7 takeaways from the show:
* Involve Technical Stakeholders Early: Ensure that technical stakeholders, such as SREs, are involved in discussions about SLAs and SLOs from the beginning. Their expertise can help ensure that objectives are feasible and aligned with the technical capabilities of the service.
* Differentiate Between SLAs and SLOs: Understand the distinction between SLAs, which are legal contracts, and SLOs, which are based on customer expectations. Avoid using SLAs as a substitute for meaningful service level objectives.
* Prioritize Meaningful Metrics: Focus on a select few service level indicators (SLIs) that truly reflect what users want from the system. Avoid the temptation to monitor everything and instead choose indicators that provide valuable insights into service performance.
* Align with Customer Expectations: Start by understanding and prioritizing the expectations of your customers. Use their feedback to define service level objectives (SLOs) that align with their needs and preferences.
* Avoid Alert Fatigue: Be mindful of the number of metrics being monitored and the associated alerts. Too many indicators can lead to alert fatigue and make it difficult to prioritize and respond to issues effectively. Focus on a few key indicators that matter most.
* Start Top-Down with SLIs: Take a top-down approach to defining SLIs, starting with customer expectations and working downwards. This ensures that the selected metrics are meaningful and relevant to users' needs.
* Prepare for Deep Dives: Anticipate the need for deeper exploration of specific topics, such as SLOs, and allocate time and resources to thoroughly understand and implement them in your work.
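For readers who want to see the SLI/SLO/SLA distinction from these takeaways in concrete terms, here is a minimal Python sketch. It is not taken from the episode or the book: the `RequestWindow` type, the function names, and the 99.9% and 99.5% targets are all hypothetical, chosen only to show a single user-facing indicator being checked against a stricter internal objective and a looser contractual threshold.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one measurement window (hypothetical shape)."""
    total: int  # all requests served in the window
    good: int   # requests that met the user-facing success criterion


def availability_sli(window: RequestWindow) -> float:
    """SLI: fraction of requests that were good, between 0.0 and 1.0."""
    if window.total == 0:
        # Assumption for this sketch: a window with no traffic counts as fully good.
        return 1.0
    return window.good / window.total


def meets_target(window: RequestWindow, target: float) -> bool:
    """True when the measured SLI is at or above the given target."""
    return availability_sli(window) >= target


# Hypothetical targets: the internal SLO is stricter than the contractual SLA,
# so a missed SLO is a warning that arrives before the contract is at risk.
SLO_TARGET = 0.999
SLA_TARGET = 0.995

window = RequestWindow(total=100_000, good=99_850)
print(f"SLI: {availability_sli(window):.4f}")
print("meets SLO:", meets_target(window, SLO_TARGET))
print("meets SLA:", meets_target(window, SLA_TARGET))
```

With these made-up numbers the window misses the internal objective while still satisfying the contractual one, which is the situation where an SLO is meant to prompt a conversation about priorities.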
Adam presents the mental model behind T1 and T2 signals, a necessary lexicon for understanding production operations.
Adam answers a listener's request for advice on succeeding in high-level company or project environments with seven tips.
Bailey Diveley and Jeannie Evans sit down with Rob Stringer to talk about Site Reliability Engineering and his unique path to get here. If you are someone who likes to learn how things work under the hood, you're going to want to have a listen. If you or someone you know is code curious, we encourage you to attend a Turing Try Coding Event. You can register for a Try Coding class at turing.edu/try-coding.
Adam presents TDD as skill zero, the one that unlocks all the others.
Adam presents a catch-all episode on ops reviews, visual management, calls to action, and SLOs.
Dave Mangot returns to discuss his new book "DevOps Patterns for Private Equity". Don't let the title fool you: this is the best introduction to DevOps and many related software delivery topics. Buy this book for anyone in leadership.