Podcasts about site reliability engineer

Discipline that incorporates aspects of software engineering and applies them to infrastructure and operations problems

92 PODCASTS
153 EPISODES
36m AVG DURATION
INFREQUENT EPISODES
Jan 12, 2026 LATEST

POPULARITY (chart by year, 2019 to 2026)

Best podcasts about site reliability engineer

Google SRE Prodcast

22 episodes with site reliability engineer

Packet Pushers - Full Podcast Feed

3 episodes with site reliability engineer

Packet Pushers - Fat Pipe

3 episodes with site reliability engineer

ReliabilityRadio

6 episodes with site reliability engineer

Screaming in the Cloud

2 episodes with site reliability engineer

The Cloudcast

2 episodes with site reliability engineer

linkmeup. Подкаст про IT и про людей

3 episodes with site reliability engineer

S.R.E.path Podcast

3 episodes with site reliability engineer

PurePerformance

2 episodes with site reliability engineer

Latest podcast episodes about site reliability engineer

Reliability Through Planning with Matthew Gill

The PowerShell Podcast

Jan 12, 2026 · 62:56

Matthew Gill joins The PowerShell Podcast to talk about what it means to be a Site Reliability Engineer (SRE) and how SRE thinking changes the way you approach automation, reliability, and problem solving. Matthew and host Andrew Pla break down core concepts like SLAs, SLOs, and SLIs, and why reliability through planning matters more than rushing straight to the keyboard. They also dig into why PSFramework is worth the dependency for enterprise-grade logging and configuration, how community mentorship (including Fred Weinmann's impact) can fast-track growth, and why books like The Phoenix Project are game-changing for understanding DevOps culture and constraints.

Key Takeaways:
• SRE is software engineering applied to operations: focus on measurable reliability, proper planning, and balancing change with stability using concepts like SLAs, SLOs, and SLIs.
• PSFramework can eliminate “reinventing the wheel”, especially for logging and configuration handling, giving enterprises proven patterns and integrations without custom-built fragility.
• Community is a career multiplier: mentorship, learning in public, and teaching others are some of the fastest ways to build confidence and advance your PowerShell journey.

Guest Bio: Matthew Gill is a Site Reliability Engineer and the Co-Director of Content for the PowerShell + DevOps Global Summit. He has been a problem solver, systems administrator, and scripter for nearly 20 years. Across the United States Marine Corps, education, radio, and now the private sector, most of Matt's experience has focused on solving problems in a variety of interesting and creative ways.

Resource Links:
• PowerShell + DevOps Global Summit: https://powershellsummit.org
• The Phoenix Project (Book): https://itrevolution.com/product/the-phoenix-project/
• The Unicorn Project (Book): https://itrevolution.com/product/the-unicorn-project/
• PSFramework: https://github.com/PowershellFrameworkCollective/psframework
• Matthew Gill's Blog: https://therealgill.com
• Andrew's Links: https://andrewpla.tech/links
• PDQ Discord: https://discord.gg/PDQ
• PowerShell Wednesdays: https://www.youtube.com/results?search_query=PowerShell+Wednesdays
• The PowerShell Podcast on YouTube: https://youtu.be/vkOLsjsPvYo

Reliability Radio EP 332: The Smart Digital Reality, Peter Bynarowicz – Hexagon

ReliabilityRadio

Aug 25, 2025 · 12:46

Join Jonathan Guiney and Brendon Russ on Reliability Radio as they speak with Peter from Hexagon, a company focused on Asset Information Management (AIM). The discussion begins with a look at a fundamental challenge facing every industry: the "silver tsunami" and the loss of critical knowledge. Peter explains how Hexagon's solution bridges the siloed gap between engineering and maintenance by creating a "smart digital reality"—a digital twin of a facility. This virtual map is created using drones, scanners, and cutting-edge technology, transforming old, paper-based drawings into an explorable, video-game-like environment. Peter outlines the immense benefits of this approach: it significantly improves technician safety, boosts efficiency by providing instant access to asset information, and helps attract and retain young talent who grew up in a digital world. The conversation concludes with a look to the future, where AI will become a standard feature, offering autonomous task suggestions before humans even have to ask.

The One with Ben Good and Our Kubernetes Friends

Google SRE Prodcast

Jul 30, 2025 · 32:19

In this special episode, hosts Steve McGhee from the Google SRE Prodcast and Kaslin Fields from the Google Kubernetes Podcast welcome Google Cloud Solutions Architect Ben Good to discuss platform engineering. Listeners can look forward to hearing about the role of Kubernetes as a tool for building platforms, how to create "golden paths" for developers, and the importance of observability and self-service in platform design. The conversation also touches on industry trends, the bespoke nature of platforms, and how DORA metrics can be applied to platform engineering practices.

The One With AI Agents, Ramón Llamas, and Swapnil Haria

Google SRE Prodcast

Jul 23, 2025 · 42:07

Google Staff SRE Ramón Llamas and Google Software Engineer Swapnil Haria join our hosts to explore how AI agents are revolutionizing production management, from summarizing alerts and finding hidden errors to proactively preventing outages. Learn about the challenges of evaluating non-deterministic systems and the fascinating interplay between human expertise and emerging AI capabilities in ensuring robust and reliable infrastructure.

The One with Technical Program Managers and Karanveer Anand

Google SRE Prodcast

Jul 16, 2025 · 27:48

This episode features Google Technical Program Manager (TPM) Karanveer Anand, who joins our hosts to discuss the unique role of TPMs in Site Reliability Engineering (SRE). The conversation highlights how SRE TPMs bridge the gap between technical details and business impact, managing complex projects with inter-team dependencies and ensuring system reliability, particularly in the rapidly evolving AI landscape.

The One with STPA, Jeffrey Snover, and Theo Klein

Google SRE Prodcast

Jul 2, 2025 · 37:18

This episode discusses Systems Theoretic Process Analysis (STPA), a method for analyzing complex systems. Theo Klein, a Google SRE, and Jeffrey Snover, a Distinguished Engineer at Google, explain that STPA focuses on identifying how system accidents and losses occur due to a loss of control, rather than component failures. STPA helps identify design flaws early, even before code is written! The discussion highlights that STPA is a human-driven process, prompting critical questions about system goals and potential losses, and that Google is adapting the pure STPA approach for commercial software development to make it more practical and efficient.

The One with Startups and Adam Fletcher

Google SRE Prodcast

Jun 25, 2025 · 41:15

In this episode, hosts Steve McGhee and Matt Siegler are joined by guest Adam Fletcher, CEO and Co-Founder of MarketStreet. They discuss the current state of web development with LLMs, managing technical debt in startups, the evolution of infrastructure and reliability engineering, the role of community in technology, and the future of software engineering with AI.

The One with SLOs and Sal Furino

Google SRE Prodcast

Jun 18, 2025 · 43:55

In this episode, Sal Furino, Customer Reliability Engineer at Bloomberg, discusses all things Service Level Objectives (SLOs) with hosts Steve McGhee and Matt Siegler. Together, they dig into what successful SLOs look like, how they relate to users, and how SLOs provide an effective framework for joint decisions about system reliability across product, engineering, and leadership teams.

The One With the Future of SRE and Matt Zelesko

Google SRE Prodcast

Jun 11, 2025 · 26:34

Matt Zelesko, the head of Site Reliability Engineering at Google, discusses the evolution of SRE, highlighting the shift from traditional operations to a model that balances velocity and reliability to better serve the rapid advancements in AI and ML. He emphasizes that SRE's core mission is to enable partners to move quickly while meeting reliability goals, and that the sheer scale of Google's infrastructure necessitates the SRE model for cross-system problem-solving. Zelesko envisions AI as a crucial assistant for SREs, improving incident detection, mitigation, and postmortem processes, and allowing SREs to focus on more complex engineering challenges and risk management earlier in the development cycle, while still valuing the hands-on experience of operating production infrastructure.

The One with AI and Todd Underwood

Google SRE Prodcast

Jun 4, 2025 · 43:23

In this Google Prodcast episode, Todd Underwood, a reliability expert from Anthropic with experience at Google and OpenAI, joins the hosts to discuss the current state and future of AI and ML in production, particularly for SREs. Topics discussed include the challenges of AI-Ops, limitations of current anomaly detection, the potential for AI in config authoring and troubleshooting, trade-offs between product velocity and reliability, the evolving role of SREs in an AI-driven world, and the optimal timing for publishing a book.

The One With Data Centers and Peter Pellerzi

Google SRE Prodcast

May 28, 2025 · 36:28

This episode features guest Peter Pellerzi (Distinguished Engineer, Google). Peter and the hosts, Matt Siegler and Steve McGhee, focus on the physical infrastructure side of SRE, discussing topics such as the scale of Google's data centers, handling incidents like power outages, testing and preparedness strategies, the use of AI for optimizing cooling plants, and more. Peter also emphasizes the importance of community support, proactive planning, and learning from real-world testing and incidents to ensure high availability and resilience in data center operations.

We're back with Season 4!

Google SRE Prodcast

Apr 16, 2025 · 15:03

In this "bumpisode", hosts and producers of Prodcast (including our new co-host, Matt Siegler!) reflect on the previous season and introduce the new season's focus on upcoming trends in Site Reliability Engineering (SRE) and AI, and the friends we make along the way. They also introduce new elements we are bringing in with Season 4, such as a video format and a feedback form.

A Cup of Reliability: A Discussion with an SRE - Quentin Joly

La tangente

Apr 16, 2025 · 98:35

In this episode, we explore the world of Quentin Joly, SRE at Lucca, author of the blog Une tasse de café, and passionate streamer. We talk about:
• The job of a Site Reliability Engineer: its joys, its struggles, and its responsibilities.
• The differences between SRE and DevOps, with real reflection from the field.
• Public service versus the private sector, VAE (validation of prior experience), career changes, and salaries in tech.
• Ergonomics, split keyboards, neovim, tmux, and all the small details that change a developer's life.
• And above all, knowledge sharing, blogging, teaching, and a passion for tech without condescension.

Link to Quentin's conference talk: https://youtu.be/TbQ-rT__CY0?list=PLl0xIhYGSdm94h5lcrybZAsdGAfrpJx4y

Generalist or Specialist: A Deep Dive into a US-Based Site Reliability Engineer's Approach to Career (Shuhei)

London Tech Talk

Mar 15, 2025 · 70:11

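We invited Shuhei as our guest. Nearly a year and a half has passed since Shuhei moved to the United States. In the first half, we asked mainly about recent life in America, parenting, a temporary trip back to Japan, and working across time zones at a global company. Shuhei has been enjoying an outdoor lifestyle, with camping in the summer and skiing in the winter, and the conversation got lively on how to enjoy skiing with children.

From the topic of sports education for children, Shuhei introduced "Range", a book Shuhei has been reading recently. Starting from the differences between golf and tennis, and between chess and poker, we discussed whether to keep honing a specialty from an early age or to build a broad range of skills. From there, we also asked about Shuhei's own thinking on specialization and generalism in a career. The conversation expanded to the importance of self-awareness of your own strengths, acting with visibility in mind, and how to sharpen your skills in light of recent advances in AI technology.

• Range: Why Generalists Triumph in a Specialized World
• Thinking in Bets: Making Smarter Decisions When You Don't Have All the Facts
• OpenAI, Introducing deep research

We welcome your comments and feedback through this Google Form.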

Evolving, Adapting, and Staying Prepared with Brian Weber

Screaming in the Cloud

Jan 28, 2025 · 35:24

Ever wondered how Corey got to where he is today? You have Brian Weber to partially thank for that. On this episode of Screaming in the Cloud, Corey catches up with his old friend and mentor to talk about the ever-evolving world of tech. Brian's been around the block a time or two, having done significant stints at Pinterest, Facebook, and Twitter (during the Elon acquisition, no less)! As Corey and Brian catch up, you'll hear them chat about the importance of empathy, coaching the next generation of tech workers, and their conspiracies surrounding Google and Kubernetes. So grab your tinfoil hats, it's time to go Screaming!

Show Highlights
(0:00) Intro
(0:53) The Duckbill Group sponsor read
(1:27) When Brian took Corey under his wing
(3:21) Brian's experience coming to the cloud as an engineer
(7:24) Why it's important to reinvent yourself in tech
(8:54) How Brian reacted to the industry adopting Kubernetes over Mesos Marathon
(10:31) Kubernetes conspiracy theories
(12:30) The importance of empathy in tech
(15:46) Trying to advise younger generations entering tech
(19:19) The Duckbill Group sponsor read
(20:02) Working at Twitter when jobs started getting cut and the site frequently went down
(22:41) The best way to navigate certification expiration
(26:08) Talking about "The Golden Path"
(28:52) Why you should always plan ahead in tech (and life)
(34:21) Where you can find more from Brian

About Brian Weber
Brian is a former FedRAMP DevOps Engineer for Coralogix. He's also been a Site Reliability Engineer at Twitter, Pinterest, and Facebook, where he has maintained large installations on-premises, building reliability, security, and developer efficiency. In his spare time, Brian skis, knits, cycles, bakes, and tries to spend as much time outdoors as possible.

Links
Brian's LinkedIn: https://www.linkedin.com/in/brian-weber-2423b55/

Sponsor
The Duckbill Group: duckbillgroup.com

Safety vs Security with Thomas Depierre

Open Source Security Podcast

Jan 13, 2025 · 21:23

In this episode of Open Source Security, Josh welcomes Thomas Depierre, a Site Reliability Engineer and open source maintainer, to discuss the intersection of safety and security. Thomas explains why safety is broader than security. While security often views people as the problem, Thomas explains that people are paradoxically the solution. Nothing should work, but it does, mostly due to people keeping things working. The accompanying blog post can be found at https://opensourcesecurity.io/2025/01-safety_vs_security_with_thomas_depierre/

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

Google SRE Prodcast

Dec 11, 2024 · 36:10

In this episode of the Prodcast, guests Dominic Hutton (Staff SRE, HashiCorp) and Niccolo' Cascarano (Senior Staff SRE at Google) join hosts Steve McGhee and Jordan Greenberg to dive into configurations. They discuss the differences between imperative and declarative configuration, explore the benefits and challenges of each approach, and the need for careful consideration when choosing between the two. Ultimately, the goal is to achieve reliable and maintainable systems through effective configuration management.

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Google SRE Prodcast

Dec 4, 2024 · 41:18

This episode features Casey Rosenthal (Founder, Cirrusly.ai) and John Allspaw (Founder and Principal, Adaptive Capacity Labs), joining our hosts Steve McGhee and Jordan Greenberg. Together they discuss how resilience appears in Software Engineering and SRE and explore the importance of understanding the human factors involved in adapting to system failures—highlighting the need for a more qualitative and holistic approach to understanding how engineers successfully adapt to system behavior and improving overall reliability.

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

Google SRE Prodcast

Nov 20, 2024 · 33:59

In this episode of the Prodcast, we are joined by guests Christina Schulman (Staff SRE, Google) and Dr. Laura Maguire (Principal Engineer, Trace Cognitive Engineering). They emphasize the human element of SRE and the importance of fostering a culture of collaboration, learning, and resilience in managing complex systems. They touch upon topics such as the need for diverse perspectives and collaboration in incident response and the necessity of embracing complexity, and they explore concepts such as aerodynamic stability, and more.

Maglev: load balancing at Google with Cody Smith and Trisha Weir

Google SRE Prodcast

Nov 13, 2024 · 32:53

In this episode, Cody Smith (CTO and Co-founder, Camus Energy) and Trisha Weir (SRE Department Lead, Google) join hosts Steve McGhee and Jordan Greenberg to discuss their experience developing Maglev, a highly available and distributed network load balancer (NLB) that is an integral part of the cloud architecture and manages the traffic coming into a data center. Starting with Maglev's humble beginnings as a skunkworks effort, Cody and Trisha recount the challenges they faced and emphasize the importance of psychological safety, collaboration, and adaptability in SRE innovation.

Profiling data with Pat Somaru and Narayan Desai

Google SRE Prodcast

Oct 30, 2024 · 42:22

In this episode, guests Narayan Desai (Principal SRE, Google) and Pat Somaru (Senior Production Engineer, Meta) join hosts Steve McGhee and Florian Rathgeber to discuss the challenges of observability and working with profiling data. The discussion covers intriguing topics like noise reduction, workload modeling, and the need for better tools and techniques to handle high-cardinality data.

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

Google SRE Prodcast

Oct 23, 2024 · 32:07

This episode features Google engineers Wilmer van der Gaast (Production on-tall) and Andy Sykes (Senior Staff Systems Engineer, SRE), joining hosts Steve McGhee and Jordan Greenberg, to discuss the development and maintenance of Google Public DNS (8.8.8.8). They highlight the initial motivations for creating the service, technical challenges like cache poisoning and load balancing, as well as the collaborative effort between SRE and SWE teams to address these issues. They also reflect on the evolving nature of SRE and advice for aspiring SREs.

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

Google SRE Prodcast

Oct 16, 2024 · 33:40

Guests Jordan Chernev (Senior Technology Executive) and Scott Bowers (SRE, Gearbox Software), who hail from the retail and gaming industries, respectively, join hosts Steve McGhee and Jordan Greenberg to discuss the unique challenges of Site Reliability Engineering in their industries. They share the importance of aligning SLOs with user experience, strategies for handling spikes in traffic, communicating with users during outages, and investing in reliability.

Incident Response with Sarah Butt and Vrai Stacey

Google SRE Prodcast

Oct 9, 2024 · 43:53

Sarah Butt (Principal Engineer, Centralized Incident Response, Salesforce) and Vrai Stacey (Staff Software Engineer, Google) join hosts Steve McGhee and Jordan Greenberg to dive into incident response—particularly tooling and software for reliability incidents. Tune in for an in-depth discussion on topics such as the importance of communication and collaboration during incidents, and the role of tooling in supporting incident response processes. Sarah and Vrai also share personal takeaways from incidents they have experienced.

Building Reliable Systems with Silvia Botros and Niall Murphy

Google SRE Prodcast

Oct 2, 2024 · 42:06

Silvia Botros (SRE Architect, Twilio | Author of "High Performance MySQL, 4th edition") and Niall Murphy (Co-founder & CEO, Stanza) join hosts Steve McGhee and Jordan Greenberg to discuss cultural shifts in database engineering, rate limiting, load shedding, holistic approaches to reliability, proactive measures to build customer trust, and much more!

Creating Systems that are Safe with Liz Fong-Jones

Google SRE Prodcast

Sep 25, 2024 · 28:40

Liz Fong-Jones (former Google SRE and current Field CTO at honeycomb.io) joins hosts Steve McGhee and Jordan Greenberg for a lively discussion centered around observability, its evolution from monitoring, and its role in modern software development. Tune in for more on the importance of observability as a spectrum, the evolving role of SREs, and advice to aspiring software engineers.

Prometheus 3.0 Unveiled: PromCon Highlights with Julius Volz - OpenObservability Talks S5E04

OpenObservability Talks

Sep 12, 2024 · 66:09

PromCon, the flagship yearly event of the Prometheus community, is back in Berlin, and we're here to bring you the highlights from the Prometheus ecosystem. And this year we've got some major news: Prometheus's long-awaited major release, v3.0! Join us to hear all about the revamped user interface, about Remote Write 2.0, and about Prometheus' goal to become the default backend for storing OpenTelemetry metrics, featuring native OTel support, and much more.

Our guest is none other than Julius Volz, creator of Prometheus and founder of the PromCon conference. Julius created the Prometheus monitoring system at SoundCloud and led the project through open source and beyond. He now focuses on growing the Prometheus community, and helps companies use and adapt Prometheus through his company PromLabs. Before that, Julius was a Site Reliability Engineer at Google, where he gained experience monitoring at hyperscale.

The episode was live-streamed on 4 September 2024 and the video is available at www.youtube.com/watch?v=iPUCU-78RD4
Check out the episode recap: https://medium.com/p/1c5edca32c87/

OpenObservability Talks episodes are released monthly, on the last Thursday of each month, and are available for listening on your favorite podcast app and on YouTube. We live-stream the episodes on Twitch and YouTube Live. Tune in to see us live, and chime in with your comments and questions on the live chat.
https://www.youtube.com/@openobservabilitytalks
https://www.twitch.tv/openobservability

Show Notes:
00:00 - episode and guest intro
01:56 - Prometheus origins
07:23 - Kubernetes synergy
09:34 - Origins of PromCon and this year's event
11:44 - The idea for Prometheus 3.0
13:26 - new UI for Prometheus
20:42 - Beyond Prometheus UI into the broader UI/UX vision
23:07 - OpenTelemetry support and compatibility
37:26 - Native histograms
43:14 - Remote Write 2.0
46:53 - New governance model
48:49 - OpenMetrics is archived, merged into Prometheus
53:34 - Perses joins the CNCF sandbox
57:15 - The landscape of long-term storage for Prometheus
59:13 - Updates in Thanos project
01:00:34 - the growth of Prometheus-semi-compatible solutions
01:04:09 - Kubernetes 1.31 is released

Resources:
PromCon recap: https://medium.com/p/1c5edca32c87/
PromCon: https://promcon.io/2024-berlin/
Prometheus now supports OpenTelemetry: https://horovits.medium.com/83f85878e46a
OpenMetrics archived, merged into Prometheus: https://horovits.medium.com/d555598d2d04
Prometheus 3.0-Beta release: https://github.com/prometheus/prometheus/releases/tag/v3.0.0-beta.0
Prometheus 3.0-Beta release blog: https://prometheus.io/blog/2024/09/11/prometheus-3-beta/
Perses project introduction: https://horovits.medium.com/f05b5324d7da
Last roundup of Prometheus updates: https://horovits.medium.com/fbede9b5cc9
Last PromCon (2023) recap: https://logz.io/blog/promcon-prometheus-ecosystem-updates/

Socials:
Twitter: https://twitter.com/OpenObserv
YouTube: https://www.youtube.com/@openobservabilitytalks

Dotan Horovits
Twitter: @horovits
LinkedIn: www.linkedin.com/in/horovits
Mastodon: @horovits@fosstodon

Julius Volz
Twitter: https://twitter.com/juliusvolz
LinkedIn: https://www.linkedin.com/in/julius-volz/
Mastodon: https://chaos.social/@juliusvolz

Reliability Radio EP 310: Jeff Smith, The State of Condition Monitoring

ReliabilityRadio

Sep 6, 2024 · 11:32

Have you heard about the Lubrication Leader Badge? What about the new IoT Leader Badge? Listen in on this podcast as we have a deep discussion with Jeff Smith of Reliabilityweb on the state of condition monitoring, fault cause detection, and the learning opportunities available to the Reliability and Asset Management community on Reliabilityweb's own Workshop Study System (WSS).

Reliability Radio EP 309: Tim Rice, the concept of Defect Elimination

ReliabilityRadio

Aug 22, 2024 · 12:21

Defect Elimination. It's more than just an element in the Work Execution Management Domain. It represents one of the goals of reliability: to improve the value of an organization. We sit down for a chat with Tim Rice, the author of “Defect Elimination: Left to Right, Right to Left”, and discuss the concept of Defect Elimination, its benefits, challenges, and much more. A can't-miss episode for those looking to make truly sustainable improvements.

Reliability Radio EP 308: Jack Poley, CMI

ReliabilityRadio

Aug 15, 2024 · 10:06

In the world of condition monitoring, we've come a long way in getting our results better, faster, and more accurate. The advancement of technology has allowed all of us to self-perform just about every facet of condition monitoring and, more importantly, to leverage the results together to make decisions. However, there is one technology that most practitioners still don't perform themselves and that is often disconnected from the rest: fluid analysis. Listen in as we talk with Jack Poley about the state of fluid analysis in industry today, why it's important to bring it into the fold with other technologies, and a clear vision of its future.

Reliability Radio EP 307: Russ Parish, ReliabilityWeb

ReliabilityRadio

Aug 1, 2024 · 13:51

Performing self-assessments to gauge where you have been is an important step in determining areas of strength and identifying needs for improvement. Whether you are just starting your reliability journey, restarting a program, or just getting a grasp of the current status, assessments are crucial for awareness. Additionally, aligning your programs to the Uptime® Elements ties them to a common understanding of reliability standards. Were you aware that there are two assessments that gauge an organization's maturity using the Uptime Elements? We talk with Russ Parish of ReliabilityWeb as we go over the need for reliability assessments, the Asset Analytix RAM-GPS assessment, and the new Uptime Elements-GPS assessment.

Reliability Radio EP 306: David Lockhart, Kaiser Permanente

ReliabilityRadio

Jul 26, 2024 · 11:22

When you go to a health care provider, hospital, or outpatient center, the reliability of the equipment should be the furthest thing from your mind. When the potential effects of a failure include the risk of bodily harm, reliability isn't an option, it's an imperative. We speak with David Lockhart from Kaiser Permanente on the history, current state, and future of reliability in the health care sector.

Upskilling In the Caribbean

Digital Oil and Gas

Play Episode Listen Later Jul 5, 2024 36:29

Upskilling of oil and gas professionals in the digital era is a topic everywhere, including the Caribbean island economies. Upskilling means the acquisition of new knowledge and skills, and in digital, this means topics such as cloud computing, artificial intelligence, app development, and sensor technology. It's tough enough to acquire new capabilities and stay current in big metropolitan centers. Imagine the challenges in smaller oil and gas producing economies, such as those found in the Caribbean. Online and distance learning help, but that's not enough. The formal education sector has to play its role, as does industry. But how does this happen? In this podcast, I'm in conversation with Hamlyn Holder, a sessional lecturer with the University of the West Indies and an employee of the oil and gas industry, based in Trinidad and Tobago, one such oil-producing nation in the Caribbean. Hamlyn has over 20 years of both upstream and downstream engineering experience, with nearly a decade of this time dedicated to serving Methanex Trinidad Ltd. Methanex Corporation is the world's largest producer and supplier of methanol to major international markets in North and South America, Europe, and Asia Pacific. Methanol is a clear liquid chemical used in thousands of everyday products, including plastics, paints, and cosmetics, and is a clean-burning, cost-effective alternative fuel. As the Site Reliability Engineer, he ensures the optimal functionality and performance of critical plant assets and is committed to continuous improvement and innovation in the asset management of their methanol plants and air separation units. He currently serves as a Lecturer at the University of the West Indies, Engineering Faculty, and also at the Caribbean Institute for Quality, training the Caribbean workforce in ASQ courses such as Six Sigma Black Belt, Quality Management, and Reliability Engineering. 
He holds a Master's in Engineering and Asset Management from the University of the West Indies, is CAMA Certified, and co-founded Cube Root Farms, a company that helps local farmers, schools, and communities adopt modern, smart, and sustainable agricultural techniques. He is also well versed in developing, enhancing, and launching enterprise management software and is a member of many industry bodies such as API, ASQ, PMI, IEEE, ASME, and APETT.

university europe master north engineering caribbean south america api trinidad lecturer asia pacific tobago asset management lectures west indies pmi upskilling ieee quality management methanol six sigma black belt asq asme site reliability engineer reliability engineering engineering faculty caribbean institute

IBM, VMware, and Dedication with Alexandra McCoy

Ardan Labs Podcast

Play Episode Listen Later Apr 24, 2024 110:28

Alexandra McCoy is currently a Site Reliability Engineer at VMware. With four years of experience in the field and six years within open-source and cloud environments, Alexandra shares insightful anecdotes and valuable expertise. From leading large-scale software projects to mastering Kubernetes and Docker for container orchestration, her story is filled with innovation and dedication. In this episode, Alexandra takes us on a journey through her time in the tech industry while sharing valuable insight and entertaining stories along the way.
00:00 Introduction
03:00 What is Alexandra Doing Today?
07:44 First Memories of a Computer
13:30 High School Journey
30:30 Entering University
35:20 Juggling Sports and University
40:50 Working for Nike
46:40 Working in Criminal Justice
1:00:45 IBM Enters the Radar
1:22:00 Getting a Masters in I.T
1:28:00 Moving to VMware
1:49:30 Contact Info
Connect with Alexandra:
Twitter: https://twitter.com/AddisMama17
Linkedin: https://www.linkedin.com/in/alexandramccoy17/
Mentioned in today's episode:
VMware: https://www.vmware.com/
IBM: https://www.ibm.com/us-en
Want more from Ardan Labs? You can learn Go, Kubernetes, Docker & more through our video training, live events, or through our blog!
Online Courses: https://ardanlabs.com/education/
Live Events: https://www.ardanlabs.com/live-training-events/
Blog: https://www.ardanlabs.com/blog
Github: https://github.com/ardanlabs

#98 - Service Levels 101 feat. Alex Ewerlöf - Sr Staff Engineer @ Volvo Cars & SRE Thought Leader

alphalist.CTO Podcast - For CTOs and Technical Leaders

Play Episode Listen Later Apr 5, 2024 53:41

Embrace the Site Reliability Mindset with Alex Ewerlöf, Sr. Staff Engineer @ Volvo Cars

team embrace engineers sr thought leaders devops sla sre slo sli on call volvo cars staff engineer centralised slos site reliability engineer read alex slis platform team service levels service level agreement ewerl

#27 - Growing as a Site Reliability Engineer (Part 3)

S.R.E.path Podcast

Play Episode Listen Later Feb 13, 2024 16:29

Third and final instalment of the Growing as an SRE series, covering practical ideas for planning your career progression. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers sre site reliability engineer

#26 - Growing as a Site Reliability Engineer (Part 2)

S.R.E.path Podcast

Play Episode Listen Later Feb 8, 2024 19:06

In part 1, we covered the first truth - that you don't grow in your career merely through tenure. That was a simple one. Let's explore 2 more truths that are somewhat trickier... Background music credit: Luna by KaizanBlue. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers site reliability engineer

#24 - Growing as a Site Reliability Engineer (Part 1)

S.R.E.path Podcast

Play Episode Listen Later Jan 23, 2024 8:42

How can you grow as an SRE? You've probably thought about your career progression at some point. Ash put together his initial thoughts on this topic. Listen on to learn how he unpacks the first idea of "You don't get promotions with tenure". This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit srepath.substack.com

engineers ash sre site reliability engineer

Life of An SRE with Dave Reisner

Google SRE Prodcast

Play Episode Listen Later Oct 17, 2023 29:44

Dave Reisner talks about his path to Staff SRE, from ArchLinux contributor through DevOps to software engineer. This episode emphasizes the value of strong mentoring and manager relationships, and the challenges of work-life balance.

engineering devops sre site reliability engineer

Life of An SRE with Jessica Theodat

Google SRE Prodcast

Play Episode Listen Later Oct 3, 2023 25:45

Explore the role and responsibilities of a Senior SRE with Jessica Theodat, as she discusses life-work balance, the value of mentoring, and being a Black woman in SRE.

black explore engineering sre site reliability engineer

Episode 28 - Navigating a Manful Career & Life w/ Chuks C-Madu

IncrediPaul® Leadership

Play Episode Listen Later Jul 10, 2023 50:50

In this episode, I talk with Chuks about his work with Manful Health, an organization that he founded to provide coaching for men in a variety of areas. He is dedicated to removing the stigma of seeking help through therapy and coaching and encouraging better mental health in men. He works full time as a Site Reliability Engineer at MyFitnessPal, and we talk about how he crafted his career to get to where he is now. We discuss the foundation he developed in college that made it easier to transition and continue growing his career. He stresses the importance of owning your career and seeking out things you're passionate about. Watch this episode to learn more about mental health, coaching, career process, and how to effectively build the habits that lead to the career of your dreams. Connect with Chuks on LinkedIn: https://www.linkedin.com/in/ccmadu/ Learn more about Manful Health and join the waitlist: www.manfulhealth.com Follow him on Instagram: @Chuks.gram Learn more about IncrediPaul and schedule your free coaching session on my linktree (www.linktr.ee/incredipaul) or website, www.incredipaul.org/coaching. Follow me on TikTok, Twitter or Instagram @imincredipaul. --- Send in a voice message: https://podcasters.spotify.com/pod/show/incredipaul/message

tiktok navigating career myfitnesspal madu chuks site reliability engineer

S23:E6 - From Site Reliability Engineer to Principal Software Engineer (Alice Goldfuss)

CodeNewbie

Play Episode Listen Later Mar 22, 2023 42:25

Today, Saron talks with Alice Goldfuss, Principal Software Engineer and Systems Programmer specializing in building resilient distributed systems at scale. Alice has delivered industry-impacting talks on container platforms, infrastructure operations, and organizational best practices, and has written about the SRE field, kernel crashes, and personal security. We hear about her coding journey and learn all about her take on various programs and the tech world as a whole. Show Links: Compiler (sponsor), Porkbun (sponsor), How to Get Into SRE, Rust, CSS, Notepad++, Inline CSS, HTML4

software engineers sre principal software engineer saron site reliability engineer

Hedge 148: The SRE with Niall Murphy (part 2)

The Hedge

Play Episode Listen Later Sep 22, 2022 31:53

It seems like only yesterday we started talking about the Site Reliability Engineer, and their place in the IT ecosystem. Over the last several years, the role of the SRE has changed—and it's bound to continue changing. On this episode of the Hedge, Niall Murphy joins Tom Ammon and Russ White to discuss the changing role of the SRE, and what the SRE could be.

hedge sre site reliability engineer niall murphy russ white

Hedge 147: The SRE with Niall Murphy (part 1)

The Hedge

Play Episode Listen Later Sep 14, 2022 24:39

hedge sre site reliability engineer niall murphy russ white

Conversations #57: DevOps and Site Reliability Engineering

Women Who Code Radio

Play Episode Listen Later Aug 31, 2022 29:00

Deepali Chouhan, Network Director for Women Who Code in Vancouver, Canada, interviews Pooneh Mokariasl, Site Reliability Engineer at Criteo in Paris, France. They discuss the differences and similarities of Pooneh's engineering roles. They also discuss Criteo's Voyageurs program and the unique opportunity to explore working with other teams within the company.

conversations technology france tech coding women in tech devops voyageurs vancouver canada criteo women who code site reliability engineering site reliability engineer network director

Emily Rossetti: Site Reliability Engineer; One of the FAANG

The Mark Howley Show

Play Episode Listen Later Jul 20, 2022 34:22

The Mark Howley Show welcomes Emily Rossetti, a software engineer, specifically a site reliability engineer, working for one of the top computer industry companies in the world. Originally from Hong Kong, she moved to the States to pursue her passion and career. Rossetti shares her experience as a woman in the tech industry, as well as balancing work and motherhood. She hopes to encourage other women interested in the tech industry to pursue their goals. Emily's episode is told from a unique perspective that is both inspiring and endearing.

hong kong engineers faang rossetti site reliability engineer

Ep. 64 Bringing New Applications to Federal Financial Oversight

Feds At The Edge by FedInsider

Play Episode Listen Later Jul 4, 2022 61:43

Professional sports broadcasters frequently use the phrase, “taking it to the next level.” Well, when it comes to improving application development in the federal world, taking it to the next level can involve some new concepts presented in this discussion. This is an interview with three professionals who have worked on many federal projects. They provide the listener with guidelines for making the transition with minimal expense and maintaining federal security standards. The discussion opens with a contrast between the traditional method of developing software and the way it is done today. Chris Moran from GDIT estimates 90% of systems generated today are comprised of third-party applications. In other words, it is assembled rather than coded line for line. This new method allows for flexibility and rapid development. Two other methods for deploying software were introduced in this discussion. When multiple applications are deployed over multiple clouds, a person dedicated to reliability must be included in the team, usually referred to as a “Site Reliability Engineer.” This person is tasked with maintenance, patching, and increasing automation for those responsibilities.

professional federal site reliability engineer new applications financial oversight chris moran gdit

TikTok and Short Form Content for Developers with Linda Vivah

Screaming in the Cloud

Play Episode Listen Later Jun 28, 2022 34:01

Full Description / Show Notes
Corey and Linda talk about TikTok and the online developer community (1:18)
Linda talks about what prompted her to want to work at AWS (5:29)
Linda discusses navigating the change from just being part of the developer community to being an employee of AWS (10:37)
Linda talks about moving AWS more in the direction of short form content, and Corey and Linda talk about the TikTok algorithm (15:56)
Linda talks about the potential struggle of going from short form to long form content (25:21)
About Linda
Linda Vivah is a Site Reliability Engineer for a major media organization in NYC, a tech content creator, an AWS community builder member, a part-time wedding singer, and the founder of a STEM jewelry shop called Coding Crystals. At the time of this recording she was about to join AWS in her current position as a Developer Advocate. Linda had an untraditional journey into tech. She was a Philosophy major in college and began her career in journalism. In 2015, she quit her tv job to attend The Flatiron School, a full stack web development immersive program in NYC. She worked as a full-stack developer building web applications for 5 years before shifting into SRE to work on the cloud end internally. Throughout the years, she's created tech content on platforms like TikTok & Instagram and believes that sometimes the best way to learn is to teach.
Links Referenced:
lindavivah.com: https://lindavivah.com
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate. 
Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it's more than just hipster monitoring.
Corey: Let's face it, on-call firefighting at 2am is stressful! So there's good news and there's bad news. The bad news is that you probably can't prevent incidents from happening, but the good news is that incident.io makes incidents less stressful and a lot more valuable. incident.io is a Slack-native incident management platform that allows you to automate incident processes, focus on fixing the issues and learn from incident insights to improve site reliability and fix your vulnerabilities. Try incident.io, recover faster and sleep more.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. We talk a lot about how people go about getting into this ridiculous industry of ours, and I've talked a little bit about how I go about finding interesting and varied guests to show up and help me indulge my ongoing love affair on this show with the sound of my own voice. Today, we're going to be able to address both of those because today I'm speaking to Linda Haviv, who, as of this recording, has accepted a job as a Developer Advocate at AWS, but has not started. Linda, welcome to the show.
Linda: Thank you so much for having me, Corey. 
Happy to be here.
Corey: So, you and I have been talking for a while and there's been a lot of interesting things I learned along the way. You were one of the first people I encountered when I joined the TikToks, as all the kids do these days, and was trying to figure out is there a community of folks who use AWS. Which really boils down to, “So, where are these people that are sad all the time?” Well, it turns out, they're on TikTok, so there we go. We found my people. And that was great. And we started talking, and it turns out that we were both in the AWS community builder program. And we've developed a bit of a rapport. We talk about different things. And then, I guess, weird stuff started happening, in the context of you were—you're doing very well at building an audience for yourself on TikTok. I tried it, and it was—my sense of humor sometimes works, sometimes doesn't. I've had challenges in finding any reasonable way to monetize it because a 30-second video doesn't really give nuance for a full ad read, for example. And you've been looking at it from the perspective of a content creator looking to build the audience slash platform is step one, and then, eh, step two, you'll sort of figure out aspects of monetization later. Which, honestly, is a way easier way to do it in hindsight, but, yeah, the things that we learn. Now, that you're going to AWS, first, you planning to still be on the TikToks and whatnot?
Linda: Absolutely. So, I really look at TikTok as a funnel. I don't think it's the main place, you're going to get that deep-dive content but I think it's a great way, especially for things that excite you or get you into understanding it, especially beginner-type audience, I think there's a lot of untapped market of people looking to into tech, or technologists that aren't in the cloud. 
I mean, even when I worked—I worked as a web developer and then kind of learned more about the cloud, and I started out as a front-end developer and shifted into, like, SRE and infrastructure, so even for people within tech, you can have a huge tech community which there is on TikTok, with a younger community—but not all of them really understand the cloud necessarily, depending on their job function. So, I think it's a great way to kind of expose people to that. For me, my exposure came from community. I met somebody at a meetup who was working in cloud, and it wasn't even on the job that I really started getting into cloud because many times in corporations, you might be working on a specific team and you're not really encountering other ends, and it seems kind of like a mystery. Although it shouldn't seem like magic, many times when you're doing certain job functions—especially the DevOps—could end up feeling like magic. So, [laugh] for the good and the bad. So sometimes, if you're not working on that end, you really sometimes take it for granted. And so, for me, I actually—meetups were the way I got exposed to that end. And then I brought it back into my work and shifted internally and did certifications and started, even, lunch-and-learns where I work to get more people in their learning journey together within the company, and you know, help us as we're migrating to the cloud, as we're building on the cloud. Which, of course, we have many more roles down the road. I did it for a few years and saw the shift. But I worked at a media company for many years and now shifting to AWS, and so I've seen that happen on different ends. Not—oh, I wasn't the one doing the migration because I was on the other end of that time, but now for the last two years, I was working on [laugh] the infrastructure end, and so it's really fascinating. And many people actually—until now I feel like—that will work on maybe the web and mobile and don't always know as much about the cloud. 
I think it's a great way to funnel things in a quick manner. I think also society is getting used to short videos, and our attention span is very low, and I think for—
Corey: No argument here.
Linda: —[crosstalk 00:04:39] spending so mu—yeah, and we're spending so much time on these platforms, we might as well, you know, learn something. And I think it depends what content. Some things work well, some things doesn't. As with anything content creation, you kind of have to do trial and error, but I do find the audience to be a bit different on TikTok versus Twitter versus Instagram versus YouTube. Which is interesting how it's going to play out on YouTube, too, which is a whole ‘nother topic conversation.
Corey: Well, it's odd to me watching your path. It's almost the exact opposite of mine where I started off on the back-end, grumpy sysadmin world and, “Oh, why would I ever need to learn JavaScript?” “Well, genius, because as the world progresses, guess what? That's right. The entire world becomes JavaScript. Welcome.” And it took me a long time to come around to that. You started with the front-end world and then basically approached from the exact opposite end. Let's be clear, back in my day, mine was the common path. These days, yours is very much the common path.
Linda: Yeah.
Corey: I also want to highlight that all of those transitions and careers that you spoke about, you were at the same company for nine years, which in tech is closer to 30. So, I have to ask, what was it that inspired you, after nine years, to decide, “I'm going to go work somewhere else. But not just anywhere; I'm going to AWS.” Because normally people don't almost institutionalized lifers past a certain point.
Linda: [laugh].
Corey: Like, “Oh, you'll be there till you retire or die.” Whereas seeing significant career change after that long in one place, even if you've moved around internally and experienced a lot of different roles, is not common at all. What sparked that?
Linda: Yeah. 
Yeah, no, it's such a good question. I always think about that, too, especially as I was reflecting because I'm, you know, in the midst of this transition, and I've gotten a lot of reflecting over the last two weeks [laugh], or more. But I think the main thing for me is, I always, wherever I was—and this kind of something that—I'm very proactive when it comes to trying to transition. I think, even when I was—right, I held many roles in the same company; I used to work in TV production and actually left for three months to go to a coding boot camp and then came back on the other end, but I understood the product in a different way. So, for that time period, it was really interesting to work on the other end. But, you know, as I kind of—every time I wanted to progress further, I always made a move that was actually new and put me in an uncomfortable place, even within the same company. And I'm at the point now that I'm in my career, I felt like this next step really needs to be, you know, at AWS. It's not, like, the natural progression for me. I worked alongside—on the client end—with AWS and have seen so many projects come through and how much our own workloads have changed. And it's just been an incredible journey, also dealing with accounts team. On that end, I've worked alongside them, so for me, it was kind of a natural progression. I was very passionate about cloud computing at AWS and I kind of wanted to take it to that next place, and I felt like—also, dealing with the community as part of my job is a dream part to me because I was always doing that on the side on social media. So, it wasn't part of my day-to-day job. I was working as an SRE and an infrastructure engineer, so I didn't get to do that as part of my day-to-day. I was making videos at 2 a.m. and, you know, kind of trying to, like, do—you know, interact with the community like that. And I think—I come from a performing background, the people background, I was singing since I was four years old. 
I always go to—I was a wedding singer, so I go into a room and I love making people happy or giving value. And I think, like, education has a huge part of that. And in a way, like making that content and—
Corey: You got to get people's attention—
Linda: Yeah.
Corey: —you can't teach them a damn thing.
Linda: Right. Exactly. So, it's kind of a mix of everything. It's like that performance, the love of learning. You know, between you and I, like, I wanted to be a lawyer before I thought I was going to—before I went to tech. I thought I was going to be a lawyer purely because I loved the concept of going to law school. I never took time to think about the law part, like, being the lawyer part. I always thought, “Oh, school.” I'm a student at heart. I always call myself a professional student. I really think that's part of what you need to be in this world, in this tech industry, and I think for me, that's what keeps my fire going. I love to experiment, to learn, to build. And there's something very fulfilling about building products. If you take a step back, like, you're kind of—you know, for me that part, every time I look back at that, that always is what kind of keeps me going. When I was doing front-end, it felt a lot more like I was doing smaller things than when I was doing infrastructure, so I felt like that was another reason why I shifted. I love doing the front-end, but I felt like I was spending two days on an Internet Explorer bug and it just drove me—[laugh] it just made it feel unfulfilling versus spending two days on, you know, trying to understand why, you know, something doesn't run the infrastructure or, like, there's—you know, it's failing blindly, you know? Stuff like that. Like, I don't know, for me that felt more fulfilling because the problem was more macro. But I think I needed both. I have a love for both, but I definitely prefer being back-end. So. [laugh]. 
Well, I'm saying that now but—[laugh].
Corey: This might be a weakness on my part where I'm basically projecting onto others, and this is—I might be completely wrong on this, but I tend to take a bit of a bifurcated view of community. I mean, community is part of the reason that I know the things I know and how I got to this place that I am, so use that as a cautionary tale if you want. But when I talk to someone like you at this moment, where you're in the community, I'm in the community, and I'm talking to you about a problem I'm having and we're working on ways to potentially solve that or how to think about that. I view us as basically commiserating on these things, whereas as soon as you start on day one—and yes, it's always day one—at AWS and this becomes your day job and you work there, on some level, for me, there's a bit shift that happens and a switch gets flipped in my head where, oh, you actually work at this company. That means you're the problem. And I'm not saying that in a way of being antagonistic. Please, if you're watching or listening to this, do not antagonize the developer advocates. They have a very hard job understanding all this so they can explain that to the rest of us. But how do you wind up planning to navigate, or I guess your views on, I guess, handling the shift between, “One of the customers like the rest of us,” to, as I say, “Part of the problem,” for lack of a better term.
Linda: Or, like, work because you kind of get the—you know. I love this question and it's something I've been pondering a lot on because I think the messaging will need to be a little different [coming from me 00:10:44] in the sense of, there needs to be—just in anything, you have to kind of create trust. And to create trust, you have to be vulnerable and authentic. 
And I think I, for example, utilize a lot of things outside of just the AWS cloud topic to do that now, even, when I—you know, kind of building it without saying where I work or anything like that, going into this role and it being my job, it's going to be different kind of challenge as far as the messaging, but I think it still holds true that part, that just developing trust and authenticity, I might have to do more of that, you know? I might have to really share more of that part, share other things to really—because it's more like people come, it doesn't matter how much somet—how many times you explain it, many times, they will see your title and they will judge you for it, and they don't know what happened before. Every TikTok, for example, you have to act like it's a new person watching. There is no series, you know? Like, yes, there's a series but, like, sometimes you can make that but it's not really the way TikTok functions or a short-form video functions. So, you kind of have to think this is my first time—
Corey: It works really terribly when you're trying to break it out that way on TikTok.
Linda: [laugh]. Yeah.
Corey: Right. Here's part 17 of my 80-TikTok-video saga. And it's, “Could you just turn this into a blog post or put this on YouTube or something? I don't have four hours to spend learning how all this stuff works in your world.”
Linda: Yeah. And you know, I think repeating certain things, too, is really important. So, they say you have to repeat something eight times for people to see it or [laugh] something like that. I learned that in media [crosstalk 00:12:13]—
Corey: In a row, or—yeah. 
[laugh].
Linda: I mean, the truth is that when you, kind of like, do a TikTok maybe, like, there's something you could also say or clarify because I think there's going to be—and I'm going to have to—there's going to be a lot of trial and error for me; I don't know if I have answers—but my plan is going into it very much testing that kind of introduction, or, like, clarifying what that role is. Because the truth is, the role is advocating on behalf of the community and really helping that community, so making sure that—you don't have to say it as far as a definition maybe, but, like, making sure that comes across when you create a video. And I think that's going to be really important for me, and more important than the prior even creating content going forward. So, I think that's one thing that I definitely feel like is key. As well as creating more raw interaction. So, it depends on the platform, too. Instagram, for example, is much more community—how do I put this? Instagram is much more easy to navigate as far as reaching the same community because you have something, like, called Instagram Stories, right? So, on Instagram Stories, you're bringing those stories, mostly the same people that follow you. You're able to build that trust through those stories. On TikTok, they just released Stories. I haven't really tried them much and I don't play with it a lot, but I think that's something I will utilize because those are the people that are already follow you, meaning they have seen a piece of content. So, I think addressing it differently and knowing who's watching what and trying to kind of put yourself in their shoes when you're trying to, you know, teach something, it's important for you to have that trust with them. And I think—key to everything—being raw and authentic. I think people see through that. I would hope they do. And I think, uh, [laugh] that's what I'm going to be trying to do. 
I'm just going to be really myself and real, and try to help people and I hope that comes through because that's—I'm passionate about getting more people into the cloud and getting them educated. And I feel like it's something that could also allow you to build anything, just from anywhere on your computer, brings people together, the world is getting smaller, really. And just being able to meet people through that and there's just a way to also change your life. And people really could change their life. I changed my life, I think, going into tech and I'm in the United States and I, you know—I'm in New York, you know, but I feel like so many people in the States and outside of the States, you know, all over the world, you know, have access to this, and it's powerful to be able to build something and contribute and be a part of the future of technology, which AWS is.
Corey: I feel like, in three years or whatever it is that you leave AWS in the far future, we're going to basically pull this video up and MST3k came together. It's like, “Remember how naive you were talking about these things?” And I'm mostly kidding, but let's be serious. You are presumably going to be focusing on the idea of short-form content. That is—
Linda: Yeah.
Corey: What your bread-and-butter of audience-building has been around, and that is something that is new for AWS.
Linda: Yeah.
Corey: And I'm always curious as to how companies and their cultures continue to evolve. I can only imagine there's a lot of support structure in place for that. I personally remember giving a talk at an AWS event and I had my slides reviewed by their legal team, as they always do, and I had a slide that they were looking at very closely where I was listing out the top five AWS services that are bullshit. 
And they don't really have a framework for that, so instead, they did their typical thing of, “Okay, we need to make sure that each of those services starts with the appropriate AWS or Amazon naming convention and are they capitalized properly?” Because they have a framework for working on those things. I'm really curious as to how the AWS culture and way of bringing messaging to where people are is going to be forced to evolve now that they, like it or not, are going to be having significantly increased presence on TikTok and other short-form platforms. Linda: I mean, it's really going to be interesting to see how this plays out. There's so much content that's put out, but sometimes it's just not reaching the right audience, so making sure that funnel exists to the right people is important and reaching those audiences. So, I think even YouTube Shorts, for example. Many people in tech use YouTube to search a question. They do not care about the intro, sometimes. It depends what kind of following, it depends if [in gaming 00:16:30], but if you're coming and you're building something, it's like a Stack Overflow sometimes. You want to know the answer to your question. Now, YouTube Shorts is a great solution to that because many times people want the shortest possible answer. Now, of course, if it's a tutorial on how to build something, and it warrants ten minutes, that's great. Even ten minutes is considered, now, Shorts because TikTok now has ten-minute videos, but I think TikTok is now searchable in the way YouTube is, and I think let's say YouTube Shorts is short-form, but very different type of short-form than TikTok is. TikTok, hooks matter. YouTube answers to your questions, especially in chat. I wouldn't say everything in YouTube is like that; depends on the niche. But I think even within short-form, there's going to be a different strategy regarding that. So, kind of like having that mix. I guess, depending on platform and audience, that's there.
Again, trial and error, but we'll see how this plays out and how this will evolve. Corey: This episode is sponsored in part by our friends at Vultr. Optimized cloud compute plans have landed at Vultr to deliver lightning-fast processing power, courtesy of third-gen AMD EPYC processors without the IO or hardware limitations of a traditional multi-tenant cloud server. Starting at just 28 bucks a month, users can deploy general-purpose, CPU, memory, or storage optimized cloud instances in more than 20 locations across five continents. Without looking, I know that once again, Antarctica has gotten the short end of the stick. Launch your Vultr optimized compute instance in 60 seconds or less on your choice of included operating systems, or bring your own. It's time to ditch convoluted and unpredictable giant tech company billing practices and say goodbye to noisy neighbors and egregious egress forever. Vultr delivers the power of the cloud with none of the bloat. Screaming in the Cloud listeners can try Vultr for free today with $150 in credit when they visit getvultr.com/screaming. That's G-E-T-V-U-L-T-R dot com slash screaming. My thanks to them for sponsoring this ridiculous podcast. Corey: I feel like there are two possible outcomes here. One is that AWS— Linda: Yeah. Corey: Nails this pivot into short-form content, and the other is that all your TikTok videos start becoming ten minutes long, which they now support, welcome to my TED Talk. It's awful, and then you wind up basically being video equivalent for all of your content, of recipes when you search them on the internet where first they circle the point to death 18 times with, “Back when I was a small child growing up in the hinterlands, we wound—my grandmother would always make the following stew after she killed the bison with her bare hands. Why did grandma kill a bison?
We don't know.” And it just leads down this path so they can get, like, long enough content or they can have longer and longer articles to display more ads. And then finally at the end, it's like ingredient one: butter. Ingredient two, there is no ingredient two. Okay. That explains why it's delicious. Awesome. But I don't like having people prolong it. It's just, give me the answer I'm looking for. Linda: Yeah. Corey: Get to the point. Tell me the story. And— Linda: And this is— Corey: —I'm really hoping that is not the direction your content goes in. Which I don't think it would, but that is the horrifying thing and if for some chance I'm right, I will look like Nostradamus when we do that MST3k episode. Linda: No, no. I mean, I really am—I always personally—even when I was creating content these last few years and testing different things, I'm really a fan of the shortest way possible because I don't have the patience to watch long videos. And maybe it's because I'm a New Yorker that can't sit down for the life of me—apart from when I code of course—but, you know, I don't like wasting time, I'm always on the go, I'm with my coffee, I'm like—that's the kind of style I prefer to bring in videos in the sense of, like, people have no time. [laugh]. You know? The amount of content we're consuming is just, uh, bonkers. So, I don't think our mind is really built for consuming [laugh] this much content every time you open your phone, or every time you look, you know, online. It's definitely something that is challenging in a whole different way. But I think where my content—if it's ten minutes, it better be because I can't shorten it. That's my thing. So, you can hold me accountable to that because— Corey: Yeah, I want ten minutes of— Linda: I'm not a— Corey: Content, not three minutes of content in a ten-minute bag. Linda: Exactly. Exactly.
So, if it's a ten-minute video, it would have been in one hour that I cut down, like, meaning a tutorial, a very much technical types of content. I think things that are that long, especially in tech, would be something like, on that end—unless, of course, you know, I'm not talking about, like, longer videos on YouTube which are panels or that kind of thing. I'm talking more like if I'm doing something on TikTok specifically. TikTok also cares about your watch time, so if people aren't interested in it, it's not going to do well, it doesn't matter how many followers you have. Which is what I do like about the way TikTok functions as opposed to, let's say, Instagram. Instagram is more like it gives it to your following—and this is the current state, I don't know if it always evolves—but the current state is, Instagram Reels kind of functions in a way where it goes first to the people that follow you, but, like, in a way that's more amplified than TikTok. TikTok tests people that follow you, but if it's not a good video, it won't do well. And honestly, there are many good videos that don't go viral. I'm not talking about that. Sometimes it's also the topic and the niche and the sound and the title. I mean, there's so many people who take a topic and do it in three different ways and one of them goes viral. I mean, there's so many factors that play into it and it's hard to really, like, always, you know, kind of reverse engineer but I do think that with TikTok, things won't do well, more likely if it's not a good piece of content as opposed to—or, like, too long, right? Not—I shouldn't say not a good piece of content—it's too long. Corey: The TikTok algorithm is inscrutable to me. TikTok is firmly convinced, based upon what it shows me, that I am apparently a lesbian. Which okay, fine. Awesome. Whatever. I'm also—it keeps showing me ads for ADHD stuff, and it was like, “Wow, like, how did it know that?” Followed by, “Oh, right. I'm on TikTok.
Nevermind.” And I will say at one point, it recommended someone to me who, looking at the profile picture, she's my nanny. And it's, I have a strong policy of not, you know, stalking my household employees on social media. We are not Facebook friends, we are not—in a bunch of different areas. Like, how on earth would they have figured this out? I'm filling the corkboard with conspiracy and twine followed by, “Wait a minute. We probably both connect from the same WiFi network, which looks like the same IP address and it probably doesn't require a giant data science team to put two and two together on those things.” So, it was great. I was all set to do the tinfoil hat conspiracy, but no, no, that's just very basic correlation 101. Linda: And also, this is why I don't enable contacts on TikTok. You know, how it says, “Oh, connect your contacts?” Corey: Oh, I never do that. Like, “Can we look at your contacts?” Linda: Never. Corey: “No.” “Can we look at all of your photos?” “Absolutely not.” “Can we track you across apps?” “Why would anyone say yes to this? You're going to do it anyway, but I'll say no.” Yeah. Linda: Got to give the least privilege. [laugh]. Definitely not— Corey: Oh absolutely. Linda: Yeah. I think they also help [crosstalk 00:22:40]— Corey: But when I'm looking at—the monetization problem is always a challenge on things like this, too, because when I'm—my guilty TikTok scrolling pleasures hit, it's basically late at night, I just want to see—I want something to want to wind down and decompress. And I'm not about ready to watch, “Hey, would you like to migrate your enterprise database to this other thing?” It's, I… no. There's a reason that the ads that seem to be everywhere and doing well are aimed at the mass market, they're generally impulse buys, like, “Hey, do you want to set that thing over there on fire, but you're not close enough to get the job done? Buy this flamethrower today.
Done.” And great, like, that is something everyone can enjoy, but these nuanced database products and anything else is B2B SaaS style stuff, it feels like it's a very tough sell and no one has quite cracked that nut, yet. Linda: Yeah, and I think the key there—this is, I'm guessing based on, like, what I want to try out a lot—is the hook and the way you're presenting it has to be very product-focused in the sense that it needs to be very relatable. Even if you don't know anything about tech, you need to be—like, for example, in the architecture page on AWS, there's a video about the Emirates going to Mars mission. Space is a very interesting topic, right? I think, a hook, like, “Do you want to see how, like, how this is bu—” like, it's all, like, freely available to see exactly [laugh] how this was built. Like, it might—in the right wording, of course—it might be interesting to someone who's looking for fun-fact-style content. Now, is it really addressing the people that are building everyday? Not really always, depends who's on there and the mass market there. But I feel like going on the product and the things that are mass-market, and then working backwards to the tech part of it, even if they learn something and then want to learn more, that's really where I see TikTok. I don't think every platform would be, maybe, like this, but that's where I see getting people: kind of inviting them in to learn more, but making it cool and fun. It's very important, but it feels cool and fun. [laugh]. So. Because you're right, you're scrolling at 2 a.m. who wants to start seeing that. Like, it's all about how you teach. The content is there, the content has—you know, that's my thing. It's like, the content is there. You don't need to—it's yes, there's the part where things are always evolving and you need to keep track of that; that's whole ‘nother type thing which you do very well, right? And then there's a part where, like, the content that already exists, which part is evergreen?
Meaning, which part is, like, something that could be re—also is not timely as far as update, for example, Well-Architected Framework. Yes, it evolves all the time, you always have new pillars, but the guide, the story, that is an evergreen in some sense because that guide doesn't, you know, that whole concept isn't going anywhere. So, you know, why should someone care about that? Corey: Right. How to turn on two-factor authentication for your AWS account. Linda: Right. Corey: That's evergreen. That's the sort of thing that—and this is the problem, I think, AWS has had for a long time where they're talking about new features, new enhancements, new releases. But you look what people are actually doing and so much of it is just the same stuff again and again because yeah, that is how most of the cloud works. It turns out that three-quarters of companies' production infrastructure tends to run on EC2 more frequently than it tends to run on IoT Greengrass. Imagine that. So, there's this idea of continuing to focus on these things. Now, one of my predictions is that you're going to have a lot of fun with this and on some level, it's going to really work for you. In others, it's going to be hilariously—well, its shortcomings might be predictable. I can just picture now you're at re:Invent; you have a breakout talk and terrific. And you've successfully gotten your talk down to one minute and then you're sitting there with— Linda: [laugh]. Corey: —the remainder of maybe 59. Like, oh, right. Yeah. Turns out not everything is short-form. Are you predicting any— Linda: Yep. Corey: Problems going from short-form to long-form in those instances? Linda: I think it needs to go hand-in-hand, to be honest. I think when you're creating any short-form content, you have—you know, maybe something short is actually sometimes in some ways, right, harder because you really have to make sure, especially in a technical standpoint, leaving things out is sometimes—leaves, like, a blind spot.
And so, making sure you're kind of—whatever you're educating, you kind of, to be clear, “Here's where you learn more. Here's how I'm going to answer this next question for you: go here.” Now, in a longer-form content, you would cover all that. So, there's always that longevity. I think even when I write a script, and there's many scripts I'm still [laugh] I've had many ideas until now I've been doing this still at 2 a.m. so of course, there's many that didn't, you know, get released, but those are the things that are more time consuming to create because you're taking something that's an hour-long, and trying to make sure you're pulling out the things that are most—that are hook-style, that invite people in, that are accurate, okay, that really give you—explain to you clearly where are the blind spots that I'm not explaining on this video are. So, “XYZ here is, like, the high level, but by the way, there's, like, this and this.” And in a long-form, you kind of have to know the long-form version of it to make the short-form, in some ways, depending on what—you're doing because you're funneling them to somewhere. That's my thing. Because I don't think there should be [crosstalk 00:27:36]— Corey: This is the curse of Twitter, on some level. It's, “Well, you forgot about this corner case.” “Yeah, I had 280 characters to get into.” Like, the whole point of short-form content—which I do consider Twitter to be—is a glimpse and a hook, and get people interested enough to go somewhere and learn more. For something like AWS, this makes a lot of sense. When you highlight a capability or something interesting, it's something relevant, whereas on the other side of it, where it's this, “Oh, great.
Now, here's an 8000-word blog post on how I did this thing.” Yeah, I'm going to get relatively fewer amounts of traffic through that giant thing, but the people who are there are going to be frickin' invested because that's going to be a slog. Linda: Exactly. Corey: “And now my eight-hour video on how exactly I built this thing with TypeScript.” Badly— Linda: Exactly. Corey: —as it turns out because I'm a bad programmer. Linda: [laugh]. No, you're not. I love your shit-posting. It's great. Corey: Challenge accepted. Linda: [laugh]. I love what you just mentioned because I think you're hitting the nail on the head when it comes to the quality content that's niche focus, like, there needs to be a good healthy mix. I think always doing that, like, mass-market type video, it doesn't give you, also, the credibility you need. So, doing those more niche things that might not be relevant to everybody, but here and there, are part of that is really key for your own knowledge and for, like, the com—you know, as far as, like, helping someone specific. Because it's almost like—right, when you're selling a service and you're using social media, right, not everybody's going to buy your service. It doesn't matter what business you're in right? The deep-divers are going to be the people that pay up. It's just a numbers game, right? The more people you, kind of, address from there, you'll find— Corey: It's called a funnel for a reason. Linda: Right. Exactly. Corey: Free content, paid content. Almost anyone will follow me on Twitter; fewer than will sign up for a newsletter; fewer will listen to a podcast; fewer will watch a video, and almost none of them will buy a consulting engagement. But ‘almost' and ‘actually none of them,' it turns out is a very different world. Linda: Exactly. [laugh]. So FYI, I think there's— Corey: And that's fine. That's the way it works. Linda: That's the way it works.
And I think there needs to be that niche content that might not be, like, the most viral thing, but viral doesn't mean quality, you know? It doesn't. There's many things that play into what viral is, but it's important to have the quality content for the people that need that content, and finding those people, you know, it's easier when you have that kind of mass engagement. Like, who knows? I'm a student. I told you; I'm a professional student. I'm still [laugh] learning every day. Corey: Working with AWS almost makes it a requirement. I wish you luck— Linda: Yeah. Corey: —in the new gig and I also want to thank you for taking time out of your day to speak with me about how you got to this point. And we're all very eager to see where you go from here. Linda: Thank you so much, Corey, for having me. I'm a huge fan, I love your content, I'm an avid reader of your newsletter and I am looking forward to very much being in touch and on the Twitterverse and beyond. So. [laugh]. Corey: If people want to learn more about what you're up to, and other assorted nonsense, where's the best place they can go to find you? Linda: So, the best place they could go is lindavivah.com. I have all my different social handles listed on there as well a little bit about me, and I hope to connect with you. So, definitely go to lindavivah.com. Corey: And that link will, of course, be in the [show notes 00:30:39]. Thank you so much for taking the time to speak with me. I really appreciate it. Linda: Thank you, Corey. Have a wonderful rest of the day. Corey: Linda Haviv, AWS Developer Advocate, very soon now anyway. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, smash the like and subscribe buttons, and of course, leave an angry comment that you have broken down into 40 serialized TikTok videos. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. Announcer: This has been a HumblePod production. Stay humble.

EP122 - Conversation on SREs with Brian Singer of Nobl9

Cloud Unfiltered

May 9, 2022, 40:37

We talk to Brian Singer of Nobl9 about everything SRE! The Site Reliability Engineer is becoming one of the most important roles in DevOps. Brian was formerly at Google and is an authority on the subject!


403: Mission Control with Joe Ferris

Giant Robots Smashing Into Other Giant Robots

Dec 9, 2021, 34:04

Joe Ferris is thoughtbot's CTO and Managing Director of the thoughtbot DevOps and maintenance team known as Mission Control. Mission Control is our newest team doing DevOps Support, Maintenance, and SRE (Site Reliability Engineering). The goal of Mission Control, rather than building products or pairing with team members to improve their team like the rest of thoughtbot, is to support those teams and support other client teams in deploying and scaling applications. They have an on-call team and do more complex cloud build-outs with the goal being to empower and educate the teams that we work with so that they are more capable of working in those ecosystems on their own. Follow Joe on Twitter (https://twitter.com/joeferris) or LinkedIn (https://www.linkedin.com/in/joe-ferris-81421a167/). thoughtbot's Mission Control team (https://thoughtbot.com/mission-control) Follow thoughtbot on Twitter (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/) Become a Sponsor (https://thoughtbot.com/sponsorship) of Giant Robots! Transcript: CHAD: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Chad Pytel. And with me today is Joe Ferris, thoughtbot's CTO and Managing Director of the thoughtbot DevOps and maintenance team known as Mission Control. Joe, welcome back to the show. JOE: Thanks, Chad. It's been a while. CHAD: It has been a while. I think you were the first-ever guest, if I'm not mistaken. JOE: I believe that's right. We talked about null, I think. [laughter] CHAD: Yeah. And it would have been with Ben back when I was just a listener and maybe producer. So welcome back to the show. It's been a long time, and a lot has changed at thoughtbot over the years. I've been talking to each of the managing directors of the new teams, and I wanted to be sure to have you on. 
Why don't we take a little bit of a step back and talk about Mission Control? When we say DevOps and maintenance, what do we mean? And what does Mission Control do? JOE: Sure. Mission Control is our newest team doing DevOps support, and maintenance, and SRE. It came out of our experiments with DevOps a while ago now, almost two years coming up. Historically, thoughtbot has shied away from getting too much into DevOps. I think a lot of us had some unpleasant experiences earlier in our career around sysadmin tasks and expectations there. Not a lot of people have wanted to be on call historically. So we've heavily leveraged services like Heroku that take a lot of that burden away from you and avoided doing things like direct to AWS deployments or getting too involved with CI/CD pipelines that were particularly complex. But we've had clients over the years that have requested more interesting or more difficult deployments. And finally, we had one a couple of years ago, where we said, "All right, let's just handle this instead of saying no or trying to outsource it." We thought it made sense for them. And after going through it, we came to the conclusion that it was actually pretty good that the ecosystem had evolved a lot and that it was a service worth offering. That began our journey into DevOps, so to speak. So we did some smart DevOps work for a variety of clients over the next year or so before we decided to form an official team doing this new kind of work, which is how we ended up with Mission Control. The goal of Mission Control, rather than building products or pairing with team members to improve their team like the rest of thoughtbot, the goal of Mission Control is to support those teams and support other client teams in deploying and scaling their applications. And we have an on-call team. We will do more complex cloud build-outs.
And our goal is to empower and educate the teams that we work with so that they are more capable of working in those ecosystems on their own. CHAD: You used the acronym SRE earlier in that little spiel. I'm not sure that everyone knows what that is. [laughs] So it stands for Site Reliability Engineer, right? JOE: That's right. And that's been newer for us. So DevOps is supposed to be the fusion of development and operations. But the operations world is really big. So similar to how everybody has problems getting people to be full-stack enough given the complexity of front end and back end, we have similar problems in design. We also have that problem in DevOps where both development and operations are huge, rich ecosystems. And so, having developers that are fully experienced at both is hard. So the path of least resistance, when you say you are doing DevOps, is definitely just to do operations. And it's been a struggle for us to actually break down those silos and have teams work more on the operation side on their own. So one of the things that caught our eye with SRE was some of the built-in mechanisms for engaging with the team. The one-sentence pitch for SRE is that it is operations if you approach it like a software problem. It has these concepts of SLOs, Service Level Objectives, and error budgets, which is the amount of time you spent violating your SLO. And part of the process is getting buy-in from the entire team, from the stakeholders down to the developers and the operations team. And so, it provides a natural interaction point between the operations folks and the rest of the team because nobody wants to break the error budget. Once the error budget is exhausted, everybody has to stop building new features and focus on stability until the error budget is cut up again.
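[Editor's sketch] The SLO and error-budget mechanism Joe describes can be illustrated with a little arithmetic. The following is a minimal sketch, not thoughtbot's tooling: it assumes a request-based SLO (the share of requests that must succeed over a window) rather than the time-based budget mentioned in the conversation, and the class name `ErrorBudget` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical model)."""

    slo_target: float      # agreed fraction of requests that must succeed, e.g. 0.999
    total_requests: int    # requests served so far in the window
    failed_requests: int   # requests that violated the SLO

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: 1 - target, scaled by traffic.
        return self.total_requests * (1 - self.slo_target)

    @property
    def remaining(self) -> float:
        # Failures the team can still absorb before the budget is spent.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        # Once this is True, the agreement is to pause features and fix stability.
        return self.remaining <= 0


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"remaining budget: {window.remaining:.0f} failed requests")
    print("feature work may continue" if not window.exhausted else "stop and stabilize")
```

The useful property is that `exhausted` is a single agreed-upon signal: product and operations argue about the target once, up front, instead of arguing about every release.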
So rather than having this unpleasant give or take where we're more coming from the operations side, and we're always pushing for more stability, and everybody else is coming from the product side, and they're always pushing for more features, SRE gives you this useful metric to have that conversation around where we're not always just pushing for more. We're trying to hit a specific goal that we've agreed on. And when we hit the goal, we know that we can keep full throttle moving out new features. CHAD: Now, is the SRE a developer who is also working on resolving errors before the budget is hit? JOE: Yeah, a Site Reliability Engineer is a developer. But that's actually not too different from other forms of DevOps. DevOps is supposed to be developers in general. When I say we built an operations team, even if you look at the work that we're doing, a lot of it is development work. We build scripts, and automations, and so on. We don't manually set up EC2 instances, and not everything is toil, even outside of SRE. But the idea in SRE is that somebody will be more integrated with the development team and make changes to not just the operational stack but also the development stack in service of reliability. I've heard it said that SRE is a particular implementation of DevOps. That makes sense to me. CHAD: Let's start back in the beginning because you made reference to the fact that historically, a lot of what we deploy was deployed to Heroku. And we did that because, for a lot of the applications that we're building, it made sense. It minimized the operational overhead of deployments. There is a point in some systems that you cross a line. Where do we see that line typically being where you need to start looking at something else? JOE: I think there can be a few different instigating factors. One of the fastest ways for somebody to want to move to AWS is if they have significant security concerns, particularly for healthcare applications. 
The security model is more straightforward in AWS to have better isolation. There are options on Heroku, but it requires going to a different Heroku platform using Shield. And you just don't get the same power you get in terms of network isolation models you get on AWS with your own VPC. So if you're already at the point where you want to start out with a VPC out of the gate and do that kind of isolation, my opinion is you may as well own it and go to AWS. So that's one reason. Another is if you start hitting scaling issues, Heroku is easier for the developers because it's simple and it's very streamlined. But doing complex deployments is difficult, which eliminates some of the options available to somebody doing something like SRE. So to give one example, one mechanism people can use to make it safer to deploy without potentially introducing bugs or performance degradations is a canary release where when you release, you put the new version out as the canary build. And you route maybe 5% of traffic to that, and you actually collect metrics on performance and error rates on the canary traffic versus the regular traffic. And then you have some period where you're in experiment mode, which varies depending on the level of stability you're looking to achieve. Once you're confident that the canary release didn't introduce a regression, then it gets promoted to the stable build, and you do that every time you deploy. I have no idea how you would do that on Heroku. CHAD: I think you'd have to do it at the application level. You'd have to do it with a feature flag system. And it would only be possible to do some of the things that you would be able to do if you're able to do the whole system. JOE: Right. And I guess you could do weighted random numbers to try and decide whether to canary or not. But one of the benefits of doing it outside the application is there's no way to make a mistake. 
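[Editor's sketch] The canary release described above has two parts: weighted routing of a small share of traffic to the new build, and a promotion decision based on comparing error rates. The sketch below shows both under stated assumptions. The names `BuildStats`, `pick_build`, and `should_promote`, the 5% weight, the minimum sample size, and the tolerance are all hypothetical; in a real pipeline the routing would typically live in a load balancer or service mesh rather than in application code, which is the point Joe makes about keeping it outside the application.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class BuildStats:
    """Request and error counts observed for one build during the experiment."""

    requests: int = 0
    errors: int = 0

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def pick_build(canary_weight: float = 0.05, rng: Optional[random.Random] = None) -> str:
    """Route one request: roughly `canary_weight` of traffic goes to the canary."""
    source = rng if rng is not None else random
    return "canary" if source.random() < canary_weight else "stable"


def should_promote(
    stable: BuildStats,
    canary: BuildStats,
    min_requests: int = 500,
    tolerance: float = 0.001,
) -> bool:
    """Promote only if the canary saw enough traffic and did not regress."""
    if canary.requests < min_requests:
        return False  # still in experiment mode: not enough data to judge
    return canary.error_rate <= stable.error_rate + tolerance


if __name__ == "__main__":
    stable = BuildStats(requests=10_000, errors=50)
    canary = BuildStats(requests=500, errors=2)
    print("promote canary:", should_promote(stable, canary))
```

A production comparison would usually also look at latency and use a proper statistical test; the fixed tolerance here only keeps the sketch short.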
So, for example, if you introduced a bug in your canary mechanism in the application or you forget to put it behind a feature flag, then you've now deployed to production, and you have an error. Whereas if it's managed by the CI/CD pipeline, you're just deploying a new version of the application. In Heroku land, that would mean you deploy the new slug as a canary build. In most other areas, it means you're deploying a container image. That's one example of why if you get to the point that you have a lot of traffic in production and you need to manage that traffic while continuing to release features, it can be helpful to work on a platform like AWS where you have a lot more deployment options. Another one is that SRE is heavily built on observability and metrics, which can be difficult to collect on Heroku. Some of that is just a matter of lineage. Like, the SRE community was built up around tools like Prometheus that are scrape-based. That means you need to have a special metrics endpoint exposed on all of your containers. In Heroku, there isn't a way to access any of your dynos directly except through the web router, and you can't control which one you get. So using Prometheus on Heroku is not really practical, which means you need to re-implement what everybody else has built for SRE using a different observability tool. And observability out of the box on Heroku it's easy to get set up, but it's more limited. So doing something like complex SLOs and setting up error budget dashboards and alerting is going to be a significant task. Versus on a platform like Kubernetes where it doesn't sound like it'll be easier, but it is because there are open-source tools that you can just deploy. CHAD: You mentioned Kubernetes. It's probably worth calling out that that's pretty much what we are using across the board, right? JOE: For our AWS and other cloud deployments, we have standardized largely on Kubernetes. 
We started out using simpler containerization platforms like ECS on AWS. But what we found is that the developer tooling is generally not particularly good because there's not enough community momentum behind any of those. And the open-source is limited versus something like Kubernetes there's a massive open-source community. There is a ton of different tooling that people build that's available for developers and for DevOps. And for these things like SRE, you can use almost entirely open-source software to build out all of the interesting parts of that and deploy that. So what we've been building is basically an SRE Platform as a Service where we collect these open-source components. We deploy them to a managed Kubernetes cluster. And then, applications can immediately start exposing metrics to Prometheus and defining SLOs. CHAD: So much in the same way where we talked about some of the boundaries where it starts to make sense to not be on Heroku, what are some of the boundaries that teams hit where it makes sense to start thinking about SRE or even just having someone on the team that's focused on that kind of work? JOE: I think as soon as people start hitting their first scaling challenges. So for an MVP where you're validating a product where you don't actually have production traffic yet, I don't think it makes sense. And I also think I would avoid deploying to something like Kubernetes if you can help it for an MVP. But for anybody who has scaling concerns, SRE is a very useful mindset. And the sooner you start adopting it, the sooner you'll start to build an application that's made to scale. It can be very difficult to put out those fires while something is not on a platform where you have many options, and nobody has been thinking about observability. It means that you need to be guessing at how to put out the fire as well as simultaneously introducing metrics and potentially planning a cloud migration. 
So I think as soon as you start feeling nervous about deploying to production or as soon as you notice that you're spending a lot of time working on performance, it makes sense to bring in SRE. I also think anybody that needs to provide an SLA should for sure implement SRE. It can be used to measure whether or not you're on track to hit an SLA because you basically set SLOs that are stricter than your SLA, and you make sure that you meet them. CHAD: Is there a way that existing teams can layer on some of the SRE activities without having full-time SRE people? JOE: I think you can have a team member who does development that also acts as the SRE. If you have a small team, I could see the commitment to it being daunting. I think that could be one good reason to bring in outside specialists if you're not at the point where you can afford to have a full-time SRE in-house. Working with a team that can provide an SRE on-demand like Mission Control could be valuable. CHAD: I didn't realize that that was going to be a perfect segue into part of the value proposition of Mission Control [laughter] when I asked the question. But I guess that's a really good point. That is part of what we're helping people do: monthly contracts that provide this to them, even if their team can't do it 100% of the time. JOE: Right. Except for pretty large teams, I don't think it makes sense to hire a full-time SRE. It's much easier to work with a team like ours that has the experience and has more than one person. Even if you do hire a full-time SRE, you will only have one. So if they go on vacation, or if they get sick, or if it's in the middle of the night, then do you still have an SRE? I think that's one of the benefits of working with a team. CHAD: And that's been interesting with Mission Control because we introduced Mission Control and made it a formal thing at the same time as going entirely remote.
And it's interesting how doing that freed us up in terms of being able to commit to building a different kind of team. It doesn't necessarily need to be on call after hours if we're going to have an entirely remote team. We can have people on that team that span different time zones. And so, from a thoughtbot perspective, it's interesting how those things went hand in hand for us. JOE: Yes, it's been immensely helpful for Mission Control, in particular, to be fully remote. There are a lot of options that wouldn't have been available to us if we were a U.S.-centric team. It's been really interesting. I've built out development teams before that were focused on a location. And it's been really interesting to build out this team with a focus on availability and distribution. For example, one thing that has helped us is having somebody in South America because they don't celebrate U.S. holidays. So even discounting time zones, which are a challenge when you're trying to provide around-the-clock availability, just having that kind of diversity in holiday schedules really helps. So we've been able to build it totally differently than we would have if we were trying to put a bunch of people in an office. And I think it's made it possible for us to have much better coverage with a much smaller team. Mid-roll Ad I wanted to tell you all about something I've been working on quietly for the past year or so, and that's AgencyU. AgencyU is a membership-based program where I work one-on-one with a small group of agency founders and leaders toward their business goals. We do one-on-one coaching sessions and also monthly group meetings. We start with goal setting, advice, and problem-solving based on my experiences over the last 18 years of running thoughtbot. As we progress as a group, we all get to know each other more. And many of the AgencyU members are now working on client projects together and even referring work to each other. 
Whether you're struggling to grow an agency, taking it to the next level and having growing pains, or a solo founder who just needs someone to talk to, in my 18 years of leading and growing thoughtbot, I've seen and learned from a lot of different situations, and I'd be happy to work with you. Learn more and sign up today at thoughtbot.com/agencyu. That's A-G-E-N-C-Y, the letter U. CHAD: So Mission Control I introduced it as maintenance and DevOps. So we're also helping people with different kinds of things beyond operations, right? JOE: Yeah, particularly with SRE, there's a focus on stability and scaling. And we're also helping people with CI/CD. One of the focuses for us this quarter has been helping people develop CI/CD pipelines that provide safer deploys and providing guidance and a system for developers to implement things like feature flags and beta flags. Because one of the challenges of making performance improvements is that you don't actually know if you've solved the problem until it's deployed, and deploying something that changes performance is inherently risky. And so, in addition to helping people actually make the performance improvements, we have to demonstrate the process for deploying and testing those improvements. CHAD: I've worked on fairly big systems in the past. But there have been a couple of different instances over the last maybe year where we've approached the problem in a different way than we have in the past, which has been really interesting to me from a development standpoint. It's the idea of…if you remember, for the food delivery application, we had that conversation about the different ways to build APIs rather than versioning APIs explicitly. And that has been a different approach than the way I would have done things in the past. And it's been a really powerful approach. So, can we talk a little bit more about that approach? JOE: Sure. 
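The feature flags Joe mentions for safer deploys can be sketched roughly as a percentage rollout. This is a minimal illustration only, not thoughtbot's implementation: the flag name `fast_order_query`, the function names, and the two code paths are all hypothetical.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket from 0 to 99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def flag_enabled(flag_name: str, user_id: str, percentage: int) -> bool:
    """Return True when the user's bucket falls inside the rollout percentage."""
    return rollout_bucket(flag_name, user_id) < percentage


def fetch_orders(user_id: str, rollout_percentage: int) -> str:
    # Hypothetical example: the risky performance change ships in the deploy
    # but only runs for the share of users selected by the flag.
    if flag_enabled("fast_order_query", user_id, rollout_percentage):
        return "new-query-path"
    return "old-query-path"
```

Because the bucket is derived from a hash rather than a random draw, a given user stays on the same path between requests, and setting the percentage back to 0 turns the change off without another deploy.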
CHAD: Well, specifically, so we have mobile applications that use a back-end API, and not everyone updates their mobile application at the same time instantly. You have bugs basically in the wild that you are fixing or that you're changing in your API, or if you're just introducing API changes. And so the idea of instead of explicitly versioning API on the server-side and having clients write to a specific API, instead building much more flexible APIs, in particular, having the client tell you what version of the API that they're expecting but through consolidated API endpoints so that the server is much more in control of the behavior than the client being in control of the behavior. JOE: Yeah, I think the two big changes that were helpful on that project were using GraphQL for some of the APIs, which provides more flexibility generally than a typical REST API and the minimum version requirement. So the application sends the version of the application. And the API will tell the client they have to upgrade if it's a version that isn't compatible with the newer APIs. So when we do have to break backwards compatibility, we force an app upgrade. CHAD: But in general, you're taking the approach not to break backward compatibility. And you're meeting the client where it's at whenever possible and maintaining backward compatibility in the APIs. JOE: That's something that we have been teaching developers about generally is backwards and forwards compatibility. We do that with deployments as well. For some of the larger deployments we have where there might be dozens of containers running for a service, it certainly doesn't make sense to stop them all and start new ones because the app would be down for a long time. And it would take too long to catch up to the backlog of requests. But even a typical blue-green deployment is problematic. 
So if we have 30 containers running and we spin up 30 new containers, and they all need 15 database connections, then during the deploy, you potentially overload your database or exhaust your connection limit. Plus, you will need to allocate the compute resources for double the normal workload. So what we've been doing instead is rolling deploys almost everywhere where we spin up a few new containers using the new version and wait until they're fully online, spin down a few old ones, and then repeat that process until everything is up to date. But to do a rolling deploy like that requires backwards compatibility with the services it uses, in particular, at the database. And so, writing Rails migrations that are backwards compatible for one version has been a challenge. CHAD: And there's not really good tooling in Rails to do multiple stages of things. So if you really want to do that, you have to manage that in your source control basically and say, "Here's a new migration. We're going to merge in and deploy after this one," and that's not so great. JOE: Right. The other way to do that in the CI/CD pipeline would be to release commits one at a time and wait for them to be rolled out. But depending on how you structure your commit log, that could be pretty tedious. [laughs] CHAD: Yeah. I've seen as I've worked on this other project we're really striving to do continuous deployment. It's a high traffic, very complex deployment with lots of individual configured tenants. Separating out the concept of a deploy from a release has been very valuable for the application and for the clients. It changes the way that you need to think about how development progresses. I never before really worked in a system where you're literally sometimes duplicating and preserving old code, putting new code in place, having them both deployed, and then being able to switch between them as part of the release, and then cleaning up the old code later. 
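The connection arithmetic behind Joe's blue-green versus rolling comparison can be worked through in a few lines. The 30 containers and 15 connections per container come from the conversation. The batch size of 3 is an assumption standing in for "a few", and the function names are illustrative.

```python
def blue_green_peak(containers: int, conns_per_container: int) -> int:
    """Peak connections when a full second fleet starts before the old one stops."""
    return 2 * containers * conns_per_container


def rolling_peak(containers: int, batch: int, conns_per_container: int) -> int:
    """Peak connections when containers are replaced one batch at a time."""
    old, new, peak = containers, 0, containers
    while old > 0:
        step = min(batch, old)
        new += step                      # start a few new containers and wait until online
        peak = max(peak, old + new)      # old and new overlap at this moment
        old -= step                      # then spin down the same number of old ones
    return peak * conns_per_container


steady_state = 30 * 15                   # 450 connections in normal operation
print(steady_state, blue_green_peak(30, 15), rolling_peak(30, 3, 15))
# prints: 450 900 495
```

Under these assumptions the rolling deploy peaks at 10% above steady state, while blue-green doubles it, which is the database pressure Joe describes.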
At the scale that this is at, at the complexity that this is at, it makes sense for that application. It obviously doesn't make sense for everybody to be working that way. JOE: Right. Breaking up applications to be a little smaller, having components that could be experimented with individually would make some of that easier. The experimentation there separating the release from the deploy some of that is necessary because it's monolithic in so many ways. Like, it's a very big Rails application with one database with ACID compliance, which is a very powerful model. And it provides simplicity in some ways. But then it requires you to take on the complexity of making sure that you release things correctly. I do think that it would be difficult in this particular situation but for applications that reach that level of traffic and where you need to manage the risk of deploying, having smaller components, having some services broken would make that easier because you could do, for example, a canary deploy with one release rather than duplicating the code and having the old and new version. CHAD: Right. The services create boundaries with contracts about behavior and reduces things that are tightly coupled together, and their behavior is tightly coupled together. So, for example, on this application, we do have that one service that is completely managed independently from the main monolith and has its own deploy schedule. And we can, for the most part, change them independently without needing to go through all of that process that we go through to manage change. I think you're absolutely right. JOE: Another experiment we've been trying for another client is it's another Rails monolith. There are different audiences for it. So this is the food delivery application again. And there are customers who are placing orders. There are drivers who are delivering orders. There are restaurants that are fulfilling orders. 
And then there are admins who are managing everything in the back end. And there's some overlap in the data they use. But the actual requests, and controllers, and pieces of the Rails application they use are almost entirely isolated. So one challenge we had was being able to provide different reliability contracts for those different audiences and also scaling them and configuring them differently. So, for example, if you've done tuning for a Rails application before, you've probably tweaked things like how many threads will I have for each of my Puma workers? How many Puma workers will I have per container? How many database connections do I need in the pool? And what we were able to do for this application using Kubernetes and Istio was running the same application, the same container, so like one monolithic Rails container but running it more than once in different configurations and routing traffic to different pools of containers based on the audience. And so, for example, if the customer is making requests, those all go to the customer pool of containers, which are scaled independently and have their own configuration tweaks for the kinds of requests that customers tend to make, which are generally small, high throughput requests with lots of little writes. And then, compare that to the admin panel, where they typically view dashboards and big lists of records. And so, the requests tend to be larger, but the number of users is much smaller. There are way more customers than there are admins. And so, for those, we have fewer connections. We have more memory allocated for the kind of bloat that results in those types of requests. And we also have a different performance objective for admins. It's more acceptable for those pages to respond a little bit slower. And admins understand it's their job. They have to use the software. So they'll reload the page if they have to versus a customer where if they're having trouble placing an order, they might just buy somewhere else.
So that's been a pretty powerful mechanism we were able to leverage. CHAD: Is that switching on URL-like endpoints? JOE: Yeah, it's based on the path. But the mechanisms available to us are actually pretty powerful. At that point, we have access to the full request. So we could really route based on anything we wanted right down to the user. CHAD: I guess that's a really good example. You don't have access to that routing on Heroku. JOE: No, I think any Platform as a Service where they manage the routing if they don't provide that feature, you don't get that feature. CHAD: This is the first we're talking about this. That is a really interesting example of how to scale a monolith that solves some of the problems that services often get you without having to break everything up right off the bat in order to do that. JOE: Yeah. I also think it provides kind of an inside-out approach to doing that. One of the problems with breaking out services is you have to plan what the services are going to be to a certain degree. And so, I think the best way to do it is to extract services from a monolith the same way you extract classes to break them up. And this audience-based approach is almost like a dry run. You can see if the boundaries you're drawing make sense in terms of traffic. And if those make sense, it probably makes sense to break up the front end at those boundaries eventually into different applications. And then figure out what services you need to extract to provide the common infrastructure for those front-end services. The same way test-driven development makes it much easier to find the correct tests to write, I think this approach of audience boundary discovery is an interesting approach to finding service boundaries versus trying to guess at what the services are, which very frequently leads people to wrapping services around database tables, which doesn't help at all. CHAD: Yeah, that's the wrong thing to be looking at when you're looking at how to do services.
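The audience-based routing Joe describes can be modelled in a few lines. In the setup from the conversation, the decision is made by the routing layer in front of the containers (Kubernetes and Istio), not by application code, so the Python below only illustrates the idea. Every path prefix, pool name, and tuning number is a hypothetical placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """One deployment of the same monolith image with its own tuning."""
    name: str
    puma_workers: int
    threads_per_worker: int
    db_pool_size: int
    latency_objective_ms: int


# Hypothetical values: customers get many small, fast requests, while
# admins get fewer connections and a looser latency objective.
POOLS = {
    "customer": Pool("customer", puma_workers=4, threads_per_worker=5,
                     db_pool_size=5, latency_objective_ms=300),
    "admin": Pool("admin", puma_workers=2, threads_per_worker=2,
                  db_pool_size=2, latency_objective_ms=2000),
}

# Hypothetical path prefixes; anything unmatched goes to the customer pool.
ROUTES = [("/admin", "admin")]
DEFAULT_POOL = "customer"


def pool_for_path(path: str) -> Pool:
    """Pick the container pool for a request based on its path prefix."""
    for prefix, pool_name in ROUTES:
        if path == prefix or path.startswith(prefix + "/"):
            return POOLS[pool_name]
    return POOLS[DEFAULT_POOL]
```

For example, `pool_for_path("/admin/orders")` selects the admin pool, while `pool_for_path("/orders/42")` falls through to the customer pool. Matching on the prefix plus a slash keeps a path like `/administrator` from being routed to the admin pool by accident.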
JOE: Right. It's almost like deciding what your database tables would be upfront before you've seen the UI for the application. CHAD: Cool. So heading into 2022, we're looking ahead at the upcoming year. And so what's on the docket for Mission Control? JOE: We didn't start experimenting fully with SRE until the third quarter of this year. And so far, we've loved it. So I think we'll make a pretty heavy investment into our SRE offering. The goal is for us to have an open-source set of Terraform modules that effectively deploy a platform ready to go for SRE. What we want to do is maintain and curate that platform and then deploy it and maintain it for our clients. I think another big thing we'll be doing is (This might be incredibly boring.) restructuring the way our agreements work a little bit. One of the things we wanted to test out when we built Mission Control was how much we could have built into a monthly recurring contract versus billing for time and materials like we usually do. So we tried putting a lot into that contract and really pushing the boundaries of what would be reasonable. And there was definitely a lot of pain there for us and a lot of difficult conversations with clients. So I think for 2022, we will be shifting a lot of our work back towards time and materials. So I guess that's a lesson out there for anybody else that's providing [laughs] support contracts is to make sure that the responsibilities contained in the linear amount scale linearly. CHAD: I think when we originally conceived of Mission Control, we also saw it handling a lot more things that it turns out we're just not doing as part of Mission Control like regular Rails upgrades. JOE: Yeah, a lot of the things that we included in contracts originally were not particularly important to clients or at least were not outside of what they were capable of doing already. So it wasn't that much of a value-add. There are a lot of people out there that will upgrade your Rails version.
And having somebody who just does it in the background but isn't aware of some of the impacts that might have in the application turned out to be not much of a value prop. Whereas stability turns out to be a big pain point for a lot of people, people don't know how to do it. And then our maintenance offering, I think what ended up providing the most value is not the keeping the code fresh parts, but it was more for the teams that don't have a large continuous development team having access to somebody who can fix quick bugs and things like that without needing to first negotiate a contract with a provider. I think that provides a lot of value. Those are pretty separate and different offerings. But those are the pieces that we found have really been valuable to clients. CHAD: Well, great. If people want to find out more about Mission Control or get in touch with you, where are the best places for them to do that? JOE: Well, we have a website thoughtbot.com/mission-control with a dash between mission and control. There are a few ways to reach out there. You can also find us on Twitter. We are @thoughtbot, and I am @joeferris. CHAD: Cool. You can subscribe to the show and find notes for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @cpytel. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time. Announcer: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success. Special Guest: Joe Ferris.
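Earlier in the conversation, Joe describes setting SLOs that are stricter than the SLA and tracking error budgets. A minimal sketch of that arithmetic follows. The 99.5% SLA, the 99.9% SLO, and the request counts are made-up numbers for illustration, and the function names are hypothetical.

```python
def allowed_failures(target: float, total_requests: int) -> int:
    """Number of failed requests a target (e.g. 0.999) tolerates in a window."""
    return round((1.0 - target) * total_requests)


def budget_remaining(target: float, total_requests: int, failed: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed = allowed_failures(target, total_requests)
    if allowed == 0:
        return 0.0
    return (allowed - failed) / allowed


# Hypothetical window: one million requests, 400 of which failed.
SLA_TARGET = 0.995   # what is promised to customers (assumed)
SLO_TARGET = 0.999   # stricter internal objective (assumed)
total, failed = 1_000_000, 400

slo_budget = allowed_failures(SLO_TARGET, total)   # failures the SLO tolerates
sla_budget = allowed_failures(SLA_TARGET, total)   # failures the SLA tolerates
remaining = budget_remaining(SLO_TARGET, total, failed)
```

Because the SLO's budget is smaller than the SLA's, an alert on the SLO budget running low fires well before the SLA itself is at risk, which is the early-warning effect Joe is pointing at.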

Latest podcast episodes about site reliability engineer

Reliability Through Planning with Matthew Gill

Reliability Radio EP 332: The Smart Digital Reality, Peter Bynarowicz – Hexagon

The One with Ben Good and Our Kubernetes Friends

The One With AI Agents, Ramón Llamas, and Swapnil Haria

The One with Technical Program Managers and Karanveer Anand

The One with STPA, Jeffrey Snover, and Theo Klein

The One with Startups and Adam Fletcher

The One with SLOs and Sal Furino

The One With the Future of SRE and Matt Zelesko

The One with AI and Todd Underwood

The One With Data Centers and Peter Pellerzi

We're back with Season 4!

Une tasse de fiabilité : Discussion avec un SRE - Quentin Joly

Generalist or Specialist: 米在住 Site Reliability Engineer のキャリアの考え方深掘り (Shuhei)

Evolving, Adapting, and Staying Prepared with Brian Weber

Safety vs Security with Thomas Depierre

Imperative vs. Declarative Change Workflows with Dominic Hutton & Niccolo' Cascarano

Human Factors in Complex Systems with Casey Rosenthal and John Allspaw

Embracing Complexity with Christina Schulman & Dr. Laura Maguire

Maglev: load balancing at Google with Cody Smith and Trisha Weir

Profiling data with Pat Somaru and Narayan Desai

Google Public DNS (8.8.8.8) with Wilmer van der Gaast and Andy Sykes

SRE in the Retail and Gaming Worlds with Jordan Chernev & Scott Bowers

Incident Response with Sarah Butt and Vrai Stacey

Building Reliable Systems with Silvia Botros and Niall Murphy

Creating Systems that are Safe with Liz Fong-Jones

Prometheus 3.0 Unveiled: PromCon Highlights with Julius Volz - OpenObservability Talks S5E04

Reliability Radio EP 310: Jeff Smith, The State of Condition Monitoring

Reliability Radio EP 309: Tim Rice, the concept of Defect Elimination

Reliability Radio EP 308: Jack Poley, CMI

Reliability Radio EP 307: Russ Parish, ReliabilityWeb

Reliability Radio EP 306: David Lockhart, Kaiser Permanente

Upskilling In the Caribbean

IBM, VMware, and Dedication with Alexandra McCoy

#98 - Service Levels 101 feat. Alex Ewerlöf - Sr Staff Engineer @ Volvo Cars & SRE Thought Leader

#27 - Growing as a Site Reliability Engineer (Part 3)

#26 - Growing as a Site Reliability Engineer (Part 2)

#24 - Growing as a Site Reliability Engineer (Part 1)

Life of An SRE with Dave Reisner

Life of An SRE with Jessica Theodat

Episode 28 - Navigating a Manful Career & Life w/ Chuks C-Madu

S23:E6 - From Site Reliability Engineer to Principal Software Engineer (Alice Goldfuss)

Hedge 148: The SRE with Niall Murphy (part 2)

Hedge 147: The SRE with Niall Murphy (part 1)

Conversations #57: DevOps and Site Reliability Engineering

Emily Rossetti: Site Reliability Engineer; One of the FAANG

Ep. 64 Bringing New Applications to Federal Financial Oversight

TikTok and Short Form Content for Developers with Linda Vivah

EP122 - Conversation on SREs with Brian Singer of Nobl9

403: Mission Control with Joe Ferris