The Software Misadventures Podcast is where we sit down with software and DevOps experts to hear their stories from the trenches about how software breaks in production. The show hosts - Ronak, Austin and Guang - have seen firsthand how stressful it is when something breaks in production, but they have also found it to be the best opportunity to learn about a system more deeply. On this show, they discuss with fellow engineers the principles and practical tips to build resilient software as well as advice to grow as technical leaders. Learn more at https://softwaremisadventures.com.
Ronak Nathani, Austin Ouyang, Guang Yang
Some reflections on running the podcast and Ronak has some eggciting news to share :) Music: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
Oxide co-founders Bryan and Steve are back on the show to give an impromptu peek at the Oxide server rack and to chat about writing their own manufacturing software, overcoming false summits before shipping the first rack, the #1 reason startups fail and more. Don't miss the full-circle moment on their "meet cute" story from last time, shared at the end of the conversation :) Segments: (00:00:00) The Oxide rack uncrating experience (00:02:40) The office tour (00:04:03) Challenges of shipping and unboxing hardware (00:11:04) Hybrid hardware company? (00:13:38) Custom designing a crate for the rack (00:18:12) Optimizing for time to value (00:20:43) Writing custom manufacturing software (00:23:25) Taking ownership of the customer experience (00:25:29) Buy vs build (00:27:46) The false summits before shipping the first rack (00:30:05) “Missing just enough context to be optimistic” (00:33:07) The #1 reason startups fail (00:38:49) Hiring the first sales role (00:44:53) The dangers of “happy ears” (00:47:18) The pitfalls of rushing to market (00:51:03) The “third VP of sales” problem (00:56:06) The value of a good sales leader (01:00:07) Curiosity and empathy in sales (01:03:41) Grooming sales skills as an engineer (01:07:33) Learning from current customers (01:09:13) Talk to prospective customers “that we have 0% chance of closing” (01:11:25) Actionable bad news (01:14:11) The role of GPUs in data centers (01:18:50) Cloud repatriation (01:24:23) Full circle to the “meet cute” Show Notes: Our previous convo: https://softwaremisadventures.com/p/oxide-ditching-the-rules Bryan on Twitter: https://x.com/bcantrill Steve on Twitter: https://x.com/sdtuck
Known for co-creating Django and Datasette, as well as his thoughtful writing on LLMs, Simon Willison joins the show to chat about blogging as an accountability mechanism, how to build intuition with LLMs, building a startup with his partner on their honeymoon, and more. Segments: (00:00:00) The weird intern (00:01:50) The early days of LLMs (00:04:59) Blogging as an accountability mechanism (00:09:24) The low-pressure approach to blogging (00:11:47) GitHub issues as a system of records (00:16:15) Temporal documentation and design docs (00:18:19) GitHub issues for team collaboration (00:21:53) Copy-paste as an API (00:26:54) Observable notebooks (00:28:50) pip install LLM (00:32:26) The evolution of using LLMs daily (00:34:47) Building intuition with LLMs (00:43:24) Democratizing access to automation (00:47:45) Alternative interfaces for language models (00:53:39) Is prompt engineering really engineering? (00:58:39) The frustrations of working with LLMs (01:01:59) Structured data extraction with LLMs (01:06:08) How Simon would go about building a LLM app (01:09:49) LLMs making developers more ambitious (01:13:32) Typical workflow with LLMs (01:19:58) Vibes-based evaluation (01:23:25) Staying up-to-date with LLMs (01:27:49) The impact of LLMs on new programmers (01:29:37) The rise of 'Goop' and the future of software development (01:40:20) Being an independent developer (01:42:26) Staying focused and accountable (01:47:30) Building a startup with your partner on the honeymoon (01:51:30) The responsibility of AI practitioners (01:53:07) The hidden dangers of prompt injection (01:53:44) “Artificial intelligence” is really “imitation intelligence” Show Notes: Simon's blog: https://simonwillison.net/ Natalie's post on them building a startup together: https://blog.natbat.net/post/61658401806/lanyrd-from-idea-to-exit Simon's talk from DjangoCon: https://www.youtube.com/watch?v=GLkRK2rJGB0 Simon on twitter: https://x.com/simonw Datasette: 
https://github.com/simonw/datasette
A Silicon Valley veteran known for writings like "The Death of the Junior Developer", Steve Yegge joins the show to chat about his "AI Midlife Crisis", the unique writing process he employs, and building the future of coding assistants. Segments: (00:00:00) The AI Midlife Crisis (00:04:53) The power of rants (00:09:55) “You gotta be able to make yourself laugh” (00:11:46) Steve's writing process (00:14:10) “I published them… and nothing happened for six months” (00:17:30) Key to perseverance in writing? Get pissed. (00:23:24) Writing in one sitting (00:29:05) The AI Midlife Crisis (00:35:04) Management to IC (00:38:35) The acceleration and evolution of programming (00:41:43) Picking up new skills in a new domain (00:43:40) The power of prompt engineering (00:47:27) Secondary hashing (00:50:47) The importance of context in coding assistants (00:53:56) “The future of coding assistants is chat” (00:57:15) The importance of platforms in coding assistants (01:02:30) The nefarious T-word in AI (01:06:32) The death of the junior developer and its consequences (01:09:35) The future of code understanding and semantic indexing (01:13:15) The power of context in AI platforms (01:16:21) Surprising capabilities of LLMs (01:21:04) Transferable skills in AI product development (01:23:53) Mental health and the innovator's dilemma Show Notes: The Death of the Junior Developer: https://sourcegraph.com/blog/the-death-of-the-junior-developer Steve's blog rants: https://steve-yegge.blogspot.com/ Steve's Medium posts: https://steve-yegge.medium.com/ Sourcegraph's blog: https://sourcegraph.com/blog Steve on Twitter: https://x.com/steve_yegge
A veteran of early Twitter's fail whale wars, Dmitriy joins the show to chat about the time when 70% of the Hadoop cluster got accidentally deleted, the financial reality of writing a book, and how to navigate acquisitions. Segments: (00:00:00) The Infamous Hadoop Outage (00:02:36) War Stories from Twitter's Early Days (00:04:47) The Fail Whale Era (00:06:48) The Hadoop Cluster Shutdown (00:12:20) “First Restore the Service Then Fix the Problem. Not the Other Way Around.” (00:14:10) War Rooms and Organic Decision-Making (00:16:16) The Importance of Communication in Incident Management (00:19:07) That Time When the Data Center Caught Fire (00:21:45) The "Best Email Ever" at Twitter (00:25:34) The Importance of Failing (00:27:17) Distributed Systems and Error Handling (00:29:49) The Missing README (00:33:13) Agile and Scrum (00:38:44) The Financial Reality of Writing a Book (00:43:23) Collaborative Writing Is Like Open-Source Coding (00:44:41) Finding a Publisher and the Role of Editors (00:50:33) Defining the Tone and Voice of the Book (00:54:23) Acquisitions from an Engineer's Perspective (00:56:00) Integrating Acquired Teams (01:02:47) Technical Due Diligence (01:04:31) The Reality of System Implementation (01:06:11) Integration Challenges and Gotchas Show Notes: - Dmitriy Ryaboy on Twitter: https://x.com/squarecog - The Missing README: https://www.amazon.com/Missing-README-Guide-Software-Engineer/dp/1718501838 - Chris Riccomini on how to write a technical book: https://cnr.sh/essays/how-to-write-a-technical-book Stay in touch: - Make Ronak's day by signing up for our newsletter to get our favorite parts of the convo straight to your inbox every week :D https://softwaremisadventures.com/
Known for hosting the CoRecursive podcast, which dives into the stories behind the code, Adam joins the show to chat about discovering that the great engineers he had looked up to are actually great communicators, his framework for building one of the best storytelling engineering podcasts, and the journey getting into DevRel. Chapters: (00:00:00) Highlights (00:04:23) The power of casual conversations (00:07:08) Taking the leap into podcasting (00:10:34) The hardest part of running a podcast (00:14:03) Learning to follow up (00:16:26) Storytelling in podcasting (00:20:36) The evolution of CoRecursive (00:21:19) What makes a good story? (00:24:48) Finding the right guests (00:30:26) Preparing for interviews (00:32:07) Favorite part of making a podcast episode (00:37:43) Learning from radio journalists (00:39:47) Overcoming self-doubt (00:44:27) Balancing passion projects with full-time work (00:46:38) The power of vulnerability in storytelling (00:53:29) Behind the scenes of developer relations (01:00:38) The great engineers you know are actually great communicators Show Notes: Adam on Twitter: https://x.com/adamgordonbell CoRecursive Podcast: https://corecursive.com/ Automating follow-up emails: https://www.followupthen.com/
As the original architect and API design lead of Kubernetes, Brian joins the show to chat about why "APIs are forever", the keys to evangelizing impactful projects, being an Uber Tech Lead at Google, and more. Segments: (00:03:01) Internship with Mark Ewing (00:07:10) “Mark and Brian's Excellent Environment” manual (00:11:58) Poker on VT100 terminals (00:14:46) Grad school and research (00:17:23) The value of studying computer science (00:21:07) Intuition and learning (00:24:06) Reflecting on career patterns (00:26:37) Hypergrowth and learning at Transmeta (00:28:37) Debugging at the atomic level (00:34:27) Evangelizing multithreading at Google (00:39:56) The humble beginnings of Borg and Kubernetes (00:47:10) The concept of inertia in system design (00:50:07) The genesis of Kubernetes (00:53:45) The open-source proposal (00:57:25) The Unified Compute Working Group (01:02:16) Designing the Kubernetes API (01:05:03) AIP.dev and API design conventions (01:08:02) The vision for a declarative model in Kubernetes (01:17:25) Kubernetes as a DIY platform (01:19:07) The evolution of Kubernetes (01:21:40) The complexity of building a platform (01:25:11) Style guides? (01:28:23) Gotchas in Kubernetes workload APIs (01:32:02) Understanding your thinking style (01:35:37) Reflections on Kubernetes design choices (01:44:08) The importance of getting it right the first time (01:48:13) Designing for flexibility (01:51:16) Collaboration and leadership (01:52:21) The role of an Uber tech lead at Google (01:56:33) “Giving away the Legos” (02:02:29) Picking the right person to hand off (02:06:41) Overcoming writer's block Show Notes: API Design conventions: https://google.aip.dev/ Brian's blog: https://medium.com/@bgrant0607
From building a new kind of server to building a new kind of company, co-founders Bryan and Steve join the show to chat about their "meet cute" and the origin story of Oxide, their unconventional recruiting process, transparent and uniform salaries, and their solution to the "N+1 shithead problem". Segments: (00:03:03) Bryan and Steve's "meet cute" (00:05:56) "the sun does not shine on me" (00:12:19) the dagger that went into Sun (00:21:23) culture of exonerating yourself vs solving customer problems (00:23:25) the shared "error in judgment" of joining Joyent (00:27:54) the origin story of Joyent (00:29:44) reporting to the (physical) chair (00:31:26) the comically bad CEO candidate (00:36:23) the enterprise software shift (00:40:21) the importance of curiosity in sales (00:48:30) filtering for curiosity in hiring (00:52:26) Oxide's unconventional hiring process (01:04:01) Bryan's worst hire (01:05:21) the limitations of traditional hiring (01:08:32) the value of written reflections (01:10:28) "what were the happiest moments in your career?" (01:21:16) misconceptions about sales and go-to-market (01:22:03) trust and alignment in sales (01:30:24) building connections across organizations (01:34:23) how to do performance reviews when everyone's paid the same? (01:40:00) the power of transparency in compensation (01:50:14) validation through impact (01:53:14) origins of On the Metal (01:55:45) transparency and open communication (02:01:32) the importance of storytelling (02:04:56) building a company differently Show Notes: - Bryan's blog post on the transparent and uniform compensation model at Oxide: https://oxide.computer/blog/compensat... 
- On the Metal's interview with Jeff Rothschild: https://share.transistor.fm/s/6fa1eaa4
From building a data platform and Parquet at Twitter to using AI to make biology easier to engineer at Ginkgo Bioworks, Dmitriy joins the show to chat about the early days of big data, the conversation that made him jump into SynBio, LLMs for proteins and more. Segments: (00:03:18) Data engineering roots (00:05:40) Early influences at Lawrence Berkeley Lab (00:09:46) Value of a "gentleman's education in computer science" (00:14:34) The end of junior software engineers (00:20:10) Deciding to go back to school (00:21:36) Early experiments with distributed systems (00:23:33) The early days of big data (00:29:16) "The thing we used to call big data is now AI" (00:31:02) The maturation of data engineering (00:35:05) From consumer tech to biotech (00:37:42) "The 21st century is the century of biology" (00:40:54) The science of lab automation (00:47:22) Software development in biotech vs. consumer tech (00:50:34) SWEs make more $$ than scientists? (00:54:27) LLMs for language is boring. LLMs for proteins? That's cool (01:02:52) Protein engineering 101 (01:06:01) Model explainability in biology Show Notes: The Death of the Junior Developer: https://sourcegraph.com/blog/the-death-of-the-junior-developer Dmitriy on Twitter: https://x.com/squarecog?lang=en Tech and Bio Slack community: https://www.bitsinbio.org/
Having quit Google in 2018 to bootstrap indie software businesses, Michael is known for writing very transparently about the ups and downs of his journey. After recently selling his hardware business TinyPilot for $600K, Michael returns to the show to chat about the misconceptions about running an indie business, the hardest part of selling a company, and why “hardware is definitely out” for his next move.
Well-known for his insightful and meticulous write-ups on testing distributed systems, Kyle (aka Aphyr) joins the show to chat about the origins of Jepsen, how he built a business around testing distributed systems, his writing process, favorite databases, and more. Segments: (00:03:29) From Physics to Software Engineering (00:07:47) The origins of Jepsen (00:09:41) Turning Jepsen into a full-time venture (00:13:14) Jepsen's testing philosophy (00:16:30) The consulting journey (00:19:16) Structuring a consultancy (00:22:32) Setting boundaries (00:24:32) Pricing misadventures (00:29:17) Pros and cons of being an independent consultant (00:32:08) Managing your time when working for yourself (00:38:23) Best part of the job (00:41:13) Early writing influences (00:45:25) LLMs and AI-generated content (00:48:17) “The period where you can trust what you read is actually very recent” (00:51:33) How to become a better writer (00:54:25) Developing a formal understanding of distributed systems (00:59:30) Common faults in distributed systems (01:01:17) The complexity of testing distributed systems (01:07:32) Communicating criticism effectively (01:10:26) Advice for distributed systems engineers (01:13:46) “Anybody trying to sell you a distributed lock is selling you sawdust and lies” (01:16:31) Failure mode documentation (01:18:52) The pitfalls of containerization (01:20:17) Lightning round - favorite databases Show Notes: “Anybody who is trying to sell you a distributed lock is trying to sell you sawdust and lies”: https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html Kyle's excellent write-ups on testing distributed systems: https://jepsen.io/analyses Kyle's blog: https://aphyr.com/posts Training courses that Kyle runs: https://jepsen.io/services/training
From creating one of Python's most influential libraries to co-founding Voltron Data, Wes joins the show to chat about why the book cover of the pandas book doesn't feature a panda, open source pitfalls to avoid, the pros and cons of hiring engineers at a non-profit, and more. Segments: (00:02:50) Guang's complaint about the pandas book cover (00:04:38) Quarto and Open Access Publishing (00:12:00) Convincing Wall Street to Open Source (00:15:31) Publishing the first Python package over Christmas (00:18:01) Doubling Down on Building pandas (00:23:23) Personal sacrifices for the sake of impact (00:26:28) The Evolution of Open-Source (00:29:19) “Open source development started out as a very privileged activity” (00:32:40) The Consulting Trap (00:35:17) The Startup Trap (00:39:29) The Corporate User Trap (00:44:21) Avoiding the Startup Trap (00:46:54) Non-Profit vs. For-Profit (00:48:09) The Challenges of Hiring Engineers in a Non-Profit Setting (00:50:08) The Benefits of Remote Work for Open Source Development (00:52:15) Balancing Open Source and Enterprise Interests (00:57:25) New Funding Models for Open Source? (01:00:01) Getting into VC (01:06:19) The Future of Composable Data Systems Show Notes: - online edition of pandas book: https://wesmckinney.com/book/ - the new digital publishing tool that Wes recommends: https://quarto.org/
From creating Envoy to co-founding bitdrift to reimagine mobile observability, Matt joins the show to chat about being told to simply “write some proxy in Python” in the early days of building Envoy, early influences from building “shrink wrap” software at Microsoft, the process of spinning bitdrift out of Lyft, and much more. Segments: (00:03:10) Being a plumber on LinkedIn (00:05:00) Early influences from building “shrink wrap” software at Microsoft (00:10:44) Getting diverse work experiences (00:16:36) Setting high standards for the team (00:20:42) Lessons from failure of the first startup (00:22:02) Building a successful open source project vs. running a startup (00:25:25) Why not start a company around Envoy? (00:29:54) Why not open source bitdrift? (00:36:01) Mitigating the risk of big companies building in-house solutions (00:38:16) Co-founding bitdrift to tackle mobile observability (00:40:37) Applying lessons from the first startup failure (00:44:14) Why mobile observability is so hard (00:50:06) Open source vs source available (00:53:33) The software licensing strugglebus (00:58:03) How bitdrift was spun out of Lyft (01:03:36) Achieving work-life balance through leverage (01:06:13) The early days of Envoy (01:09:20) Impact driven development (01:13:43) The crazy decision to build Envoy in retrospect Show Notes: Matt's blog post on why mobile observability is a hard problem: https://mattklein123.dev/2024/04/24/no-one-talks-about-mobile-observability/ The new company Matt is building: https://bitdrift.io/
From being a distinguished engineer at Sun Microsystems to co-founding Oxide Computer Company to build a new kind of server, Bryan joins the show to chat about being told that he's on a suicide mission when starting Oxide, the moment he felt “I'm actually living HBO Silicon Valley”, and lessons from Sun. And much more. Chapters: (00:02:24) The Origin of Bryan's Nom-de-Guerre: "Colonel of Data Corruption" (00:04:02) What Debugging Performance Issues at Twitter in the Early Days Revealed About Silicon Valley (00:13:37) Value of Formal Education and the Experience That Everyone Should Have (00:16:02) Balancing Following One's Passion vs. Having Stability (00:21:14) What Shaped Bryan's Sense of Integrity (00:25:39) The Moments When Values Are Instilled (00:30:25) The Dark Side of Tech (00:35:12) "Economic Opportunities Attract Economic Opportunists" (00:40:35) The Origins of Oxide Computers (00:50:20) Building the A-Team (00:52:18) "Compaq Was the Most Successful Startup" (00:55:51) The Venture Capitalist's Dilemma (01:03:04) Being Told "You're on a Suicide Mission" (01:07:12) The Lifestyle of the "Lifestyle Business" (01:09:30) The Harsh Reality of Raising Venture Capital (01:13:12) The Challenges of Building Hardware (01:16:36) Why You Should Think About Not Only Gross Margin but Net Margin (01:19:14) Hardware and Software Co-Design (01:22:06) The Frustrations of Infrastructure Deployment (01:26:46) Finding the Right VCs (01:28:16) "Oh My God, I'm Actually Living HBO Silicon Valley" (01:33:12) Oxide's Principles and Lessons from Sun Microsystems (01:39:51) Sun's Unspoken Values (01:45:03) Sun's Legacy of Empowering Employees (01:48:53) Sun's Missed Opportunities (01:53:04) The Reason Why Sun Survived the Dot-Com Crash (01:56:21) "God Bless the Early Adopters" (01:57:39) A Tweet from Shopify's CEO (02:01:24) The Hard Thing About Hard Things (02:12:55) The Hardest Moment in Oxide's History Show Notes: - Oxide's principles: https://oxide.computer/principles - Requests 
for Discussion (RFDs): https://rfd.shared.oxide.computer/ - Tobi's tweet: https://x.com/tobi/status/1793798092212367669 - Bryan on Twitter: https://x.com/bcantrill
From writing the first lines of Kafka over a Christmas break as a LinkedIn engineer to running a public company as the CEO of Confluent, Jay joins the show to chat about how he and his co-founders convinced investors to take a chance on their vision, what many engineers get wrong about communication, and why engineers can make great CEOs - even when coding is not in the job description. And much more. Segments: (00:01:16) The Shaved Head Bet (00:04:07) Fundraising (00:12:16) The Role of Technical Background in VCs (00:15:48) The power of believing in the possibility of important changes (00:18:29) The Journey to starting Confluent (00:27:11) Kafka's Controversial Beginnings (00:34:30) Effective Communication in Engineering (00:44:20) The Early Days of Kafka (00:48:31) The Power of Storytelling (00:57:19) Early days of Confluent (01:03:06) Do Engineers Make Good CEOs? (01:07:59) A Typical Day in the Life of a CEO (01:12:24) The Evolution of Data Streaming Show Notes: - “The log” blog post that solidified Jay and his co-founders' conviction to found Confluent: https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying - Jay on Twitter: https://x.com/jaykreps
If you've worked on data problems, you probably have heard of Airflow and Superset, two powerful tools that have cemented their place in the data ecosystem. Building successful open-source software is no easy feat, and even fewer engineers have done this back to back. In part 2 of the conversation, we talk about Max's journey in open source. Segments: (00:03:27) “Project-Community Fit” in Open Source (00:08:31) Fostering Relationships in Open Source (00:10:58) Dealing with Trolls (00:13:40) Attributes of Good Open Source Contributors (00:20:01) How to Get Started with Contributing (00:27:58) Origin Stories of Airflow and Superset (00:33:27) Biggest Surprise since Founding a VC-backed Company? (00:38:47) Picking What to Work On (00:41:46) Advice to Engineers for Building the Next Airflow/Superset? (00:42:35) The 2 New Open Source Projects that Max is Starting (00:52:10) Challenges of Being a Founder (00:57:38) Open Sourcing Ideas Show Notes: Part 1 of our conversation: https://softwaremisadventures.com/p/maxime-beauchemin-llm-ready Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/ SQL All Stars: https://github.com/preset-io/allstars Governator: https://github.com/mistercrunch/governator
If you've worked on data problems, you probably have heard of Airflow and Superset, two powerful tools that have cemented their place in the data ecosystem. Building successful open-source software is no easy feat, and even fewer engineers have done this back to back. In Part 1 of this conversation, we chat about how to adapt to the LLM-age as engineers. Segments: (00:01:59) The Rise and Fall of the Data Engineer (00:11:13) The Importance of Executive Skill in the Era of AI (00:13:53) Developing the first reflex to use AI (00:17:47) What are LLMs good at? (00:25:33) Text to SQL (00:28:19) Promptimize (00:32:16) Using tools LangChain (00:35:02) Writing better prompts Show Notes: - Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/ - Rise of the Data Engineer: https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603 - Downfall of the Data Engineer: https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b - Promptimize: https://github.com/preset-io/promptimize
Out of thousands of engineers at Uber, there's only a handful of Distinguished Engineers and Joakim was one of them. In this conversation we chat about: Why software engineering is a lot like a sausage factory. Considerations for leaving big tech for a startup. “How to beat the promo committee”. How can one effectively shape engineering culture? “Mentoring two people on the same team is a waste”. Much more. Segments: [0:01:52] The “reverse sausage” architecture [0:07:36] How to get people on board with the new deployment system? [0:13:55] What does it mean to be a distinguished engineer? [0:17:47] Under-appreciated soft skills? [0:21:28] How to improve technical writing [0:24:16] Do all senior engineers need to write and review code every day? [0:30:19] How to search out where to contribute when your time is so constrained? [0:43:10] How to maximize your impact as a mentor [0:48:52] “How to beat the promo committee” [0:52:56] Effective means to influence engineering culture? [0:57:09] Capping the company at 150 employees [1:03:33] Why join a startup instead of moving to another big tech company? [1:11:14] What Joakim is working on now at Beyond Work Show Notes: Joakim on leaving Uber to start Beyond Work: https://www.linkedin.com/pulse/why-i-left-uber-start-beyond-work-joakim-recht-o63of?trk=public_post_feed-article-content Read Joakim's other excellent posts here: https://www.linkedin.com/in/recht/recent-activity/all/
We're super excited to have Kelsey back on the show! Our last conversation was around his incredible career journey - from working at McDonald's after school to starting his own computer store, to hacking on Python infrastructure with the core developers, to meeting Satya Nadella for an interview. In part two of this conversation, we dive deep into Kelsey's experiences learning in public and writing “Kubernetes: Up and Running”: The biggest barrier to getting started with learning in public and a step-by-step guide to overcome it. Cautionary tale of the “JavaScript sucks” guy. Developing the skill of crafting good analogies. The business and economics of writing a book. Much more. Segments: [0:01:12] Writing and learning in public. [0:10:58] Writing "Kubernetes: Up and Running." [0:16:05] The business and economics of writing a book. [0:21:27] Why your first book should not exceed 100 pages. [0:23:36] What prevented Kelsey from giving up on the book. [0:26:15] Being intentional about building an audience and the cautionary tale of the "JavaScript sucks" guy. [0:36:44] Authenticity does not guarantee success. [0:39:09] Developing the skill of crafting effective analogies. [0:47:47] Advice for engineers to leverage their technical skills outside of the nine-to-five. Show Notes: Kelsey on Twitter: https://twitter.com/kelseyhightower Our previous conversation with Kelsey about retiring as Distinguished Engineer from Google at 42: https://softwaremisadventures.com/p/kelsey-hightower-on-retiring-as-distinguished-057
We're super excited to have Kelsey back on the show! Our last conversation was around his incredible career journey - from working at McDonald's after school to starting his own computer store, to hacking on Python infrastructure with the core developers, to meeting Satya Nadella for an interview. In part one of this conversation, we dive deep into Kelsey's experiences and expertise as a startup advisor: How to break into advising when you don't have a lot of connections. How to influence without authority. Passive vs. active advising. How to add value as an advisor. Setting boundaries and expectations. Much more. Segments: [0:01:53] Being a "junior retiree" [0:11:00] How Kelsey got started with startup advising. [0:17:43] How to avoid mismatches in advisory engagements? [0:27:23] How to influence without authority as an advisor? [0:32:58] How to establish boundaries as an advisor. [0:38:29] Actions engineers can take today to prepare themselves for future startup advising roles. [0:42:55] How to manage the balance between advising and your primary job. [0:44:32] How to cultivate perspectives beyond engineering. Show Notes: Kelsey on Twitter: https://twitter.com/kelseyhightower Our previous conversation with Kelsey about retiring as Distinguished Engineer from Google at 42: https://softwaremisadventures.com/p/kelsey-hightower-on-retiring-as-distinguished-057
As a self-described “gainfully unemployed data person”, Josh Wills is an angel investor and has worked on and led data teams at Slack, Cloudera, WeaveGrid and Google. We discuss: How to get started with angel investing without a ton of $$. Attributes that define great engineering managers. What's it like transitioning from management back to IC. Challenges in Climate Tech from a software perspective. And more. Segments: [0:01:35] Transitioning from management to individual contributor (IC). [0:10:19] Emotional intelligence and its role in engineering management. [0:25:21] Contrasting the hard power of management with the soft power of senior individual contributors. [0:37:18] Addressing challenges in climate technology. [0:51:34] The importance of practicality and how to assess it in interviews. [0:56:01] Josh's journey into angel investing. [1:12:59] Criteria used by Josh to evaluate whether to invest in a startup. Show Notes: Josh on Twitter: https://twitter.com/josh_wills The “Touchy Feely” course at Stanford: https://www.gsb.stanford.edu/experience/learning/leadership/interpersonal-dynamics Jason Calacanis's book on angel investing: https://www.amazon.com/Angel-Invest-Technology-Startups-Timeless-Investor/dp/0062560700
Known for coining the term “Data Scientist”, DJ is a renowned technologist with a diverse background spanning academia, industry, and government. Having led product teams at companies like RelateIQ and LinkedIn, DJ was appointed by President Obama to be the first U.S. Chief Data Scientist, where his efforts led to the establishment of nearly 40 Chief Data Officer roles across the Federal government, new health care programs, and new criminal justice reforms. We discuss: “Dream in years, plan in months, evaluate in weeks, ship daily”. High school misadventures that shaped DJ's world view. Under-hyped opportunities in AI. Building with the customer vs. “if you build it, they will come”. Do we need more regulations on AI? Much more. Segments: [0:01:48] Picking locks in high school. [0:07:15] How can we make it easier for others to take a risk on us? [0:11:29] How do you decide whom to take a chance on? [0:14:24] The 70-20-10 framework for choosing what to work on. [0:17:49] "No rules, only guidelines." [0:24:09] Developing personal ethics. [0:30:52] Building with the customer versus "if you build it, they will come." [0:34:51] "Dream in years, plan in months, evaluate in weeks, ship daily." [0:43:56] Ideas should be considered in terms of momentum. [0:46:11] Under-hyped trends in AI? [0:51:53] How does AI need to evolve to operate in fields that require very low margins of error? [0:56:09] Concerning advances that lack sufficient guardrails? [0:58:55] Do we need more regulations on AI? [1:02:48] "Failure is the only option." Show Notes: DJ Patil on LinkedIn: https://www.linkedin.com/in/dpatil/ The card that DJ carried in his notebook: https://twitter.com/DJ44/status/819316928623902720 DJ's interview series with thought leaders in Data Science: https://www.linkedin.com/learning/data-impact-with-dj-patil/data-science-how-did-we-get-here
Erica is a former VP of Engineering at LinkedIn. Having almost dropped out of college, Erica's journey in tech is a testament to her perseverance and dedication. In addition to leading engineering teams at LinkedIn, Erica founded WIT (Women In Tech) to empower women within the company as well as the broader tech community. We discuss: How to create incentives for diversity-building work. Building your personal “board of directors”. Balancing mentoring work vs sprint tickets. Structuring a community for long-term success. Much more. Segments: [0:18:04] building women-in-tech and the importance of leading by example [0:21:17] creating incentives for diversity-building work [0:23:30] examples of building better products with more diverse stakeholders [0:29:48] how to spot red flags during the interview process [0:32:51] do men and women bring different skill sets to the problem or it's all individual based? [0:35:34] building your personal “board of directors” [0:40:21] how to ask people for mentorship if I'm shy? [0:44:21] exploring new projects [0:53:32] how to hold yourself accountable when there's no structure? [1:03:17] how to structure a community for long-term success [1:10:22] how to balance mentoring work vs sprint tickets [1:14:57] journey to being on the advisory board for SJU Show Notes: Erica on Linkedin: https://www.linkedin.com/in/ericalockheimer/ Stay in touch:
Mitchell co-founded HashiCorp in 2012 and created many important infrastructure tools, such as Terraform, Vagrant, Packer, and Consul. In addition to being a prolific engineer, Mitchell grew HashiCorp into a multi-billion-dollar public company. We discuss: How to structure large projects to avoid demotivation or burnout The "A.P.P.L.E" framework for defusing tense situations and handling trolls How to decide what to work on Mitchell's unconventional transitions from CEO to CTO and then back to an individual contributor (IC) The quality that Mitchell values the most in an engineering team. Segments: [0:14:19] Impactful lessons from working at the Apple Store in college [0:22:26] Origin story of HashiCorp [0:26:08] College side project that turned into Mitchell's first financial success [0:31:25] Why infrastructure? [0:39:50] How individual products came about [0:44:17] Challenges of fundraising as a company with an umbrella of products [0:48:20] Balancing being the CTO and writing code: “I didn't want to be that CTO that just produced technical debt” [0:53:09] Transitioning from CEO to co-CTO [0:57:26] From CTO to Individual Contributor [1:06:03] What's next? Show Notes: Mitchell's blog: https://mitchellh.com/writing The “APPLE” principle that has guided Mitchell throughout his career: https://mitchellh.com/writing/apple-the-key-to-my-success Mitchell's Startup Banking Story
After 17 years building SRE teams at Google and serving as the Site Lead for Engineering in Dublin, Dave joined Elastic as the Sr Director of Engineering and later VP of Engineering at Twilio. Following a recent career break, Dave now divides his time between coaching engineering leaders and consulting to help busy teams be more effective. In the heart of our conversation, Dave shares the frameworks and practical tips he's amassed for making the most of the mentorship experience. Segments: [00:01:45] Growing remote SRE team as the Google Dublin Site Lead [00:19:49] Company Culture vs Company Values [00:23:47] How to find companies that are serious about remote work [00:34:26] Coaching vs Mentoring at Big vs Small companies [00:38:35] How Google does coaching & mentoring [00:41:38] What makes a good 1-1 [00:46:56] Considerations for seeking out a mentor [01:03:27] Getting external mentorship while working at a small company [01:08:20] How to set specific goals for mentorship [01:20:13] The “CIA” Method for career decision making [01:31:08] How to sunset mentorship 1-1s [01:35:20] Venturing into consulting to help busy teams be more effective [01:42:13] How to get started with consulting Show Notes: Dave on LinkedIn: https://www.linkedin.com/in/gerrowadat/ Dave's personal website: https://log.andvari.net/pages/about.html Dave's coaching website: https://www.strategichopes.co/ Service Level Objectives by Alex Hidalgo: https://www.oreilly.com/library/view/implementing-service-level/9781492076803/ The Staff Engineer's Path by Tanya Reilly: https://www.oreilly.com/library/view/the-staff-engineers/9781098118723/ Stay in touch:
At the personal request of Reid Hoffman to emerge from early retirement, David joined LinkedIn in 2009 during a period of rapid growth to help stabilize the chaos, cultivating a much-needed culture of “Site Up and Secure.” Before this, David served as SVP of Engineering and Operations at Yahoo!, overseeing their Search Marketing organization and the Production Operations infrastructure for the entire company. Throughout his career, David has held multiple leadership positions and is recognized as one of the top operations executives. David's intensity, passion, courage and commitment to work have always been deeply admired by his colleagues, and his wisdom, well captured in one-line axioms better known as Henkeisms, is still echoed at LinkedIn. This episode was first published almost 3 years ago and we are sharing it again because it's been one of our favorites :) Hope you like it too! Segments: [00:01:37] “This is my freaking site” poster [00:04:10] David's first 2 retirements and starting at LinkedIn [00:09:41] IC to Management [00:17:20] Site-Up Culture [00:21:58] Re-architecting LinkedIn's release process [00:27:23] War stories from Yahoo: The 10G Massacre [00:32:06] “Go to work every day willing to be fired”: Project Panama at Yahoo [00:43:33] The power of Individual Contributors Show Notes: David Henke on LinkedIn Learning to Lead - David Henke's talk on leadership which he delivered at his alma mater - UCSB Project Inversion at LinkedIn Stay in touch:
Before joining CueIn last year as a Founding Data Scientist, Melissa was a Lead Data Scientist at Salesforce working on the Einstein Platform that focused on automating Data Science workflows. In this conversation we dive into Melissa's unique journey, what to do in the face of increasing job automation and explore the latest developments in practical AI. Segments: [00:02:13] Melissa's background in computational neuroscience [00:06:08] 7 years at Salesforce vs startup [00:11:31] Joining CueIn [00:19:30] Chatbot observability [00:28:16] Feedback loops [00:33:10] Use LLM to observe.. LLMs? [00:39:06] AI automating jobs [00:43:01] Doing ML in 2017 vs now [00:50:35] Few shot learning, Hugging Face Show Notes: Melissa's Linkedin: https://www.linkedin.com/in/melissajanerunfeldt/ Stay in touch:
What's it like to open source an internal project at a big tech company like LinkedIn? When should a company open source a project and what are the benefits and challenges that come along with it? If you want to open source an internal project, how should you go about advocating for it? Félix is a Principal Staff Engineer at LinkedIn where he works on the data infrastructure team that builds Venice. Venice is a distributed derived data store which LinkedIn open sourced in the fall of 2022. He joins the show to chat about his experiences leading the open source efforts for Venice, as well as his thoughts on balancing leadership with execution, delegating responsibility and fostering a culture of ownership, and growth within a team. --- Show Notes: Check out Venice: https://github.com/linkedin/venice Félix's linkedin: https://github.com/linkedin/venice --- Stay in Touch: ✉️ Subscribe to our newsletter: https://softwaremisadventures.com
Should engineers and product managers “stay in their lanes”? What big company habits should you keep vs unlearn when transitioning to working at a start-up? Could an ayahuasca retreat give you more clarity on your career goals? Ilya and Arnab join the show to share their journey quitting big tech to bootstrap a podcasting startup. Arnab and Ilya are the co-founders of Metacast. Before starting the company, Arnab was a Principal Engineer at AWS while Ilya was a Sr. Product Manager at Google and Principal PM at Amazon before that. While at Amazon, Arnab and Ilya worked together on various projects including AWS Chatbot, which they started from scratch and launched into a successful AWS service. Show Notes: Sign-up for the podcast app that they're launching soon: metacast.app Newsletter about their startup journey: https://www.metacastpodcast.com/ Stay in Touch: ✉️ Subscribe to our newsletter: https://softwaremisadventures.com
What's "AI in a Box"? Pete Warden joins the show to share a new project he recently launched that encapsulates Language Transcription/Translation and Question Answering capabilities into a wallet-sized board running locally without internet, as well as stories and learnings from building his new company, Useful Sensors, after 7 years of leading the TensorFlow Mobile project at Google. Pete is the CEO of Useful Sensors. After founding his own company Jetpac and selling it to Google in 2014, he became a staff research engineer at Google, where he led the TensorFlow Mobile team. Pete is also the author of two well-received books: "Public Data Sources" and "Big Data Glossary" and builder of OpenHeatMap. Show Notes: AI in a Box crowdfunding: https://www.crowdsupply.com/useful-sensors/ai-in-a-box Pete's blog: https://petewarden.com/ Useful Sensors: https://usefulsensors.com/ Stay in touch: ✉️ Subscribe to our newsletter: https://softwaremisadventures.com
What does it mean to change the economics of software development? Nathan Marz joins the show to share how they reduced the cost of building Mastodon at Twitter-scale by 100X and the 10-year journey to build Rama, a new programming platform that made this feat possible. Nathan is the founder of Red Planet Labs. Prior to RPL, he led engineering for BackType which was acquired by Twitter in 2011. Nathan created the Apache Storm project and wrote the book Big Data: Principles and best practices of scalable realtime data systems. Outside of working, Nathan is a private pilot, loves going to stand-up comedy shows, and is forever trying to teach his dog new tricks. Show Notes: Nathan's Twitter: https://twitter.com/nathanmarz What is Rama? https://redplanetlabs.com/learn-rama Reducing the cost of building Mastodon at Twitter-scale by 100X: https://blog.redplanetlabs.com/2023/08/15/how-we-reduced-the-cost-of-building-twitter-at-twitter-scale-by-100x/ Stay in touch: ✉️ Subscribe to our newsletter: https://softwaremisadventures.substack.com
Kelsey Hightower was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. In this second part of the conversation, we focus on Kelsey's retirement - the financial planning that enabled him to retire at 42, how he got started advising startups and his perspectives on compensation, turning down a substantial offer from Microsoft and meeting Satya Nadella in person. And, of course, plans for the future.
Kelsey Hightower was a Distinguished Engineer at Google, where he worked on Google Cloud Platform. In this first part of the conversation, we delve into pivotal moments in Kelsey's career journey ranging from buying his first car by working at McDonald's after school, to starting his own computer store that turned into a music studio after 6pm, to hacking on Python infrastructure with the core developers. Through these stories, we learned a ton about how Kelsey thinks about acquiring new skills - getting paid for it, breaking into the world of open source, navigating corporate politics, building trust within a team, and much more.
Julie Amundson is a Sr Staff Software Engineer at Google working on Machine Learning Infrastructure. Prior to Google, she was the Director of Machine Learning Infrastructure at Netflix. Julie decided to take a career break last year when she was affected by mass layoffs. In this conversation, we talk to her about what it was like to find a job during hiring freezes, what it was like to position herself in this market, whether the interviewers cared about the career break she took and how the career break changed her perspective towards work and life.
Chris Pruett is the CTO and Co-founder of Jam - a new way to share and listen to bite-sized audio. Prior to Jam, Chris spent 9+ years at LinkedIn growing from an engineering manager to VP of Engineering. During his tenure at LinkedIn, he worked on almost all aspects of the app and towards the end, led an org of 500+ engineers working on Feed, Messaging, Identity and Search. In this episode, we discuss how he made the decision to leave his leadership position at LinkedIn and co-found Jam. We also spoke about his time at LinkedIn and how he developed the practice of making value-based decisions in both his professional and personal life.
Short episode about reflections on the past year and plans for 2022.
Kailash is the CTO at Zerodha, the largest stock broker in India. In this conversation, we speak with him about absurdism - a philosophy that guides his personal and professional worldview. We discuss how he built Zerodha's tech team, their team culture and how the team operates so efficiently while being so lean. We also discuss why Zerodha self-hosts all of their tech stack, what they look for when hiring engineers and how their systems scaled when the user base grew from 2 to 8 million in 18 months.
Michael Lynch is the founder of TinyPilot. After doing software engineering at Microsoft and Google for 7 years, Michael decided in 2018 to quit and start working for himself by building small software businesses. From years of negative profit to now building a $450K+ ARR hardware business, Michael joins the show to chat about what made him quit his cushy job at Google, how he builds in public with monthly retrospectives, what he has learned over 3 years of indie hacking and much more.
Cory is currently a Solutions Engineer at Jeli.io and very well known in the community for his work on Observability. His career in observability began at Twitter where he managed the observability team and then he joined Stripe, where he created and led the observability team, this time around as a Principal Engineer. We talk to him about how he got his start in customer support and the role it played in the later part of his career. We discuss his time at Twitter where there was a power outage in the data center on the day he joined and how once he had to stay up all night dealing with file handle leaks. We also discuss how he created and led the observability team at Stripe as an individual contributor, how one can succeed in a new org, how to navigate information asymmetry in the workplace, what are some effective ways to advocate for your team and how we all are just humans trying to get stuff done.
Ashwin is a Startup Partnership Lead at Stripe. From web development to co-founding a YC startup, to deep learning, Ashwin has a knack for picking up new skills extremely quickly. In this episode, we chat about the methods he employed to successfully make these transitions, learnings/tips from winning 30+ hackathons in a row, and what engineers can gain from better story-telling.
Bruno Connelly is a VP of Engineering at LinkedIn. He leads the Site Engineering org responsible for LinkedIn's production infrastructure. He joins the show to talk about his journey in tech - from teaching himself how to code at a young age, building, maintaining and reverse engineering software as a teenager, building ISPs in the early part of his career (there are some fun stories that involve sleeping in the data center) to leading the SRE org at LinkedIn over the last decade. He talks about the early days at LinkedIn that involved a lot of firefighting to keep the site up, how the team built technical stability and scaled the platform. We also dive into how he grew the SRE org globally and overcame challenges that came with the growth. Throughout the conversation, he shares various nuggets of wisdom - like how to stay calm under pressure and how to make people feel at ease - as he describes his leadership style, people who have influenced him and what he thinks is a positive way to collaborate with people. Website link: https://softwaremisadventures.com/bruno Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en
With 5+ years of experience building resilient systems at the Netflix scale, Lorin joins the show to chat about his favorite incident story, the path that led him to doing chaos engineering (and later away from it), and advocating for a dedicated analyst to talk to people after an incident. Throughout the conversation, Lorin shares his philosophy and tips on how to learn from incidents, what engineers can gain from writing better, and why some metrics may not be as useful as you think.
Spoons is the Co-founder and Chief Architect of Lightstep. He joins the show to talk about building systems at Google scale and various aspects that make Google a weirder place than other companies. We talked about Spoons's journey of leaving Google and deciding to join Lightstep as a co-founder. We dig into the challenges during the early days of Lightstep and discuss the importance of speaking to customers to build the right product. We talk about what it's like to start a family and run a startup and how one can be intentional about building a company's culture. As always, we go through some of the misadventures and one of them involves a cable being cut under the English Channel.
Having led 100+ ML projects at Insight and built ML systems at Stripe scale, Emmanuel joins the show to chat about how to build useful ML products and what happens next when the model is in production. Throughout the conversation, Manu shares stories and advice on topics like the common mistakes people make when starting a new ML project, what's similar and different about the lifecycle of ML systems compared to traditional software, and writing a technical book.
Todd is a Sr Director of Engineering at Google where he leads Site Reliability Engineering teams for Machine Learning. Having recently presented on how ML breaks in production, by examining more than a decade of outage postmortems at Google, Todd joins the show to chat about why many of the ways that ML systems break in production have nothing to do with ML, what’s different about engineering reliable systems for ML vs traditional software (and the many ways that they are similar), what he looks for when hiring ML SREs, and more.
Evan Estola (https://twitter.com/estola) is a Director of Engineering at Flatiron Health where he's leading software engineering teams focused on building Machine Learning products. Throughout this episode, Evan shares various stories of recommendation systems that didn’t work as expected, like the time when members saw the mathematically worst recommendations for meetups near them. He also shares why Schenectady, NY pops up on some lists of most popular cities and the story behind the Wall Street Journal article titled 'Orbitz steers Mac users to pricier hotels'. We also discuss skills Evan looks for when hiring ML engineers, how to give constructive feedback, filter bubbles and much more.
Uma is a VP of Engineering at Render. In this episode, she shared with us her insights on how to successfully manage large-scale infrastructure migrations. We discussed the importance of communicating the “why” behind a migration, identifying success metrics, creating a culture where migrations are identified as highly impactful projects and much more. Uma also shared stories where parts of a migration didn’t go as planned, how the team fixed the issue and the kind of engineers she thinks would make good tech leads for these projects. We had a great time speaking with Uma! There’s a lot to learn from Uma’s experience. Please enjoy this highly educational conversation with Uma Chingunde!
Charity Majors (https://twitter.com/mipsytipsy) is the co-founder and CTO of Honeycomb.io. Before this she worked at Facebook, Parse and Linden Lab on infrastructure and developer tools, and always seemed to wind up running the databases. She is the co-author of the book Database Reliability Engineering and also has an amazing blog at charity.wtf. We love her blog posts and have learned a lot from them. We had a lot of fun speaking with Charity in this lively conversation! We learned about her journey from being an engineer to co-founding Honeycomb, what it was like being on-call when she was only 17, and staying calm during production incidents. We talked about various production outages throughout the episode and our favorite involved driving to a datacenter to flip a DB switch. Charity also shares what it takes to build an awesome engineering culture, the engineer/manager pendulum, and qualities Charity looks for when hiring senior engineers.
Tammy Bryant Butow is a Principal SRE at Gremlin where she works on Chaos Engineering. In this episode, we discuss how her curiosity led her to the world of infrastructure engineering, an outage from her early days where a core switch took down half the datacenter, her experience running a disaster recovery test and how it taught her about the importance of injecting failures into a system to make it more resilient. We also touch on advanced failure injection techniques, how chaos engineering is evolving and how extreme sports help Tammy keep calm under pressure. Lastly, Tammy has some great advice for teams looking to get started with chaos engineering.
Oliver Leaver-Smith, better known as Ols, is a Senior Devops Engineer at Sky Betting and Gaming. In this episode, we discuss how a seemingly simple monitoring change ended up taking down the entire site. We also talk about chaos and resilience engineering. We discuss how the team at Sky Betting and Gaming conducts fire drills (chaos engineering exercises) where they not only test the resiliency of their software systems but also their people systems. We walk through a recent example of a fire drill, how they have evolved over the past few years and the lessons learned in the process.
Ryan Underwood is a Staff SRE and tech lead on the Helix and Zookeeper SRE team at LinkedIn. Prior to LinkedIn, he was an SRE at Machine Zone and Google. Apart from his regular responsibilities, Ryan’s interests and expertise include debugging production kernel, I/O and containerization issues. His opinion about not treating software as a black box and his persistent approach to debugging complex problems are truly inspiring. On several occasions, Ryan’s colleagues have leaned on him to solve an esoteric problem that everyone thought was insurmountable. Our main focus today is one such problem that Ryan and team ran into while upgrading machines to the 4.x kernel, which resulted in elevated 99th percentile latencies. We dive into what the problem was, how it was identified and how it was fixed. We discuss some of the tools and practices that are helpful in debugging system performance issues. And we also talk about Ryan’s background and how his curiosity landed him a career in Site Reliability Engineering. Please enjoy this deeply technical and highly educational conversation with Ryan Underwood. Website link: https://softwaremisadventures.com/ryan Music Credits: Vlad Gluschenko — Forest License: Creative Commons Attribution 3.0 Unported: https://creativecommons.org/licenses/by/3.0/deed.en