A podcast about building and operating reliable software.
Austin Parker and Isobel Redelmeier
We got paged, but how'd we do? George Miranda (@gmiranda23) from PagerDuty joins us to talk incident response and grade our performance on a recent incident! Also: us-east-1's very bad day and what $0.96 in AWS credit will get you. Links and References - Amazon AWS Outage Shows That The Cloud Is Not Always Safe; PagerDuty Ops Guides
The internet broke, did you notice? We're talking BGP, WAF, and more. Also, an interview with Fred Hebert (@mononcqc) and Tristan Sloughter (@t_sloughter) on Erlang, Elixir, and doing ops good. Links and References - Cloudflare outage writeup (https://blog.cloudflare.com/details-of-the-cloudflare-outage-on-july-2-2019/) Stripe outage writeup analysis (https://lorinhochstein.wordpress.com/2019/07/27/contributors-mitigators-risks-stripe-2019-07-10-outage/) BGP outage article in Slate (https://slate.com/technology/2019/06/verizon-dqe-outage-internet-cloudflare-reddit-aws.html) Property-Based Testing with PropEr, Erlang, and Elixir (https://propertesting.com/) Adopting Erlang (https://adoptingerlang.org/) Erlang In Anger (https://www.erlang-in-anger.com/) How I Start (https://howistart.org/)
We're back with a special episode that's all about OpenTelemetry! Joining us today is Ted Young (@tedsuo), Director of Open Source at LightStep and a maintainer of the OpenTelemetry project, here to talk about what this new project is, how it helps solve some of the biggest problems in observability libraries, and how you can get involved. Find OpenTelemetry on the web at https://opentelemetry.io. Get involved with OpenTelemetry development at https://github.com/open-telemetry/community!
Discussion Items - Azure Outage (https://twitter.com/AzureSupport/status/1124086111050051584?s=20) Facebook Outage (https://www.theverge.com/2019/4/14/18310069/facebook-instagram-whatsapp-down-outage-issues) Firefox Breaks Addons (https://blog.mozilla.org/addons/2019/05/04/update-regarding-add-ons-in-firefox/) Distributed Tracing, From Zero To One (https://thmsmlr.com/distributed-tracing-zero-to-one/) Follow Thomas on Twitter at @thmsmlr
Austin and Isobel are joined by LightStep's Stephanie Baum to discuss post-mortems, the sorry state of conference wifi, and how the plot of LOST has a terrifying amount in common with modern hotel reservation systems. Links to articles discussed - SquareSpace Outage Postmortem (https://status.squarespace.com/incidents/w8tvp6s2x5fr) Post-Mortem GitHub Repo (http://danluu.com/postmortem-lessons/ and https://github.com/danluu/post-mortems) iOS Budapest Conference Postmortem (https://medium.com/@nfgabor/story-of-a-failed-conference-ios-conf-budapest-postmortem-54c3b0199f7d) Have your own on-call story you'd like to share? Tweet us @oncallmemaybe or email stories@oncallmemaybe.com
Austin and Isobel discuss the absolute state of things around DevOps, ML, and On-Call. Austin sits down with Ben Sigelman (@el_bhs) to talk about monitoring at Google. Finally, our very own on-call stories! Links to articles discussed - InfoQ Trends Graph (https://www.infoq.com/articles/devops-cloud-trends-2019) Forty percent of ‘AI startups’ in Europe don’t actually use AI (https://www.theverge.com/2019/3/5/18251326/ai-startups-europe-fake-40-percent-mmc-report) NHS uses 10% of the world's pagers (https://www.theguardian.com/society/2017/sep/09/old-technology-nhs-uses-10-of-worlds-pagers-at-annual-cost-of-66m) Experiments, growth engineering, and exposing company secrets through your API (https://blog.jonlu.ca/posts/experiments-and-growth-hacking) Have your own on-call stories to share? Tweet us @oncallmemaybe or email us - stories@oncallmemaybe.com
A brief introduction to the podcast and what you can expect to hear from us in the future.