The brutal truth about digital performance engineering and operations. Andreas (aka Andi) Grabner and Brian Wilson are veterans of the digital performance world. Combined they have seen too many applications not scaling and performing up to expectations. With more rapid deployment models made possi…
Scientific research is the foundation of many innovative solutions in any field. Did you know that Dynatrace runs its own Research Lab on the campus of the Johannes Kepler University (JKU) in Linz, Austria - just 2 kilometers away from our global engineering headquarters? What started in 2020 has grown to 20 full-time researchers and many more students who do research on topics such as GenAI, Agentic AI, Log Analytics, Processing of Large Data Sets, Sampling Strategies, Cloud Native Security or Memory and Storage Optimizations. Tune in and hear from Otmar and Martin how they are researching the N+2 generation of Observability and AI, how they are contributing to open source projects such as OpenTelemetry, and what their predictions are for when AI finally takes control of us humans! To learn more about their work, check out these links: Martin's LinkedIn: https://www.linkedin.com/in/mflechl/ Otmar's LinkedIn: https://www.linkedin.com/in/otmar-ertl/ Dynatrace Research Lab: https://careers.dynatrace.com/locations/linz/#__researchLab
As a leader who wants to optimize an organization, you are bound to fail if you isolate social (culture and people) and technical (tools and process) changes. When we ask Lesley Cordero, Staff Engineer at The New York Times, how to solve this dilemma, she answers: "Platform Engineering, it can drive organizational sustainability by practicing sociotechnical principles that provide a community driven support system for application developers using our standardized shared platform architecture". Tune in to our latest episode and learn more about the importance of leadership to continuously keep up and balance the tension between "Developers" and "Operations", between "End User Experience" and "Developer Experience" and ultimately between "Culture and People" and "Tools and Processes". Links we discussed: Lesley's LinkedIn: https://www.linkedin.com/in/lesleycordero/ GOTO Conference Talk: https://www.youtube.com/watch?v=Jx-XrUONJ-o QCon 2025 Talk Details: https://qconlondon.com/presentation/apr2025/platform-engineering-practice-sociotechnical-excellence DevOpsCon 2024 Talk Details: https://devopscon.io/business-company-culture/platform-engineering-devops/
Do you plan for incidents? Do you have a time / cost budget for them in your sprint or quarterly planning? Do you have engineers who are "interruptible"? We discussed those and more questions with Lisa Karlin Curtis, Founding Engineer at incident.io, who teaches us why we need to think differently about dealing with incidents! In our discussion we learn why modern incident management embraces more incidents that are publicly shared within an organization to foster learning. We learn how to train more people to become incident responders, how to triage and categorize incidents, how to better plan for them and how to best report on them. We also touch on AI - and how AI-generated code will eventually result in more incidents, which we should use as an opportunity to learn and improve our engineering process. P.S.: This was our 10th-anniversary podcast episode!! Here are the links we discussed in the podcast: Lisa's LinkedIn: https://www.linkedin.com/in/lisa-karlin-curtis-a4563920/ Her talk at ELC Prague: https://docs.google.com/presentation/d/18536WBHBcPEppEeXXP7o5UQOX2XfWoGmfds2CHegHq4/edit?slide=id.g3434e0cba65_0_0#slide=id.g3434e0cba65_0_0 Incident Playbook: https://incident.io/guide
MCP (Model Context Protocol) is an open source standard for connecting AI assistants to the systems where data lives. But you probably already knew that if you have followed the recent hype around this topic after Anthropic made their announcement at the end of 2024. To learn why MCPs are not magic, but enable "magic" new use cases that make engineers more efficient, we invited Dana Harrison, Staff Site Reliability Engineer at Telus. Dana goes into the use cases he and his team have been testing out over the past months to increase developer efficiency. In our conversation we also talk about the difference between local and remote MCPs, the importance of keeping resilience in mind as MCPs connect to many different API backends, and how we can and should observe the interactions with MCPs. Links we discussed: Anthropic Blog: https://www.anthropic.com/news/model-context-protocol Dana's LinkedIn: https://www.linkedin.com/in/danaharrisonsre/overlay/about-this-profile/
So you think Distributed Tracing is the new thing? Well - it's not! But it's never been as exciting as today! In this episode we combine 50 years of Distributed Tracing experience across our guests and hosts. We invited Christoph Neumueller and Thomas Rothschaedl, who have seen the early days of agent-based instrumentation, how global standards like the W3C Trace Context allowed tracing to connect large enterprise systems and how OpenTelemetry is commoditizing data collection across all tech stacks. Tune in and learn about the difference between spans and traces, why collecting the data is only part of the story, how to handle the challenge of dealing with too much data and how traces relate and connect to logs, metrics and events. Links we discussed: YouTube with Christoph: LINK WILL FOLLOW ONCE VIDEO IS POSTED Christoph's LinkedIn: https://www.linkedin.com/in/christophneumueller/ Thomas's LinkedIn: https://www.linkedin.com/in/rothschaedl/
In the ever-changing IT world, it's hard to create content that stays relevant for long. One of the objectives of "Platform Engineering for Architects: Crafting Modern Platforms as a Product" was to stay timeless by providing practical examples of use cases that are not necessarily tied to current technology trends. The book focuses on the importance of building a platform with a purpose, making the impact measurable and making sure the platform continuously evolves by including the end users (the engineering teams) in its evolution. Tune in to this episode and hear from Max Körbächer (Founder of Liquid Reply), Hilliary Lipsig (Senior Principal SRE at Red Hat) and Andi Grabner (Co-Host of PurePerformance) on what made them write a book on Platform Engineering and get some personal insights into what gets the authors excited about their respective topics. If you have a chance, meet Max, Hilliary and Andi at KubeCon in London. They will present at Platform Engineering Day and will also do a book signing at KubeCrawl! Links we discussed: Book on Amazon: https://www.amazon.com/Platform-Engineering-Architects-Crafting-platforms-ebook/dp/B0DH5DJFTH Platform Engineering Day Session: https://colocatedeventseu2025.sched.com/event/1u5mX/platform-engineering-for-architects-crafting-platforms-as-a-product-max-korbacher-liquid-reply-hilliary-lipsig-red-hat Hilliary Lipsig: https://www.linkedin.com/in/hilliary-lipsig-a5935245/ Max Körbächer: https://www.linkedin.com/in/maxkoerbaecher/ Andi Grabner: https://www.linkedin.com/in/grabnerandi/
One petabyte is the equivalent of 11,000 4K movies. And CERN's Large Hadron Collider (LHC) generates this every single second. Only a fraction of this data (~1 GB/s) is stored and analyzed using a multicluster batch job dispatcher with Kueue running on Kubernetes. In this episode we have Ricardo Rocha, Platform Engineering Lead at CERN and CNCF Advocate, explaining why after 20 years at CERN he is still excited about the work he and his colleagues are doing. To kick things off we learn about the impact that the CNCF has on the scientific community and how to best balance an implementation of that scale between "ease of use" and "optimized for throughput". Tune in and learn about custom hardware being built 20 years ago and how the advent of the latest chip generation has impacted the evolution of data scientists around the globe. Links we discussed: Ricardo's LinkedIn: https://www.linkedin.com/in/ricardo-rocha-739aa718/ KubeCon SLC Keynote: https://www.youtube.com/watch?v=xMmskWIlktA&list=PLj6h78yzYM2Pw4mRw4S-1p_xLARMqPkA7&index=5 Kueue CNCF Project: https://kubernetes.io/blog/2022/10/04/introducing-kueue/
The word "Compliance" reminds many of mandatory training or audits. Two things not everyone gets excited about! Tune in and meet Michiel de Lepper, who has spent most of his career in Security and Compliance. He gives us a different perspective on the importance of compliance, why it exists, how it intertwines with security and threat detection, what it has to do with security posture management and why he thinks it's one of the most exciting things in IT! Links we discussed: Michiel's LinkedIn: https://www.linkedin.com/in/madelepper/ Blog posts on security and compliance: https://www.dynatrace.com/news/blog/dynatrace-for-executives-security-compliance/ https://www.dynatrace.com/news/blog/manage-compliance-and-resilience-at-scale-with-dynatrace/ https://www.dynatrace.com/news/blog/dynatrace-kspm-transforming-kubernetes-security-and-compliance/
Feature Flagging - some may call them "glorified if-statements" - has been a development practice for decades. But have we reached a stage where organizations are doing "Feature Flag-Driven Development"? After all, it took years to establish a test-driven development culture despite having great tools and frameworks available! To learn more we invited Ben Rometsch, Co-Founder of Flagsmith, to chat about the history, state and future of Feature Flagging. He gives us an update on where the market is heading, how the CNCF project OpenFeature and its community are driving best practices, what the role of AI might be and what he thinks might be next! A couple of links we discussed during the episode: Ben on LinkedIn: https://www.linkedin.com/in/benrometsch/ YouTube Video on Observability & Feature Flagging: https://www.youtube.com/watch?v=VZakh1_oEL8 OpenFeature: https://openfeature.dev/
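For readers new to the idea, here is a minimal sketch of the "glorified if-statement" with a percentage rollout added. Everything in it is a hypothetical illustration (the `FLAGS` dictionary, `is_enabled` and `checkout` are made-up names), and it is not the API of Flagsmith or OpenFeature.

```python
import hashlib

# Hypothetical in-memory flag store; a real setup would load this from a flag service.
FLAGS = {"new_checkout": {"enabled": True, "rollout_percent": 25}}


def is_enabled(flag_name, user_id, flags=FLAGS):
    """Return True if the flag is on for this user.

    The user is hashed into a stable bucket from 0 to 99, so the same
    user always gets the same answer for the same flag.
    """
    flag = flags.get(flag_name)
    if flag is None or not flag["enabled"]:
        return False
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < flag["rollout_percent"]


def checkout(user_id):
    # The "glorified if-statement": the code path is chosen at runtime.
    if is_enabled("new_checkout", user_id):
        return "new checkout flow"
    return "old checkout flow"
```

Hashing the user into a fixed bucket is one common design choice because it keeps a user's experience consistent between requests without storing per-user state.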
To predict the future, it's important to know the past. And that is true for Bernd Greifeneder, Founder and CTO of Dynatrace, who has been driving innovation in observability and security since he founded Dynatrace 20 years ago! Bernd agreed to sit down, look behind the covers and answer the open questions that people posted on his LinkedIn in response to his recent observability prediction blog. Tune in and learn about Bernd's thoughts on the evolution from reactive to preventive operations, who is behind the convergence of observability & security, why observability can help those who have serious intentions for sustainability and how observability becomes mandatory and indispensable for AI-driven services. We mentioned a lot of links in today's session. Here they are: Our podcast from 9 years ago: https://www.spreaker.com/episode/015-leading-the-apm-market-from-enterprise-into-cloud-native--9607734 Bernd's LinkedIn Post: https://www.linkedin.com/feed/update/urn:li:activity:7275101213237354497/ Predictions Blog: https://www.dynatrace.com/news/blog/observability-predictions-for-2025/ K8s Predictive Scaling Lab: https://github.com/Dynatrace/obslab-predictive-kubernetes-scaling Security Video: https://www.youtube.com/watch?v=ICUwRy4JFTk Carbon Impact App: https://www.youtube.com/watch?v=8Px0BB1U1yk AI & LLM Observability Video: https://www.youtube.com/watch?v=eW2KuWFeZyY
eBay, Yahoo, Netflix and then 10+ years at Uber. In this episode we sit down with Vishnu Acharya, Head of Network Infrastructure EMEA and Platform Engineering at Uber. Vishnu shares how Uber has scaled over the years to about 4,000 engineers and how his team makes sure that infrastructure and platform engineering scales with the growing company and the growing demand on their digital services. Tune in and learn how Vishnu thinks about SLOs across all layers of the stack, how they manage to get better insights with their cloud providers and why it's important to have an end-to-end understanding of the most critical end user journeys. Links we discussed: Conference talk at Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/vishnu-acharya Vishnu's LinkedIn Page: https://www.linkedin.com/in/vishnuacharya/ Uber Engineering Blog: https://www.uber.com/blog/engineering/
For the past 10 years Anton has been working at Booking.com - one of the leading digital travel companies, based out of Amsterdam. The journey that started as a System Administrator has led Anton to be an Engineering Manager for Site Reliability, where over the past 3 years he led the rollout and adoption of OpenTelemetry as the standard for getting observability into new cloud native deployments. Tune in and learn how Anton saw R&D grow from 300 to 2,000, why they replaced their home-grown Perl-based Observability Framework with OpenTelemetry, how they tackle adoption challenges and how they extend and contribute back to the open source community. Links we discussed: Anton's LinkedIn Profile: https://www.linkedin.com/in/antontimofieiev/ Observability & SRE Summit: https://www.iqpc.com/events-observability-sre-summit/speakers/anton-timofieiev OpenTelemetry: https://opentelemetry.io/
Most services are moving to SaaS - whether it's email, collaboration, customer relations, or finance. But not everyone can go to SaaS - or at least that's the initial reaction when navigating certain industries' rules and regulations. Milan Steskal - who worked in healthcare for many years - is now helping organizations ask the right questions and find the best solutions as they evaluate their options to move their observability data to SaaS. Tune in and learn about the questions to ask vendors and your internal security, privacy, and compliance teams. Milan also walks us through the capabilities SaaS vendors such as Dynatrace have put in place to protect data sent to the cloud so that it stays safe and only accessible to those needing access. Links discussed today: Milan's LinkedIn Page: https://www.linkedin.com/in/milansteskal/ Dynatrace Trust Center: https://www.dynatrace.com/company/trust-center/ Blogs on Trust: https://www.dynatrace.com/news/tag/trust-center/
Andreas Taranetz is a software engineer and lecturer at the University of Vienna. He creates a lot of educational content around Web Performance Optimization. For the past seven years, he has also operated Wahlkabine, Austria's top website for matching one's political views with the parties that are up for election. This episode was an amazing flashback - reminding us of the time when Steve Souders - the "godfather" of Web Performance Optimization - educated web developers about optimizing CSS, JavaScript, and server-side roundtrips. Tune in and learn why Web Performance is still such an important topic, how it relates to sustainability, why you should cache on every layer, and what the Static Site Paradox really is! Links we discussed in the episode: Andreas on LinkedIn: https://www.linkedin.com/in/andreas-taranetz/ Personal Website: https://andreas.taranetz.com/ We Are Developers Talk: https://www.youtube.com/live/KRemC82gsBk Wahlkabine: https://wahlkabine.at/ Steve Souders: https://stevesouders.com/
Authentication (validating who you claim to be) and Authorization (enforcing what you are allowed to do) are critical in modern software development. While authentication seems to be a solved problem, modern software development faces many challenges with secure, fast, and resilient authorization mechanisms. To learn more about those challenges, we invited Alex Olivier, Co-Founder and CPO at Cerbos, an Open Source Scalable Authorization Solution. Alex shared insights on attribute-based vs. role-based access control, the difference between stateful and stateless authorization implementations, why Broken Access Control is in the OWASP Top 10 Security Vulnerabilities, and how to observe the authorization solution for performance, security, and auditing purposes. Links we discussed during the episode: Alex's LinkedIn: https://www.linkedin.com/in/alexolivier/ Cerbos on GitHub: https://github.com/cerbos/cerbos OWASP Broken Access Control: https://owasp.org/www-community/Broken_Access_Control
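To make the role-based vs. attribute-based distinction concrete, here is a small sketch. All names (`Principal`, `Resource`, `ROLE_PERMISSIONS`, `rbac_allows`, `abac_allows`) and the example policy are assumptions invented for illustration; this is not how Cerbos defines or evaluates policies.

```python
from dataclasses import dataclass, field


@dataclass
class Principal:
    id: str
    roles: set = field(default_factory=set)
    attributes: dict = field(default_factory=dict)


@dataclass
class Resource:
    kind: str
    attributes: dict = field(default_factory=dict)


# Role-based: permissions hang off roles and ignore the individual resource.
ROLE_PERMISSIONS = {
    "editor": {("document", "edit"), ("document", "view")},
    "viewer": {("document", "view")},
}


def rbac_allows(principal, resource, action):
    return any(
        (resource.kind, action) in ROLE_PERMISSIONS.get(role, set())
        for role in principal.roles
    )


def abac_allows(principal, resource, action):
    # Attribute-based (hypothetical policy): a document may be edited by its
    # owner or by anyone whose department matches the document's department.
    if resource.kind != "document" or action != "edit":
        return False
    is_owner = resource.attributes.get("owner") == principal.id
    department = principal.attributes.get("department")
    same_department = (
        department is not None
        and department == resource.attributes.get("department")
    )
    return is_owner or same_department


alice = Principal("alice", {"editor"}, {"department": "finance"})
hr_doc = Resource("document", {"department": "hr", "owner": "bob"})
print(rbac_allows(alice, hr_doc, "edit"), abac_allows(alice, hr_doc, "edit"))
```

In this sketch the role check lets any editor edit any document, while the attribute check also looks at who owns the document and which department it belongs to, which is the kind of finer-grained decision the episode contrasts with plain roles.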
"Open Source is the Best Thing that happened to IT!" Powerful words from Marcio Lena, who has been using and contributing back to open source for the past 20+ years. Besides being an avid advocate for open source, Marcio also knows the concerns of large enterprises when picking open source projects. Tune in and follow our discussion about how to identify a healthy open-source project, how to balance between vendor and community lock-in, the power of open standards such as OpenTelemetry, open source business models, and why contributing to open source is not limited to code but includes documentation, education and advocacy as well! Links we discussed: Marcio's LinkedIn Page: https://www.linkedin.com/in/marcio-lena/ CNCF DevStats: https://devstats.cncf.io/ Linux Foundation Events: https://events.linuxfoundation.org/ CNCF Ambassadors: https://www.cncf.io/people/ambassadors/
DORA - the EU's Digital Operational Resilience Act - will take effect in January of 2025 and is currently top of mind for IT Leaders across all financial service institutions that operate in the European Union. But what is DORA really? Why is this important? How can institutions meet the DORA requirements? What is the role of observability, automation and AI in all of this? To answer all those and more questions we invited Kay Young, Sr Principal Product Manager at Dynatrace, who has been working with organizations around the globe that have been tasked to implement regulations such as DORA, GDPR, FedRAMP or others. In our conversation we also touch on third-party risk management as well as resiliency testing and incident reporting. Resources we discussed: Kay's LinkedIn Profile: https://www.linkedin.com/in/karlien-young-4a156730/ What is DORA blog: https://www.dynatrace.com/news/blog/what-is-dora/ Taming DORA compliance: https://www.dynatrace.com/news/blog/taming-dora-compliance-with-ai-observability-and-security/ Blog on Dynatrace's DORA compliance journey: https://www.dynatrace.com/news/blog/the-dynatrace-journey-toward-dora-compliance/ Beyond DORA compliance: https://www.dynatrace.com/news/blog/dora-how-dynatrace-helps-the-financial-sector-stay-resilient/
NAIS (pronounced like NICE) is a team central application platform that provides DevOps teams with the tools they need to build, test, deploy, run and observe applications. In this episode Hans Kristian Flaatten, Platform Engineer at NAV, walks us through the WHYs, HOWs and challenges of building modern platforms on Kubernetes. Tune in and hear WHY they defined their own abstraction layer for applications, HOW developers benefit from that platform and WHY they developed their own developer portal instead of going with other popular available choices. Links we discussed: Hans Kristian's LinkedIn: https://www.linkedin.com/in/hansflaatten/ NAIS Documentation: https://docs.nais.io/
"We will overwhelm developers if we give them the same specialized observability, security or deployment tools that are used by their platform engineering, operations, SREs or security teams!" - says Viktor Farcic, Developer Advocate at Upbound and host of The DevOps Toolkit YouTube channel. Tune in and hear us discuss making observability more easily accessible for developers, what Viktor doesn't like about Kubernetes and how Crossplane - the cloud native control plane framework - can be the gateway to real product-oriented platform engineering! Here are the links we discussed during this episode: Viktor on LinkedIn: https://www.linkedin.com/in/viktorfarcic/ DevOps Toolkit: https://www.youtube.com/@DevOpsToolkit Crossplane: https://www.crossplane.io/
Hans Kristian is a Platform Engineer for NAV's Kubernetes Platform Nais, hosting Norway's welfare services. With 10 years on Kubernetes, 2,000 apps and 1,000 developers across more than 100 teams, there was a need to make OpenTelemetry adoption as easy as possible. Tune in as we hear from Hans Kristian, who is also a CNCF Ambassador and hosts Cloud Native Day Bergen, why OpenTelemetry is chosen by the public sector, why it took much longer to adopt, which challenges they had to scale the observability backend and how they are tackling the "noisy data problem". Links we discussed in the episode: Follow Hans Kristian on LinkedIn: https://www.linkedin.com/in/hansflaatten/ From 0 to 100 OTel Blog: https://nais.io/blog/posts/otel-from-0-to-100/?foo=bar Cloud Native Day Bergen: https://2024.cloudnativebergen.dev/ Public Money, Public Code. How we open source everything we do! (https://m.youtube.com/watch?v=4v05Huy2mlw&pp=ygUkT3BlbiBzb3VyY2Ugb3BlbiBnb3Zlcm5tZW50IGZsYWF0dGVu) State of Platform Engineering in Norway (https://m.youtube.com/watch?v=3WFZhETlS9s&pp=ygUYc3RhdGUgb2YgcGxhdGZvcm0gbm9yd2F5)
Has one of the decision makers in your organization decided that you have to go "all in on technology X" because they saw a great presentation at a conference or got a great sales pitch from a vendor? If that is the case then this episode is for you and you should forward it to those decision makers. Sebastian Vietz, Director of Reliability Engineering and Host of the Reliability Enablers Podcast, shares his thoughts on considerations when picking a technology like Serverless. We discuss the importance of knowing limits, best fit architectural patterns and things that should influence your technology decisions! Being aware of cold starts, a 20,000 concurrent request limit or 512 MB being an ideal size for Lambda are just some of the things we can all learn from Sebastian. Additional links we discussed: Sebastian's LinkedIn: https://www.linkedin.com/in/sebastianvietz/ Reliability Podcast: https://podnews.net/podcast/ibe8k More things on serverless: https://serverlessland.com/
When your code runs on more than 6 million systems - many of them business-critical - then this is really exciting news for Marco and Wolfgang, Dynatrace OneAgent Java Team members. Their code powers auto-instrumentation and collection of all observability signals of Java-based applications running on every possible stack: containers in k8s, serverless, VMs, your workstation or even the mainframe. Tune in as we sit down with Marco and Wolfgang to learn what it means to continuously innovate on agent-based instrumentation with 160+ other engineers across the globe who also focus on OneAgent. They share insights on how they develop their observability code, how they continuously test across all supported environments, what the processes at Dynatrace look like to avoid situations like the recent CrowdStrike outage and how they integrate and collaborate with other communities and tools such as OpenTelemetry! Things we discussed during the episode: Dynatrace OneAgent: https://www.dynatrace.com/platform/oneagent/ Dynatrace for Java: https://www.dynatrace.com/technologies/java-monitoring/ OpenTelemetry and Dynatrace: https://docs.dynatrace.com/docs/extend-dynatrace/opentelemetry Jobs at Dynatrace: https://careers.dynatrace.com/
When thousands of systems show a blue screen - which ones do you fix first to quickly bring up your most critical systems? For that you need to know which systems are impacted, which mission-critical applications run on them, and which dependent systems are also impacted by something like the recent CrowdStrike incident! We have invited Josh Wood, Principal Solutions Engineer at Dynatrace, who was one of the first responders helping organizations leverage observability data to identify which systems to fix first to bring critical apps such as ATMs, Self-Service Terminals, POS (Point of Sale), ... back up again quickly. In this special episode Josh walks us through the technical details of the CrowdStrike BSOD (Blue Screen of Death), what caused it, how to leverage observability to get a prioritized list of systems to fix first and what organizations can do to prevent software-impacting issues in the future. Here are the links we discussed in the episode: Josh on LinkedIn: https://www.linkedin.com/in/joshuadwood/ Josh's blog on CrowdStrike BSOD: https://www.dynatrace.com/news/blog/crowdstrike-bsod-quickly-find-machines-impacted-by-the-crowdstrike-issue/ CrowdStrike Incident Takeaway Blog: https://www.dynatrace.com/news/blog/crowdstrike-incident-revisiting-vendor-quality-control/
WebAssembly runs in every browser, provides secure and fast code execution from any language, runs across multiple platforms and has a very small binary footprint. It's adopted by several of the big web-based SaaS solutions we use on a daily basis. But where did WebAssembly come from? What problems does it try to solve? Has it reached critical adoption? And how about observing code that gets executed in browsers, servers or embedded devices? To answer all those questions we invited Matt Butcher, CEO at Fermyon, who explains the history, current implementation status, limitations and opportunities that WebAssembly provides. Further links we discussed: LinkedIn Profile: https://www.linkedin.com/in/mattbutcher/ Fermyon Dev Website: https://developer.fermyon.com/ The New Stack Blog with Matt: https://thenewstack.io/webassembly-and-kubernetes-go-better-together-matt-butcher/
"Because I don't want software to go down every single day in my next gig!" is what motivates Ash Patel, Reliability Advocate and host of the SREpath podcast, to talk about and educate IT professionals on the importance of building and operating reliable systems. For 15 years Ash was Director of Operations at a private health service organization. He has experienced patients not getting the treatment they expected due to unreliable software he was responsible for. In our conversation Ash talks about how he had to close the knowledge gap on technology but also solve the problem by having engineers understand the pain and the requirements of their end users. One way to educate more engineers is through his podcast called SREpath, where Observability has become a hot topic recently. Tune in, hear about the memorable stories from his guests from CapitalOne, IKEA and SquaredUp, and let's move towards a world where software is reliable by default. Links as discussed today: Ash on LinkedIn: https://www.linkedin.com/in/ash-patel-srepath/ SREpath Podcast: https://www.srepath.com/podcast/ Clearing Delusions in Observability: https://read.srepath.com/p/30-clearing-delusions-in-observability-2af Boosting your observability data's usability: https://read.srepath.com/p/35-boosting-your-observability-datas-3f4 How to Enable Observability for Success: https://read.srepath.com/p/40-how-to-enable-observability-for
"Meet your users where they are!" - For Platform Engineering Teams that means understanding the current way your engineers work, understanding their pain, and providing a solution that doesn't force them to change their behavior but provides a 10x efficiency improvement. That's not easy to achieve but is what we discussed with Abby Bangser in our latest episode. Abby is a Team Topologies Advocate, has spent years at Thoughtworks helping organizations transform through Delivery Platforms and is now a Lead at the CNCF Platform Working Group. Tune in and hear our discussions on why Platform Engineering is nothing new, how to keep Platform Engineering Teams from becoming your next bottleneck and silo, why Platforms need to have more than one interface and why the purpose of Platform Engineering should be to bring good Developer Experience to all engineers. Here are all the links we discussed during this episode: Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/ CNCF Platform Working Group: https://tag-app-delivery.cncf.io/wgs/platforms/ KubeCon 2024 Talk: https://colocatedeventseu2024.sched.com/event/1YFdf/sometimes-lipstick-is-exactly-what-a-pig-needs-abby-bangser-syntasso-whitney-lee-vmware GitHub Issue for Questionnaire: https://github.com/cncf/tag-app-delivery/issues/635 Kratix: https://www.kratix.io/ Abby's LinkedIn: https://www.linkedin.com/in/abbybangser/ Abby's Events: https://www.paintedwavelimited.com/events
Requesting more CPU for your database used to take 6 months of planning 20 years ago. Now it takes the execution of a Terraform script. What has stayed the same all those years is Almudena Vivanco's passion for performance engineering to keep systems optimized, ensuring that systems are available, scalable and resilient even during spike events such as the upcoming Euro Cup or any holiday specials. Tune in and hear from Almudena, who is currently working for SCRM Lidl, on how moving to the cloud gave new justification to performance engineering. She explains the importance of connecting business with service level objectives and gives insights on how Lidl makes sure to sell 50,000 pieces of pork without breaking the cloud bank. Here are the additional links we discussed: Slides from Barcelona Meetup: https://docs.google.com/presentation/d/1h83V4gUyqAmIWeAAtKb4BcRvuJV-XirLk-9Xq077nbw Video from TestCon: https://www.youtube.com/watch?v=rIP_G-YBy04 LinkedIn: https://www.linkedin.com/in/almudenavivanco/
Making observability available to everyone! This noble goal needs superhero powers in an IT world where there is so much chatter and confusion about what observability is, how to sell its value beyond being a glorified troubleshooting tool and how OpenTelemetry will disrupt the landscape. In our latest episode we have Rainer Schuppe, Observability Veteran (20+ years in the space), who has worked for the majority of the observability vendors. He is sharing his observability expertise through workshops in his home town of Mallorca, teaching organizations everything from basic to strategic observability implementations. Tune in and learn about the typical adoption and maturity path of observability within enterprises: from fixing a problem at hand, to justifying the cost of keeping it, to enabling companies to become information-driven digital organizations! Also check out his OpenTelemetry journey in his blog post series. Here are the links we discussed today: Observability Heroes Website: https://observability-heroes.com/ Observability Heroes Community: https://observability.mn.co/ Cloud Native Mallorca Meetup: https://www.meetup.com/cloud-native-mallorca/ OpenTelemetry: https://opentelemetry.io/ Rainer on LinkedIn: https://www.linkedin.com/in/rainerschuppe/
eBPF is a kernel technology enabling high-performance, low overhead tools for networking, security and observability. In simpler terms: eBPF makes the kernel programmable! Tune in to this episode whether you have never heard of eBPF, are using eBPF-based tools such as bcc, Cilium, Falco, Tetragon, Inspector Gadget ... or are developing your own eBPF programs! Liz Rice, Chief Open Source Officer at Isovalent, kicks this episode off with a brief introduction to eBPF, explains how it works, which use cases it has enabled and why eBPF can truly give you super powers! In our conversation we dive deeper into the performance aspects of eBPF: how and why tools like Cilium outperform classical network load balancers, how performance engineers can use it and how the kernel internally handles eBPF executions. We discussed a lot of follow-up material - here are all the relevant links: Liz's slide deck on "Unleashing the kernel with eBPF": https://speakerdeck.com/lizrice/unleashing-the-kernel-with-ebpf eBPF Documentary on YouTube: https://www.youtube.com/watch?v=Wb_vD3XZYOA Learning eBPF GitHub repo accompanying her book: https://github.com/lizrice/learning-ebpf eBPF website: https://ebpf.io Liz on LinkedIn: https://www.linkedin.com/in/lizrice/
Use Things you Understand! Learn the fundamentals to understand the layers of abstraction! And remember that we don't live in a world with unlimited resources! This is the advice from our recent conversation with Ernst Ambichl, Chief Product Architect at Dynatrace, who started his performance career in the late 80s building the first load testing tools for databases, which later became one of the most successful performance engineering tools in the market. Tune in and learn how Ernst has evolved from being a performance engineer to becoming an advocate for "Designing and Architecting for Performance". Ernst explains how important good upfront analysis of performance requirements and characteristics of the underlying infrastructure is, how to define baselines and how to constantly evaluate your changes against your goals. On a personal note: I want to say THANK YOU Ernst for being one of my personal mentors over the past 20+ years. You inspired me with your passion about performance and building resilient systems!
SREs (Site Reliability Engineers) have varying roles across different organizations: from codifying your infrastructure, handling high-priority incidents, automating resiliency and ensuring proper observability to defining SLOs or getting rid of alert fatigue. What an SRE team must not be is a SWAT team - or - as Dana Harrison, Staff SRE at Telus puts it: "You don't want to be the fire brigade along the DevOps Infinity Loop". In his years of experience as an SRE Dana also used to run one-week boot camps for developers to educate them on making apps observable, proper logging, resiliency architecture patterns and defining good SLIs & SLOs. He talked about the 3 things that are the foundation of a good SRE: understand the app, understand the current state and make sure you know when your systems are down before your customers tell you so! If you are interested in seeing Dana and his colleagues from Telus talk about their observability and SRE journey then check out the On-Demand session from Dynatrace Perform 2024: https://www.dynatrace.com/perform/on-demand/perform-2024/?session=simplifying-observability-automations-and-insights-with-dynatrace#sessions
Whether it's GitOps, DevOps, Platform Engineering, Observability as a Service or other terms: we all have our definitions, but rarely do we have a consensus on what those terms really mean! To get some clarity we invited Roberth Strand, CNCF Ambassador and Azure MVP, who has been passionately advocating for GitOps as it was initially defined and explained by Alexis Richardson of Weaveworks in his blog post "What is GitOps Really"! Tune in and learn about Desired State Management, Continuous Pull vs Pushing from Pipelines, how Progressive Delivery or Auto-Scaling fits into declaring everything in Git, what OpenGitOps is and why this podcast will help you get your GitOps certification (coming soon). As we had a lot to talk about we also touched on Platform Engineering and various other topics. Here are all the links we discussed: Alexis' GitOps Blog Post: https://medium.com/weaveworks/what-is-gitops-really-e77329f23416 OpenGitOps: https://opengitops.dev/ Flux Image Reflector: https://fluxcd.io/flux/components/image/ CNCF White Paper on Platform Engineering: https://tag-app-delivery.cncf.io/whitepapers/platforms/ Platform Engineering Maturity Model: https://tag-app-delivery.cncf.io/whitepapers/platform-eng-maturity-model/ Platform Engineering Working Group as part of TAG App Delivery: https://tag-app-delivery.cncf.io/wgs/platforms/
Can you explain GitOps in simple terms? How does it fit into Continuous Integration (CI), Continuous Delivery and Continuous Deployment? And what are the considerations when rolling out GitOps in an enterprise? To get answers to those questions we sat down with Christian Hernandez, Head of Community at Akuity, who has a fabulous analogy to explain GitOps that I am sure many of us will "borrow" from him. Christian also explains the ecosystem he works in, such as ArgoCD and Kargo, as well as OpenGitOps, which aims to provide open-source standards and best practices for implementing GitOps. We closed the session with some advice around Application Dependency Management, External Secrets Operator and choosing the right Git Repo Structure. Here are some of the links we discussed: OpenGitOps: https://opengitops.dev/ ArgoCD: https://argoproj.github.io/cd/ Kargo: https://github.com/akuity/kargo ArgoCon: https://events.linuxfoundation.org/kubecon-cloudnativecon-north-america/co-located-events/argocon/ GitOpsCon: https://events.linuxfoundation.org/gitopscon-north-america/
While the mainframe is powering the world's most critical systems, the words "modern", "open source" or "generative AI" typically don't come to mind. So let's change this! To do that simply tune in to our latest episode where we have Jessielaine (Jelly) Punongbayan, Sr. Technical Support Engineer at Dynatrace, telling us why she is excited about the modern Mainframe and how it brought her from the Philippines via Singapore and Czech Republic to Austria. We learn about all the open-source projects and communities she is involved in, such as Open Mainframe or Zowe, that make it easy to connect the Mainframe with the modern tooling of today's development environments. Jelly shares her stories about the role of good observability, how it connects the distributed and the mainframe world and how it enables development teams to build more efficient systems. And what about AI? Well - you have to tune in and listen to the end! Here are the links discussed in the episode: Writing a COBOL program using VSCode: https://medium.com/modern-mainframe/beginners-guide-cobol-made-easy-introduction-ecf2f611ac76 Using CircleCI to perform automation in Mainframe: https://medium.com/modern-mainframe/beginners-guide-cobol-made-easy-leveraging-open-source-tools-eb4f8dcd7a98 Using OpenTelemetry to capture Mainframe Insights: https://medium.com/@jessielaine.punongbayan/re-imagining-mainframe-insights-through-open-source-tooling-79dd4c937114 Dynatrace support for Mainframe: https://www.dynatrace.com/technologies/mainframe-monitoring/
201 is the HTTP status code for Resource Created. It is also the number of PurePerformance Episodes (including this one) we have published over the past years. Who better to invite than the person who initially inspired us to launch PurePerformance: Mark Tomlinson, Performacologist and Director of Observability at FreedomPay. Tune in and listen to our thoughts on the current state of automation, a recap on IFTTT, and whether we believe that AIs such as CoPilot will not only make us more efficient in creating code and scripts but also lead to new ways of automation. We also give a heads-up (or rather a recap) of what Mark will be presenting at Perform 2024. To learn more about and from Mark follow him on the various social media channels: LinkedIn: https://www.linkedin.com/in/mtomlins/ Performacology: https://performacology.com/
Marcelo Amaral is a Researcher for Cloud System Optimization and Sustainability. With his background in performance engineering, where he optimized microservice workloads in containerized environments, making the leap towards analyzing and optimizing energy consumption was easy. Tune in to this episode and learn how Kepler, the CNCF project Marcelo is working on, provides metrics for workload energy consumption based on power models trained by the community. Marcelo goes into details about how Kepler works and also provides practical advice for any developer to keep energy consumption in mind when making architectural and coding decisions. To learn more about Kepler and today's episode check out: Marcelo on LinkedIn: https://www.linkedin.com/in/mcamaral/ CNCF Blogpost on Kepler: https://www.cncf.io/blog/2023/10/11/exploring-keplers-potentials-unveiling-cloud-application-power-consumption/ Kepler GitHub Repo: https://github.com/sustainable-computing-io/kepler
It's only been a year since ChatGPT was introduced. Since then we have seen LLMs (Large Language Models) and Generative AIs being integrated into everyday software applications. Developers have the hard choice of picking the right model for their use case to produce the quality of output their end users demand. Tune in to this session where we have Nir Gazit, CEO and Co-founder of Traceloop, educating us about how to observe and quantify the quality of LLMs. Besides performance and costs, engineers need to look into quality attributes such as accuracy, readability or grammatical correctness. Nir introduces us to OpenLLMetry - a set of Open Source extensions built on top of OpenTelemetry providing automated observability into the usage of LLMs so that developers can better understand how to optimize it. His advice to every developer is to start measuring the quality of your LLMs on Day 1 and continuously evaluate as you change your model, the prompt and the way you interact with your LLM stack! If you have more questions about LLM Observability check out the following links: OpenLLMetry GitHub Page: https://github.com/traceloop/openllmetry Traceloop Website: https://www.traceloop.com/ OpenLLMetry Documentation: https://traceloop.com/docs/openllmetry
After analyzing Distributed Traces for more than 15 years Brian and I thought that everyone in software engineering and operations must be satisfied with all the observability data we have available. But. Maybe Brian and I were wrong because we didn't fully understand all the use cases - especially those for developers who must fix code in production or need to quickly understand what somebody else's code is really doing without having the luxury of adding another log line and redeploying on the fly. To learn more about the observability requirements of developers we invited Liran Haimovitch, CTO at Rookout and now part of Dynatrace, who has spent the last 7 years solving the challenging problems that developers face day and night. Tune in and learn what non-breaking breakpoints are, how it is possible to "debug in production" without impacting running code and how we can make developers' lives easier even though we push so many things "to the left".
I was invited to speak at BankTechShow in Budapest, Hungary where the nation's IT leaders in the banking sector presented and discussed the future of banking - both in the cloud as well as what it means for the physical bank branches. I got a chance to sit down with Adam Gajdi, IT Solutions CoE Lead at K&H, who walked me through the process of their recent new mobile banking app launch. Adam highlighted the importance of observability for both business owners as well as developers. Furthermore, Adam enlightened me with the fact that Hungarian banks are mandated to conduct chaos tests to prove that their systems are resilient in case of data center outages. I was obviously also curious about how AI, LLMs and other technologies are adopted in their sector. Tune in to learn more.
Besides attending KubeCon 2023 NA, Andreas (Andi) Grabner, co-host of PurePerformance but guest today, has also travelled parts of the US to chat with the broader observability community on topics such as Platform Engineering, Observability, DevOps, Automation & Security. Tune in and get a quick recap of all the topics Andi has picked up on his recent trip.
Zero-Trust Architectures. Data-Flow Inventory. User Experience First! Those are key initiatives in the public sector to ensure that digital services delivered to citizens around the globe are not only working with a flawless user experience but are also safe from any bad actors trying to disrupt agencies at the local, state and federal levels. In this episode we invited Willie Hicks, Federal CTO at Dynatrace, to learn more about the state of observability and security with the government agencies Willie has been working with over the past decade. In our conversation we explore the differences between commercial and government when it comes to ROI or how they see competition as a driving motivator. To learn more about the public sector tune into the Tech Transformers podcast that Willie is co-hosting with his colleague Carolyn Ford.
4% of worldwide CO2 emissions come from IT and, like in all other industries, we have big potential to not only reduce the carbon footprint but also lower costs. Tune in to our episode where we have Mario-Leander Reimer, CTO at QAware GmbH, talk about his top 3 suggestions for Sustainable IT: making the right architectural choices, right-sizing your environments and shutting down environments that are not needed! Mario is also heavily involved in the CNCF and gives us an overview of projects to look into such as Kepler, kube-green, Karpenter or Carbon Aware Multi-Cluster Schedulers. Here are the links we discussed: Blue turns Green presentation: https://speakerdeck.com/lreimer/blue-turns-green-approaches-and-technologies-for-sustainable-k8s-clusters-number-kcdmunich?slide=5 Kepler Project: https://kepler.gl/ kube-green: https://kube-green.dev/ CNCF TAG Environmental Sustainability: https://github.com/cncf/tag-env-sustainability Sustainability Week: https://tag-env-sustainability.cncf.io/cloud-native-sustainability-week/
Martin Spier was one of six engineers taking care of all of Netflix Operations about 10 years ago. Back then performance and observability tools weren't as sophisticated and didn't scale to the needs of Netflix as some do today. FlameScope was one of the Open Source projects that evolved out of that period, visualizing Flame Graphs on a time-scaled heatmap to identify specific performance patterns that caused issues in their complex systems. Tune in to this episode and hear more performance and observability stories from Martin, about his early days in Brazil, his time at Expedia and Netflix and about his current role as VP of Engineering at PicPay - one of the hottest fintechs in Brazil. More links we discussed: Performance Summit talk about FlameCommander: https://www.youtube.com/watch?v=L58GrWcrD00 CMG Impact talk on Real User Monitoring at Netflix: https://www.cmg.org/2019/04/impact-2019-real-user-performance-monitoring-at-netflix-scale/ Learn more about Vector: https://netflixtechblog.com/extending-vector-with-ebpf-to-inspect-host-and-container-performance-5da3af4c584b Martin's GitHub: https://github.com/spiermar Connect with him on LinkedIn: https://www.linkedin.com/in/martinspier/
Africa is not only the second largest continent in the world - it's also at the top when it comes to the adoption of cloud native technologies. I was fortunate to spend a week in South Africa and had the chance to spend a lot of time with Kelvin Klein, Dynatrace Product Manager at Mediro ICT. After two observability events in Johannesburg and Cape Town and several meetings with local tech leaders I got to sit down with Kelvin and learn more about the status of Observability, Cloud Native and Security in South Africa.
I was fortunate to travel to South Africa and meet many tech leaders in Johannesburg and Cape Town to talk about Observability, Security, Automation, Platform Engineering, DevOps and FinOps. One of those leaders is Amit Chiba, Multi Product Specialist at Nedbank. I sat down with Amit to discuss his personal journey and his projects at Nedbank, one of the leading financial institutions in South Africa. Tune in and hear from Amit how self-service platform engineering helps them to scale observability, how they tackle cloud costs and why he thinks that the future of IT Ops is more Sleep!
Do you measure build times? On your shared CI as well as local builds on the developers' workstations? Do you measure how much time devs spend debugging code or trying to understand why tests or builds are all of a sudden failing? Are you treating your pre-production with the same respect as your production environments? Tune in and hear from Trisha Gee, Developer Champion at Gradle, who has helped development teams to reduce wait times, become more productive with their tools (gotta love that IDE of yours) and also understand the impact of their choices on other teams (when log lines wake up people at night). Trisha explains in detail what there is to know about DPE (Developer Productivity Engineering), how it fits into Platform Engineering, why adding more hardware is not always the best solution and why Flaky Tests are a passionate topic for Trisha. Here are the links to Trisha's social media, her books and everything else we discussed during the podcast: LinkedIn: https://www.linkedin.com/in/trishagee/ Trisha's Website: https://trishagee.com/ Trisha's Talk on DPE: https://trishagee.com/presentations/developer-productivity-engineering-whats-in-it-for-me/ Trisha's Books: https://trishagee.com/2023/07/31/summer-reading-2023/ Dave Farley on Continuous Delivery: https://www.youtube.com/channel/UCCfqyGl3nq_V0bo64CjZh8g
Only a few can claim they have successfully created a Pure-Serverless architecture and only those really understand the challenges of observing real event-driven architectures. Apostolis Apostolidis (also known as Toli) is one of those people and it's why we invited him back to discuss all the lessons learned from his time as Head of Engineering Practices at cinch. Tune in and learn about the evolution of Serverless observability and the challenges when observing API Gateways, Queues and Step Functions. Listen to Toli's advice on picking one observability vendor, doing your own custom instrumentation and making yourself familiar with the observability data from your managed service provider. Also go back to our previous episode to hear more about his Engineering Practices for Success and remember that the time to ask about cold starts is over.
Codifying Golden Paths that ideally don't need you to build a K8s Operator! This is what Practical Platform Engineering should look like! In our latest episode we learn from Mauricio (Salaboy) Salatino who has been contributing to open source for the past 12 years. Tune in and learn from his journey of designing and building platforms. He shares his opinion on Platform Engineering skillsets, how to design for self-service and how to pick the right tools out of the 160+ CNCF project options, and he shares some of his favorite tools (including Crossplane, VCluster, Argo, OpenFeature, Keptn ...) that should be part of a modern cloud native platform. Links discussed in this podcast: Salaboy on Twitter: https://twitter.com/salaboy Salaboy on LinkedIn: https://www.linkedin.com/in/salaboy/ Upcoming Book: https://www.salaboy.com/book/ Cloud-Native Snapshots: https://www.salaboy.com/cloud-native-snapshots/ Diagrid: https://www.diagrid.io/
Reducing the cognitive load by simplifying computing for every developer in an organization! That is one of the many definitions of Platform Engineering. But what is Platform Engineering for real? Just a new hype? What problem does it really solve? How does it link with DevOps and SRE? Are there any standards or reference architectures available? To get a new perspective on Platform Engineering we invited Saim Safdar, CNCF Ambassador and member of the CNCF TAG App Delivery Platform Working Group. Tune in and learn about the Platform Maturity Model, how to get involved to shape the field of Platform Engineering, which of the people Saim has interviewed are good to follow and much more ... Here are the links we discussed: CNCF Platforms White Paper: https://tag-app-delivery.cncf.io/whitepapers/platforms Maturity Model Working Document: https://docs.google.com/document/d/1bP8-LQ-d41eIdQB3IC2YsncDhawpFLggql2JxwtE0XI/edit Platform Working Group: https://tag-app-delivery.cncf.io/about/wg-platforms/ Cloud Native Podcast with Alexis Richardson: https://www.youtube.com/watch?v=p6D-NYkVp9E Patterns and Anti-Patterns: https://octopus.com/devops/platform-engineering/patterns-anti-patterns/ Saim on LinkedIn: https://www.linkedin.com/in/saim-safder/
Are you frustrated with your team's ability to troubleshoot issues in production despite their proficiency in pushing out new builds? The root of this problem may lie in the absence of Observability Driven Development. In our latest episode we are joined by Apostolis Apostolidis (also known as Toli) who - as Head of Engineering Practices at cinch - has spent the past years enabling teams to adopt the easiest path to value. He is passionate about DevOps and has a strong opinion on how to educate engineers on "Consciously Instrumenting Code for good Observability". Tune in to learn more about good engineering practices, building internal communities of practice, the benefits of traces over metrics and logs and why we need to start adding observability to our CVs and LinkedIn profiles. Here are all relevant links we discussed in this episode: Toli's Website: https://www.toli.io/ Toli's LinkedIn Profile: https://www.linkedin.com/in/apostolosapostolidis/ Toli on Twitter: https://twitter.com/apostolis09/ WTFisSRE Talk on DevOps Meets Service Delivery: https://www.youtube.com/watch?v=nLrx0BCMl0Y GOTO talk on EDA in Practice: https://www.youtube.com/watch?v=wM-dTroS0FA
Do you know why customers spend more money at a pub when ordering at a table vs ordering directly from the bartender? Do you want to know how to get SaaS vendors to send you their observability & telemetry data? Do you want to know the career path of an Infrastructure Analyst turned Digital Readiness Manager? Tune in to this PurePerformance episode where we sat down with Mark Forrester from Mitchell & Butlers, who answers all these questions and also draws the parallels to Observability. Because observability has come a long way, just as Mark has: from traditional infrastructure (CPU, Memory, Network) to APM (Service Response Time & Failure Rates), to Real User Behaviour and now End-2-End Business Process Analytics. Unlocking the potential of Digital Business Observability lets Mark optimize the end-2-end customer journey to make sure their customers always feel like they are taken care of when trying to order online food delivery, a meal or a drink at a restaurant. As you will learn, digital business observability goes beyond your own digital premises and needs to tap into the data of your 3rd party suppliers and SaaS vendors. To see more from Mark also check out his interview at Dynatrace Perform: https://www.youtube.com/watch?v=rGpduOrPxpU