Podcasts about sre

  • 571 PODCASTS
  • 1,775 EPISODES
  • 37m AVG DURATION
  • 1 DAILY NEW EPISODE
  • Jan 26, 2023 LATEST

POPULARITY (chart by year, 2015 to 2022)

Latest podcast episodes about sre

Software Engineering Radio - The Podcast for Professional Software Developers
Episode 548: Alex Hidalgo on Implementing Service Level Objectives

Jan 25, 2023 (48:30)


Alex Hidalgo, principal reliability advocate at Nobl9 and author of Implementing Service Level Objectives, joins SE Radio's Robert Blumen for a discussion of service-level objectives (SLOs) and error budgets. The conversation covers the meaning...

Ship It! DevOps, Infra, Cloud Native
Human scale deployments

Jan 20, 2023 (53:37), transcription available


Lars is big on Elixir. Think apps that scale really well, tend to be monolithic, and have one of the most mature deployment models: self-contained releases & built-in hot code reloading. In episode 7, Gerhard talked to Lars about “Why Kubernetes”. There is a follow-up YouTube stream that showed how to automate deploys for an Elixir app using K3s & ArgoCD. More than a year later, how does Lars think about running applications in production? What does simple & straightforward mean to him? Gerhard's favourite: what is “human scale deployments”?

nova.rs
Podcast DLZ and Snežana Čongradin: Are mass arrests of regime critics coming?

Jan 18, 2023 (77:03)


Everyone gets sick sometimes, and so did Nenad Kulačin. The ailing hero stayed home in bed, and the whole burden of recording DLZ fell on the frail shoulders of Marko Vidojković. Luckily for him, for Nenad, and for the viewers, the guest in this episode is the diva of Serbian journalism and star of the daily newspaper "Danas", Snežana Čongradin, who joined Marko from the opening credits to the "good night" segment, feeling as much at home in Nenad's armchair as in her own. Just as we have a hybrid state, we have a hybrid episode of DLZ, in which the video clips served not only for commentary but also to open up big and important social topics. Snežana spoke about the threats facing journalists in Serbia and about the fear that bloodshed will break out in Bosnia and Herzegovina; the clip about the "Progressive Party man from the casino" brought her to tears, and there was much, much more. In the Donkey's Corner (Magareći kutak) you will be able to watch Šapić shrinking more and more before our eyes. DLZ, simultaneously on our portal and on YouTube!

A res, tega ne veš?
127: Why did Bruce Lee die?

Jan 16, 2023 (20:26)


In this episode you learn four things for the price of one! You learn how to poop the biblical way, why Bruce Lee died, which fruit people used to rent, and which city is the “One mile city”! Happy listening! … You click, you listen, you learn! Do you like the podcast? You can support it here

PurePerformance
Learning from Incidents is what good SREs do with Laura Nolan

Jan 16, 2023 (49:47)


Incidents happen! According to Laura Nolan, who was an SRE at Google and Slack, healthy organizations should take proper time to analyze and learn from them. This will improve future incident response as well as overall system resiliency.

Tune in to this episode and hear Laura's tips & tricks on what makes a good SRE organization. It starts with doing good write-ups of incidents and doing your research on incident reports of software and services that you are looking into using. We also spent a good amount of time discussing root cause analysis, where she highlighted an incident that happened during her time at Google and what she learned about outdated alerting.

Thanks, Laura, for a great discussion and lots of insights.

Here are the additional links we discussed during the podcast:
Laura on LinkedIn: https://www.linkedin.com/in/laura-nolan-bb7429/
Laura on Twitter: https://twitter.com/lauralifts
Incident Template talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-break
What SRE could be talk @ SRECon: https://www.usenix.org/conference/srecon22emea/presentation/nolan-sre
Howie Post-Incident Guide: https://www.jeli.io/howie/welcome
My philosophy on Alerting article: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

Algorütm | Geenius.ee
12.01 Algorütm: Modern DevOps/SRE stack, especially GitOps

Jan 12, 2023 (60:54)


Today we talk about the DevOps stack and what is new in it: how virtualization has evolved, which new tools are in use, and how a company might organize this area. Talking points: What are the hallmarks of a modern DevOps stack? Which choices have to be made before you start building anything? Kubernetes, Helm. Is GitOps the best thing since sliced bread? Or which other metrics does an SRE team track and optimize? What specialization options exist within DevOps? The guests are Priit Pääsukene and Siim Tiilen from Veriff. Algorütm is hosted by Priit Liivak from Nortal, Martin Kapp from Pipedrive, and Tiit Paananen from Veriff.

Imagen Informativa Primera Emisión
Mexico and the United States agree on a new migration plan

Jan 12, 2023 (10:31)


Roberto Velasco, head of the North America Unit at the SRE (Mexico's Secretaría de Relaciones Exteriores), spoke in an interview on Primera Emisión with Pascal Beltrán del Río about the agreements reached at the North American Leaders' Summit, the most important of which is the plan for migrants.

Manuel López San Martín
"Differences between Mexico, the US, and Canada were not the central topic of the Summit"

Jan 12, 2023 (8:57)


In an interview, Roberto Velasco, head of the North America Unit at the SRE, spoke about the 10th North American Leaders' Summit, which was held in the Mexican capital.

Ship It! DevOps, Infra, Cloud Native
The hard parts of platform engineering

Jan 11, 2023 (77:05), transcription available


Marcos Nils has been into platform engineering for the best part of the last decade. He helped architect & build developer platforms using VMs & OpenStack, containers with Docker, and even Kubernetes. He did this at startups with 10 people, as well as large, publicly traded companies with 1000+ software engineers. Today we talk with Marcos about the hard parts of platform engineering.

Changelog Master Feed
Bare metal meets Talos Linux (the K8s OS) (Ship It! #84)

Jan 5, 2023 (64:00)


Welcome to 2023! A new year is the perfect time to start with a fresh perspective. Given a few bare metal hosts with fast, local storage, how would you run your workloads on them? Would you cluster them for redundancy? What operating system would you choose? Steve Francis, CEO at Sidero Labs and Andrew Rynhard, CTO at Sidero Labs join us today to talk about running Talos Linux on bare metal.

Ship It! DevOps, Infra, Cloud Native
Bare metal meets Talos Linux (the K8s OS)

Jan 5, 2023 (64:00), transcription available


Welcome to 2023! A new year is the perfect time to start with a fresh perspective. Given a few bare metal hosts with fast, local storage, how would you run your workloads on them? Would you cluster them for redundancy? What operating system would you choose? Steve Francis, CEO at Sidero Labs and Andrew Rynhard, CTO at Sidero Labs join us today to talk about running Talos Linux on bare metal.

Giant Robots Smashing Into Other Giant Robots
456: Jeli.io with Laura Maguire

Jan 5, 2023 (46:37)


Laura Maguire is a Researcher at Jeli.io, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Victoria talks to Laura about incident management, giving companies a powerful tool to learn from their incidents, and what types of customers are ideal for taking on a platform like Jeli.io.

Jeli.io (https://www.jeli.io/)
Follow Jeli.io on Instagram (https://www.instagram.com/jeli_io/), Twitter (https://twitter.com/jeli_io) or LinkedIn (https://www.linkedin.com/company/jeli-inc/).
Follow Laura Maguire on Twitter (https://twitter.com/LauraMDMaguire) or LinkedIn (https://www.linkedin.com/in/lauramaguire/).
Follow thoughtbot on Twitter (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/).
Become a Sponsor (https://thoughtbot.com/sponsorship) of Giant Robots!

Transcript:

VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots Podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Laura Maguire, Researcher at Jeli, the first dedicated incident analysis platform that combines more comprehensive data to deliver more proactive solutions and identify problems. Laura, thank you for joining me.

LAURA: Thanks for having me, Victoria.

VICTORIA: This might be a very introductory level question but just right off the bat, what is an incident?

LAURA: What we find is a lot of companies define this very differently across the space, but typically, it's where they are seeing an impact, either a customer impact or a degradation of their service. This can be either formally, it kind of impacts their SLOs or their SLAs, or informally it's something that someone on the team notices or someone, you know, one of their users notices as being degraded performance or something not working as intended.

VICTORIA: Gotcha.
From my background being in IT operations, I'm familiar with incidents, and it's been a practice in IT for a long time. But what brought you to be a part of building this platform and creating a product around incidents?

LAURA: I am a, let's say, recovering safety professional.

VICTORIA: [chuckles]

LAURA: I started my career in the safety and risk management realm within natural resource industries in the physical world. And so I worked with people who were at the sharp end in high-risk, high-consequence type work. And they were really navigating risk and navigating safety in the real world. And as I was working in this domain, I noticed that there was a delta between what was being said created safety and helped risk management and what I was actually seeing with the people that I was working with on the front lines. And so I started to pull the thread on this, and I thought, is work as done really the same as work as written or work as prescribed? And what I found was a whole field of research, a whole field of practice around thinking about safety and risk management in the world of cognitive work. And so this is how people think about risk, how they manage risk, and how do they interpret change and events in the world around them. And so as I started to do my master's degree in human factors and system safety and then later my Ph.D. in cognitive systems engineering, I realized that whether you are on the frontlines of a wildland fire or you're on the frontlines of responding to an incident in the software realm, the ways in which people detect, diagnose, and repair the issues that they're facing are quite similar in terms of the cognitive work. And so when I was starting my Ph.D. work, I was working with Dr. David Woods at the Cognitive Systems Engineering Lab at The Ohio State University. And I came into it, and I was thinking I'm going to work with astronauts, or with fighter pilots, or emergency room doctors, these really exciting domains.
And he was like, "We're going to have you work with software engineers." And at first, I really failed to see the connection there, but as I started to learn more about site reliability engineering, about DevOps, about the continuous deployment, continuous integration world, I realized software engineers are really at the forefront of managing critical digital infrastructure. They're keeping up the systems that run society, both for recreation and pleasure in the sense of Netflix, for example, as well as the critical functions within society like our 911 call routing systems, our financial markets. And so the ability to study how software engineers detect outages, manage outages, and work together collaboratively across the team was really giving us a way to study this kind of work that could actually feed back into other types of domains like emergency response, like emergency rooms, and even back to the fighter pilots and astronauts.

VICTORIA: Wow, that's so interesting. And so is your research that went into your Ph.D. did that help you help define the product strategy and kind of market fit for what you've been building at Jeli?

LAURA: Yeah, absolutely. So Nora Jones, who is the founder and CEO of Jeli, reached out to me at a conference and told me a little bit about what she was thinking about, about how she wanted to support software engineers using a lot of this literature and a lot of the learnings from these other domains to build this product to help support incident management in software engineering.
So we base a lot of our thinking around how to help support this cognitive work and how to help resilient performance in these very dynamic, these very changing large scale, you know, distributed software systems on this research, as well as the research that we do with our own users and with our own members from the Learning from Incidents in Software Engineering Slack community that Nora and several other fairly prominent names within the software community started, Lorin Hochstein, John Allspaw, Dr. Richard Cook, Jessica DeVita, Ryan Kitchens, and I may be missing someone else but...and myself, oh, Will Gallego as well. Yeah, we based a lot of our understandings, really deep qualitative understandings of what is work like for software engineers when they're, you know, in continuous deployment type environments. And we've translated this into building a product that we think helps but not hinders by getting in the way of engineers while they're under time pressure and there's a lot of uncertainty. And there's often quite a bit of stress involved with responding to incidents.

VICTORIA: Right. And you mentioned resilience engineering. And for those who don't know, David Woods, who you worked with on your Ph.D., wrote "Resilience Engineering: Concepts and Precepts." So maybe you could talk a little bit about resilience engineering and what that really means, not just in technology but in the people who were running the tools, right?

LAURA: Yeah. So resilience engineering is different from how we think about protecting and defending our software systems. And it's different in the sense that we aren't just thinking about how do we prevent incidents from happening again, like, how do we fix things that have happened to us in the past? But how do we better understand the ways in which our systems operate under a wide variety of conditions? So that includes normal operating conditions as well as abnormal or anomalous operating conditions, such as an incident response.
And so resilience engineering was kind of this way of thinking differently about predicting failure, about managing failure, and navigating these kinds of worlds. And one of the fundamental differences about it is it sees people as being the most adaptive component within the system of work. So we can have really good processes and practices around deploying code; we can institute things like cross-checking and peer review of code; we can have really good robust backup and failover systems, but ultimately, it's very likely that in these kinds of complex and adaptive always-changing systems that you're going to encounter problems that you weren't able to anticipate. And so this is where the resilience part comes in because if you're faced with a novel problem, if you're faced with an issue you've never seen before, or a hidden dependency within your system, or an unanticipated failure mode, you have to adapt. You have to be able to take all of the information that's available to you in the moment. You have to interpret that in real-time. You have to think of who else might have skills, knowledge, expertise, access to information, or access to certain kinds of systems or software components. And you have to bring all of those people together in real-time to be able to manage the problem at hand. And so this is really quite a different way of thinking about supporting this work than just let's keep the runbooks updated, and let's make sure that we can write prescriptive processes for everything that we're going to encounter. Because this really is the difference that I saw when I was talking earlier about that work as done versus work as prescribed. The rules don't cover all of the situations. And so you have to think of how do you help people adapt? How do you help people access information in real-time to be able to handle unforeseen failures?

VICTORIA: Right. That makes a lot of sense.
It's an interesting evolution of site reliability engineering where you're thinking about the users' experience of your site. It's also thinking about the people who are running your site and what their experience is, and what freedom they have to be able to solve the problems that you wouldn't be able to predict, right?

LAURA: Yeah, it's a really good point, actually, because there is sort of this double layer in the product that we are building. So, as you mentioned earlier, we are an incident analysis platform, and so what does that mean? Well, it means that we pull in data whenever there's been an incident, and we help you to look at it a little bit more deeply than you may if you're just following a template and sort of reconstructing a timeline. And so we pull in the actual Slack data that, you know, say, an ops channel or an incident channel that's been spun up following a report of a degraded performance or of an outage. And we look very closely at how did people talk to one another? Who did they bring into the incident? What kinds of things did they think were relevant and important at different points in time? And in doing this, it helps us to understand what information was available to people at different points in time. Because after the incident and after it's been resolved, people often look back and say, "Oh, there's nothing we can learn from that. We figured out what it was." But if we go back and we start looking at how people detected it, how they diagnosed it, who they brought into the event, we can start to unpack these patterns and these ways of understanding how do people work together? What information is useful at different points in time? Which helps us get a deeper understanding of how our systems actually work and how they actually fail.

VICTORIA: Right. And I see there are a few different ways the platform does that: there's a narrative builder, a people view, and also a visual timeline.
So, do you find that combining all those things together really gives companies a powerful tool to learn from their incidents?

LAURA: Yeah. So let me talk a little bit about each of those different components. Our MVP of the product we started out with this understanding of the incident analyst and the incident investigator who, you know, was ready to dive in and ready to understand their incident and apply some qualitative analysis techniques to thinking about their incidents. And what we found was there are a number of these people who are really interested in this deep dive within the software industry. But there's a broader subset of folks that they work with who maybe only do these kinds of incident analyses every once in a while, and they're not as interested in going quite as deep. And so the narrative builder is really this kind of bridge between those two types of users. And what it does is helps construct a timeline which is typically what most companies do to help drive the discussion that they might have in a post-mortem or to drive their kind of findings in their summary report. And it helps them take this closer look at the interactions that happened in that Slack transcript and raise questions about what kinds of uncertainties there were, point out who was involved, or interesting aspects of the event at that point in time. And it helps them to summarize what was happening. What did people think was happening at this point in time to create this story about the incident? And the story element is really important because we all learn from stories. It helps bring to life some of the details about what was hard, who was involved, how did they get brought in, what the sources of technical failure were, and whether those were easy or difficult to understand and to repair once the source of the failure was actually understood. And so that narrative builder helps reconstruct this timeline in a much richer way but also do it very efficiently.
And as you mentioned, the visual timeline is something that we've created to help that lightweight user or that every once in a while user to go a little bit deeper on their analysis. And how we do that is because it lays out the progression of the event in a way that helps you see, oh, this maybe wasn't straightforward. We didn't detect it in the beginning, and then diagnose it, and then repair it at the end. What happened actually was the detection was intermittent. The signals about what was going wrong were intermittent, and so that was going on in parallel with the diagnosis. The diagnosis took a really long time, and that may have been because we can also see the repair was happening concurrently. And so it starts to show these kinds of characteristics about whether the incident was difficult, whether it was challenging and hard, or whether it was simple and straightforward. This helps lend a bit more depth to metrics like MTTR and TTD by saying, oh, there was a lot more going on in this incident than we initially thought.

The last thing that you mentioned was the people view, and so that really sets our product apart from other products in that we look at the sociotechnical system. So it's not just about the software that broke; it is about who was involved in managing that system, in repairing that system, and in communicating about that system outwardly. And so the people view this kind of pulls in some HR data. It helps us to understand who was involved. How long have they been in their role? Were they on-call? Were they not on-call? And other kinds of relevant details that show us what was their engagement or their interaction with this event. And so when we start to bring in the socio part of the sociotechnical system, we can identify things like what knowledge do we have within the organization? Is that knowledge well-distributed, or is it just isolated in one or two people?
And so those people are constantly getting pulled into incidents when they may be not on-call, which can start to show us whether or not these folks are in danger of burning out or whether their knowledge might need to be transferred more broadly throughout the organization. So this is kind of where the resilience piece comes in because it helps us to distribute knowledge. It helps us to identify who is relevant and useful and how do they partner and collaborate with other people, and their knowledge and skill sets to be able to manage some of the outages that they face?

VICTORIA: That's wonderful because one of my follow-up questions would be, as a CEO, as a founder, what kind of insights or choices do you get to make now that you have this insight to help make your team more resilient? [laughs]

LAURA: So if this is a manager, or a founder, or a CEO that is looking at their data in Jeli, they can start to understand how to resource their teams more appropriately, as I mentioned, how to spread that knowledge around. They can start to see what parts of their system are creating the most problems or what parts of their system do they have maybe less insight into how it works, how it interacts with other parts of the system, and what this actually means for their ability to meet their SLOs or their SLAs. So it gives you a more in-depth understanding of how your business is actually operating on both the technical side of things, as well as on the people side of things.

VICTORIA: That makes a lot of sense. Thank you for that overview of the platform. There's the incident analysis platform, and you also have the bot, the response chatbot. Can you tell me a little bit more about that?

LAURA: Yeah, absolutely. We think that incident management should be conducted wherever your work actually takes place, and so for most of our customers and a lot of folks that we know about in the industry, that's Slack.
And so, if you are communicating in real-time with your team in Slack, we think that you should stay there. And so, we built this incident management bot that is free and will be free for the lifetime of the product. Because we think that this is really the fundamental basis for helping you manage your incidents more efficiently and more effectively. So it's a pretty lightweight bot. It gives kind of some guardrails or some guidance around collaboration by spinning up a new incident channel, helping you to bring the right kinds of responders into that, helping you to communicate to interested stakeholders by broadcasting to channels they might be in. It kind of nudges you to think about how to communicate about what's happening during different stages of the event progression. And so it's prompting you in a very lightweight way; hey, do you have a status update? Do you have a summary of what the current thinking is? What are the hypotheses about what's going on? Who's conducting what kinds of activities right now? So that if I'm a responder that's coming into the event 20-30 minutes after it started, I can very quickly come up to speed, understand what's going on, who's doing what, and figure out what's useful for me to do to help step in and not disrupt the incident management that's underway right now. Our users can choose to use the bot independently of the incident analysis platform. But of course, being able to ingest that incident into Jeli helps you understand who's been involved in the incident, if they've been involved in similar incidents in the past, and helps them start to see some patterns and some themes that emerge over time when you start to look at incidents across the organization.

VICTORIA: That makes sense. And I love that it's free and that there's something for every type of organization to take advantage of there.
And I wonder if at Jeli you have data about what type of customer is it who'd be targeted or really ideal to take on this kind of platform.

LAURA: So most organizations...I was actually recently at SREcon EMEA, and there was a really interesting series of talks; one was SRE for Enterprise, and the next talk was SRE for Startups. And so it was a very thought-provoking discussion around is SRE for everyone, so site reliability engineering? Even smaller teams are starting to have to be responsible for reliability and responsible for running their service. And so we kind of have built our platform thinking about how do we help not just big enterprises or organizations that may have dedicated teams for this but also small startups to learn from their incidents. So internally, we actually call incidents opportunities as in they are learning opportunities for checking out how does your system actually work? How do your people work together? What things were difficult and challenging about the incident? And how do you talk about those things as a team to help create more resilient performance in future? So in terms of an ideal customer, it's really folks that are interested in conducting these sort of lightweight but in-depth looks at how their system actually works on both the people side of things and the technical side of things. Those who we found are most successful with our product are interested in not so much figuring out who did the thing and who can they blame for the incident itself but rather how do they learn from what happened? And would another engineer, or another product owner, another customer service representative, whoever the incident may be sort of focused around, would another person in their shoes have taken the same actions that they took or made the same decisions that they made?
Which helps us understand from a systems level how do we repair or how do we adjust the system of work surrounding folks so that they are better supported when they're faced with uncertainty, or with that kind of time pressure, or that ambiguity about what's actually going on?

VICTORIA: And I love that you said that because part of the reason [laughs] I invited you on to the podcast is that a lot of companies I have experience with don't think about incidents until it happens to them, and then it can be a scramble. It can impact their customer base. It can stress their team out. But if you go about creating...the term obviously you all use is psychological safety on your team, and maybe you use some of the free tools from Jeli like the Post-Incident Guide and the Incident Analysis 101 blog to set your team up for success from the beginning, then you can increase your customer loyalty and your team loyalty as well to the company. Is that your experience?

LAURA: Yeah, absolutely. So one thing that I have learned throughout my career, you know, starting way back in forestry and looking at safety and risk in that domain, was as soon as there is an accident or even a serious near miss, right away, everybody gets sweaty palms. Everybody is concerned about, uh-oh, am I going to get blamed for this? Am I going to get fired? Am I going to get publicly shamed for the decisions that I made when I was in this situation? And what that response, that reaction does is it drives a lot of the communication and a lot of the understanding of the conditions that that person was in. It drives that underground. And it's important to allow people to talk about here's what I was seeing, here's what I was experiencing because, in these kinds of complex systems, information is not readily available to people. The signals are not always coming through loud and clear about what's going on or about what the appropriate actions to take are. Instead, it's messy; it's loud, it's noisy.
There are usually multiple different demands on that person's attention and on their time, and they're often managing trade-offs: do I keep the system down so that I can gather more information about what's actually going on, or do I just try and bring it up as quickly as I can so that there's less impact to users? Those kinds of decisions are having to be made under pressure. So when we create these conditions of psychological safety, when we say you know what? This happened. We want to learn from it. We've already made this investment. Richard Cook mentioned in the very first SNAFU Catchers Report, which was a report that came out of Ohio State, that incidents are unplanned investments into understanding how your system works. And so you've already had the incident. You've already paid the price of that downtime or of that outage. So you might as well extract some learning from it so that you can help create a safer and more resilient system in the future. So by helping people to reconstruct what was actually happening in real-time, not what they were retrospectively saying, "Oh, I should have done this," well, you didn't do that. So let's understand why you thought at that moment in time that was the right way to respond because, more than likely, other people in that same position would have made that same choice. And so it helps us to think more broadly about ways that we can support decision-making and sense-making under conditions of stress and uncertainty. And ultimately, that helps your system be more resilient and be more reliable for your customers.

VICTORIA: What a great reframing: unplanned investment. [laughs] And if you don't learn from it, then you're going to lose out on what you've already invested that time in resolving it, right?

LAURA: Absolutely.

MID-ROLL AD: Are you an entrepreneur or start-up founder looking to gain confidence in the way forward for your idea?
At thoughtbot, we know you're tight on time and investment, which is why we've created targeted 1-hour remote workshops to help you develop a concrete plan for your product's next steps. Over four interactive sessions, we work with you on research, product design sprint, critical path, and presentation prep so that you and your team are better equipped with the skills and knowledge for success. Find out how we can help you move the needle at: tbot.io/entrepreneurs. VICTORIA: Getting more into that psychological safety and how to create that culture where people feel safe telling about what really happened, but how does that relate to...Jeli says that they are a people software. [laughs] Talk to me more about that. Like, what advice do you give founders and CEOs on how to create that psychological safety which makes them be more resilient in these types of incidents? LAURA: So you mentioned the Howie Guide that we published last year, and this is our guidance around how to do incident analysis, how to help your team start to learn from their incidents, and Howie stands for how we got here. And that's really important, that language because what it says is there's a history that led up to this incident. And most teams, when they've had an outage, they'll kind of look backwards from that outage, maybe an hour, maybe a day, maybe to the last deploy. But they don't think about how the decisions got made to use that piece of software in the first place. They don't think about how did engineers actually get on-boarded to being on-call. They don't necessarily think about what kinds of skills, and knowledge, and expertise when we're hiring a DevOps engineer, and I'm using air quotes here or an SRE. What kinds of skills and knowledge do they actually have? Those are very broad terms. And what it means to be a DevOps engineer or an SRE is quite underspecified. And so the knowledge behind the folks that you might hire into the company is going to necessarily be very diverse. 
It's going to be partial and incomplete in many ways because not everyone can know everything about the system. And so, we need to have multiple diverse perspectives about how the system works, how our customers use that system, what kinds of pressures and constraints exist within our company that allow us some possibilities over others. We need to bring all of those perspectives together to get a more reflective picture of what was actually happening before this incident took place and how we actually got here. This reframing helps a lot of people disarm that initial defensiveness response or that initial, oh, shoot; I'm going to get in trouble for this kind of response. And it says to them, "Hey, you're a part of this bigger system of work. You are only one piece of this puzzle. And what we want to try and do is understand what was happening within the company, not just what you did, what you said, and what you decided." So once people realize that you're not just trying to find fault or place blame, but you're really trying to understand their work, and you're trying to understand their work with other teams and other vendors, and trying to understand their work relative to the competing demands that were going on, so those are some of the things that help create psychological safety. About ten years ago, John Allspaw and the team at Etsy put out The Etsy Debriefing Facilitation Guide, which also poses a number of questions and helps to frame the post-incident learnings in a way that moves it from the individual and looks more collectively at the company as a whole. 
And so these things are helpful for founders or for CEOs to help bring forward more information about what's really going on, more information about what are the real risks and threats and opportunities within the company, and gives you an opportunity to step back and do what we call microlearning, which is sharing knowledge about how the system works, sharing understandings of what people think is going on, and what people know about the system. We don't typically talk about those things unless there's a reason to, and incidents kind of give us that reason because they're uncomfortable and they can be painful. They can be very public. They can be very disruptive to what we think about how resilient and reliable we actually are. And so if you can kind of step away from this defensiveness and step away from this need to place blame and instead try and understand the conditions, you will get a lot more learning and a lot more resilience and reliability out of your teams and out of your systems. VICTORIA: That makes sense to me. And I'd like to draw a connection between that and some other things you mentioned with The 2022 Accelerate State of DevOps Report that highlights that the people who are often responding to those incidents or in that high-stress situation tend to be historically underrepresented or historically excluded groups. And so do you see that having this insight into both who is actually taking on a lot of the work when these incidents happen and creating that psychological safety can make a better environment for diversity, equity, inclusion at a company as well? 
LAURA: Well, I think anytime you work to establish trust and transparency, and you focus on recognizing the skills that people do have, the knowledge that they do have, and not over assuming that someone knows something or that they have been involved in the discussions that may have been relevant to an incident, anytime you focus on that trust and transparency you are really signaling to people within your organization that you value their contributions and that you recognize that they've come to work and trying to do a good job. But they have multiple competing demands on their attention and on their time. And so we're not making assumptions about people being complacent, or people being reckless or being sloppy in their work. So that creates an environment where people feel more willing to speak up and to talk about some of the challenges that they might face, to talk about the ways in which it's not clear to them how certain parts of the system work or how certain teams actually operate. So you're just opening the channels for communication, which helps to share more knowledge. It helps to share more information about what teams are doing at different points in time. And this helps people to preemptively anticipate how a change that they might be making in their part of the system could be influencing up or downstream teams. And so this helps create more resilience because now you're thinking laterally about your system and about your involvement across teams and across boundary lines. And an example of this is if a marketing team...this is a story that Nora tells quite a bit; if a marketing team is, say, launching a Super Bowl commercial for their company but they don't actually tell the engineers on-call that that is about to happen, you can create all sorts of breakdowns when all of a sudden you have this surge of traffic to your website because people see the Super Bowl commercial and they want to go to the site. 
And then you have a single person who's trying to respond to that in real-time. So, instead, when you do start thinking about that trust and transparency, you're helping teams to help each other and to think more broadly about how their work is actually impacting other parts of the system. So from a diversity and inclusion and underrepresented groups perspective, this is creating the conditions for more people to be involved, more people to feel like their voice is going to be heard, and that their perspective actually matters. VICTORIA: That sounds really powerful, and I'm glad we were able to touch on that. Shifting gears a little bit, I wanted to talk about two different questions; so one is if you could travel back in time to when Jeli first started, what advice would you give yourself, your past self? LAURA: I would encourage myself to recognize that our ability to experiment is fundamental to our ability to learn. And learning is what helps us to iterate faster. Learning is what helps us to reflect on the tool that we're building or the feature that we're building and what this actually means to our users. I actually copped that advice to myself from CEO Zoran Perkov of the Long-Term Stock Exchange. They launched a whole new stock market during the pandemic with a fully remote team. And I had interviewed him for an article that I wrote about resilient leadership. And he said to me, like, "My job as a CEO is 100% about protecting our ability to experiment as a company because if we stop learning, we're not going to be able to iterate. We're not going to be able to adapt to the changes that we see in the market and in our users." So I think I would tell myself to continually experiment. 
One of the things that I talk to our customers about a lot because many of them are implementing new incident management programs or they're trying to level up their engineering teams around incident analysis, and I would say, "This doesn't have to be a fully-fleshed out program where you know all of the ways in which this is going to unfold." It's really about trying experiments, conduct some training, start small. Do one incident analysis on a really particularly spicy incident that you may have had or a really challenging incident where a lot of people were surprised by what happened. Bring together that group and say, "Hey, we're going to try something a little bit different here. We'll use some questions from the Howie Guide. We'll use the format and the structure from the Etsy Debriefing Guide. And we're just going to try and learn what we can about this event. We're not going to try and place blame. We're not going to try and generate corrective actions. We just want to see what we can learn from this." Then ask people that were involved, "How did this go? What did we learn from it? What should we do differently next time?" And continually iterate on those small, little experiments so that you can grow your product and grow your team's capacity. I think it took us a little bit of time to figure that out within the organization, but once we did, we were just able to collaborate more effectively work more effectively by integrating some of the feedback that we were getting from our users. And then the last piece of advice that I would give myself is to really invest in cross-discipline coordination and collaboration. Engineers, designers, researchers, CEOs they all have a different view of the product. They all have a different understanding of what the goals and priorities are. And those mental models of the product and of what the right thing to do is are constantly changing. 
And they all have different language that they use to talk about the product and to talk about their processes for integrating this understanding of the changing conditions and the changing user into the product. And so I would say invest in establishing common ground across the different disciplines within your team to be able to talk about what people are seeing, to be able to stop and identify when we're making assumptions about what other people know or what other people's orientation towards the problem or towards the product are. And spend a little bit of time saying, "When I say this is important, I'm saying it's important because of XYZ, not just this is important." So spending a little bit of time elaborating on what your mental model is and where you're drawing from can help the teams work more effectively together across those disciplines. VICTORIA: That's pretty powerful advice. You're iterating and experimenting at Jeli. What's on the horizon that you are...what new experiments are you excited about? LAURA: One of the things that has been front and center for us since we started is this idea of cross-incident analysis. And so we've kind of built out a number of different features within the product, being able to help tag the incident with the relevant services and technologies that were involved, being able to identify which teams were involved, and also being able to identify different kinds of themes or patterns that emerge from individual incidents. So all of this data that we can get from mostly just from the ingested incident itself or from the incident that you bring into Jeli but also from the analysis that you do on it this helps us start to be able to see across incidents what's happening not just with the technical side of things. So is it always Travis that is causing a problem? Are there components that work together that kind of have these really hidden and strange interdependencies that are really hard for the team to actually cope with? 
What kinds of themes are emerging across your suite of opportunities, your suite of incidents that you've ingested? Some of the things that we're starting to see from those experiments is an ability to look at where are your knowledge islands within your organization? Do you have an engineer who, if they were to leave, would take the majority of your systems knowledge about your database, or about your users, or about some critical aspect of your system that would disappear with all of that tacit knowledge? Or are there engineers that work really effectively together during really difficult incidents? And so you can start to unpack what are these characteristics of these people, and of these teams, and of these technologies that offer both opportunities or threats to your organization? So basically, what we're doing is we're helping you to see how your system performs under different kinds of conditions, which I think as a safety and risk professional working in a variety of different domains for the last 15 years, I think this is really where the rubber hits the road in helping teams be more reliable, and be more resilient, and more proactive about where investments in maintenance, or training, or headcount are going to have the biggest bang for your buck. VICTORIA: That makes a lot of sense. In my experience, sometimes those decisions are made more on intuition or on limited data so having a more full picture to rely on probably produces better results. [laughs] LAURA: Yeah, and I think that we all want to be data-driven, thinking about not only the quantitative data is how many incidents do we have around certain parts of the system, or certain teams, or certain services? But also, the qualitative side of things is what does this actually mean? And what does this mean to our ability to grow and change over time and to scale? The partnership of that quantitative data and qualitative data means we're being data-driven on a whole other level. VICTORIA: Wonderful. 
And it seems like we're getting close to the end of our time here. Is there anything else you want to give as a final takeaway to our listeners? LAURA: Yeah. So I think that we are, you know, as a domain, as a field, software engineering is increasingly becoming responsible for not only critical infrastructure within society, but we have a responsibility to our users and to each other within our companies to help make work better, help make our services more reliable and more resilient over time. And there's a variety of lessons that we can learn from other domains. As I mentioned before, aviation, healthcare, nuclear power all of those kinds of domains have been thinking about supporting cognitive work and supporting frontline operators. And we can learn from this history and this literature that exists out there. There is a GitHub repo that Lorin Hochstein has curated with a number of other folks with the industry that points to some of these resources. And as well, we'll be hosting the first Learning From Incidents in Software Engineering Conference in Denver in February, February 15 and 16th. And one feature of this conference that I'm super excited about is affectionately called CasesConf. And it is going to be an opportunity for software engineers from a variety of organizations to tell real stories about incidents that they had, how they handled them, what was challenging, what went surprisingly well, and just what is actually going on within their organizations. And this is kind of a new thing for the software industry to be talking very publicly about failures and sharing the messy details of our incidents. This won't be a recorded part of the conference. It is going to be conducted under the Chatham House Rule, which is participants who are in the room while these stories are being told can share some of the stories but not any identifying details about the company or the engineers that were involved. 
And so these kinds of real-world situations help us, as I talked about before with psychological safety, to say this is the reality of operating complex systems. They're going to fail. We're going to have to learn from them. And the more that we can talk at an industry level about what's going on and about what kinds of things are creating problems or opportunities for each other, the more we're going to be able to lift the bar for the industry as a whole. So you can check out register.learningfromincidents.io for more information about the conference. And we can link Lorin's resilience engineering GitHub repo in the notes as well.

VICTORIA: Wonderful. Well, I was looking for an excuse to come to Denver in February anyways.

LAURA: We would love to have ya.

VICTORIA: Thank you. And thank you so much for taking time to share with us today, Laura. You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. And you can find me on Twitter @victori_ousg. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time.

ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Special Guest: Laura Maguire.

The Cloudcast
2023 Look Ahead to Platform Engineering

Jan 4, 2023 · 39:39


Rob Hirschfeld (@zehicle, Founder & CEO @rackngo) talks about how Platform Engineering has evolved from DevOps and SRE and how it aligns with Cloud Platforms.

SHOW: 682

CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw

CHECK OUT OUR NEW PODCAST - "CLOUDCAST BASICS"

SHOW SPONSORS:
Eaton Homepage
Eaton and Tripp Lite have joined forces to bring more sanity to IT pros' days, every day. Visit www.eaton.com/audio to learn more!
Datadog Application Monitoring: Modern Application Performance Monitoring
Get started monitoring service dependencies to eliminate latency and errors and enhance your users' app experience with a free 14-day Datadog trial. Listeners of The Cloudcast will also receive a free Datadog T-shirt.

SHOW NOTES:
RackN website
What is Platform Engineering? - Gartner
Platform Engineering: What is it and Who Does it? - Newstack
Spotify Backstage

Topic 1 - Welcome back to the show, it was great to see you in person at events recently. What have you been focusing on the last couple of years?
Topic 2 - There's been a lot of discussion about Platform Engineering over the last 6+ months. You've been around this space for a while. We're trying to understand if PE is different from DevOps or SRE or Cloud Platform in the past, or an evolution. Is PE just a common platform maintained with reusable tools, regardless of the infrastructure?
Topic 3 - I've heard people say that Cloud Platform and Platform Engineering are colleagues, where one owns/operates the platform, and the other is the “product manager” to the application teams. Is this realistic?
Topic 4 - What does “good” look like for Platform Engineering? Is the goal a frictionless developer experience? Are developer consistency and efficiency valid goals? Are there KPIs or Metrics that “good” teams are striving towards?
Topic 5 - Any interesting technologies that you're seeing that make Platform Engineering easier, or more manageable?
Topic 6 - Any team dynamics that you're seeing that make Platform Engineering easier, or more manageable?

FEEDBACK?
Email: show at the cloudcast dot net
Twitter: @thecloudcastnet

Manuel López San Martín
North American Leaders' Summit: What topics will AMLO, Biden, and Trudeau discuss?

Jan 4, 2023 · 9:38


In an interview, Roberto Velasco, head of the North America unit at Mexico's Secretaría de Relaciones Exteriores (Secretariat of Foreign Affairs), gave details of the trilateral meeting between Mexico, the United States, and Canada.

Duhovna misel
Stanislav Kerin: The Fullness of Happiness

Jan 4, 2023 · 6:14


Information about events around the world rarely reaches everyone. Even so, we learned what happened in Paris and what is going on around the world. There are also events that we cannot find in the media. Again and again, people need help to realize their longing for happiness properly in their own lives. The events of recent years show how many people have had this stolen from them or even destroyed. Faced with refugees, we became frightened and tried to avoid the matter. In receiving refugees on the shores of the island of Lampedusa in the Mediterranean, the doctor Pietro Bartolo did everything he could. Each one brought their own story with them. The most harrowing were the stories of children. They were trafficked so that their organs could be sold. Then they were dumped on a rubbish heap. Many events never even make it into the media. We do not see these events or hear about them. All of this is life today. Open human eyes notice the decline of a civilization that has done much good for all of humanity. What is happening? I think we are drawn in by talk of the right to happiness. We try to reach it through various advertisers, who assure us that their products are the true path to happiness. And we believe it! BUT! Happiness does not exist on its own. It is always tied to the people around us. Shut inside our own desires and momentary feelings of happiness, we begin to decline. We simply cannot reach the promised happiness and joy. Why does no one talk about this? Why does no one point people in the right direction? WRONG! Many people do speak about their own path, about their mistaken understanding of life, and about true values. Among them is the successful young businessman Jean-Marc Potdevin, who says of himself that at 40 he had everything a person could want: a good job, plenty of money, standing in society, and the means to afford whatever he wished for. Yet he came to realize that his enviable success had not satisfied a deeper desire, the desire for happiness. He changed his life.
He became free because he had the courage to leave behind the false freedom of '68, which was really enslavement to unbridled instincts (as he himself puts it). Again and again we need encouragement and role models to point us in the right direction. The question is whether we see these examples. We do not find the fullness of happiness in moments of joy; we find it in striving for others, for the happiness and joy of others. It takes a great deal of self-denial and of overcoming oneself. On this path we tire quickly, which is why we need people who encourage us and stand by us. Each of us can become a support and an example for others. This is a decision for life, for a culture of life; it is a decision for happiness. The decision is ours!

Chlani
TOP 22 CHLANI MOMENTS OF 2022

Jan 3, 2023 · 25:05


EPISODE 61

Voluntary donations: https://tinyurl.com/2v66h9yc

Thank you for a great year; without you there is no us. A new and even better year lies ahead, and we hope to keep creating great moments together.

TIMESTAMPS
00:00 - Intro
00:11 - Jure and the thieving monkeys
01:25 - Robert Kladnik and the "headphones"
02:12 - Rimanić's "break up" story
06:55 - How to show your butt
08:07 - JunioR rates Mene zebe
09:55 - Maja Grintal is tasty
11:00 - Masayah freestyle
11:50 - Damjan Murko bullies Matej
12:25 - Searching for a new female co-host
13:09 - New co-host Tim
13:47 - A win in Cannes
14:50 - Drill release party
15:54 - Tič and the cake
16:35 - The podcast's 1st anniversary
17:06 - Gaja Prestor has sex to her own track
17:46 - David Amaro sings
18:07 - Nika Krmec role-plays a Winx fairy
19:26 - Business gets done in the sauna
20:05 - Bullying Anabel
21:24 - Tina Mentol and the white stain
22:15 - Chlani visit President Pahor
23:46 - Eva Luna's sex on a jet ski
24:55 - Happy New Year 2023

FOLLOW US
YouTube: https://youtube.com/channel/UCiy2dirXGqygqSsiXZv9Ppg
Instagram: https://www.instagram.com/chlani.podcast/
TikTok: https://www.tiktok.com/@chlani.podcast

HOSTS
Jure: https://www.instagram.com/juresavron/
Matej: https://www.instagram.com/matejrimanic/
Tim: https://www.instagram.com/mit.t.tim/

ABOUT THE PODCAST
CHLANI. Pronounced »člani«, not "klani". But members ("člani") of what? No, no… This is not about membership in a club or an organization, nor in a clan. »Član« is a slang word that young people in particular use very often in the most beautiful part of Slovenia: the Coast. So, since you are here, let us welcome you: »Kje ste, člani!« ("Where are you, člani!") Three young creators form the backbone of this new Slovenian podcast. Thanks to the closeness we have built through spending time and working together, we give the podcast authenticity and keep the dynamic relaxed. We are often joined by interesting guests, both well known and lesser known, who liven up the episodes with their unique outlook on life and engaging personalities.
The topics we cover can be absurd and unusual, but we certainly touch on real-life topics as well.

Gospoda
Breakfast

Jan 1, 2023 · 15:53


Happy 2023!

PurePerformance
What happened in 2022 and where 2023 is taking us!

Jan 1, 2023 · 42:25


What a year 2022 was! We had 25 episodes with amazing guests from all over the world, covering topics from Kubernetes, OpenTelemetry, DevOps, SRE, Cloud Migrations, DNS, and Value Streams all the way to Persona Driven Engineering and drawing parallels with Digital Marketing. If you are new to our podcast, check out the playlist and listen to some of the episodes we mentioned during this one!

Now it's time to say thank you to our listeners for the continued support. After 5+ years of podcasting we still see rising download numbers, which is the best motivation for us to keep going. Stay tuned as we cover industry-relevant topics going into 2023 – or is it year 53? (Only those who listen to the full episode will know.)

Noticentro
Mexicans urged to take shelter from the protests in Bolivia

Dec 31, 2022 · 1:31


· CNDH issues a recommendation to the SSPC · Cold weather continues in Mexico City · France requires PCR tests for arrivals from China · More information in our podcast

My life as a programmer
Is there any difference between a DevOps engineer and a SRE?

Dec 31, 2022 · 11:13


Is there any difference between a DevOps engineer and a SRE?

Software Engineering Radio - The Podcast for Professional Software Developers
Episode 544: Ganesh Datta on DevOps vs Site Reliability Engineering

Dec 28, 2022 · 59:00


Ganesh Datta, CTO and cofounder of Cortex, joins SE Radio's Priyanka Raghavan to discuss site reliability engineering (SRE) vs DevOps. They examine the similarities and differences and how to use the two approaches together to build better software...

The Nonlinear Library
AF - Analogies between Software Reverse Engineering and Mechanistic Interpretability by Neel Nanda

Dec 26, 2022 · 16:18


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: Analogies between Software Reverse Engineering and Mechanistic Interpretability, published by Neel Nanda on December 26, 2022 on The AI Alignment Forum.

These are notes taken during a call with Itay Yona, an expert in software/hardware reverse engineering (SRE). Itay gave me an excellent distillation of key ideas and mindsets in the field, and we discussed analogies/disanalogies to mechanistic interpretability of neural networks. I'm generally very excited to learn about other fields of study that reverse engineer complex systems, and what relevant insights they may have (SRE, neuroscience, systems biology, etc). All mistakes are mine, and all insights are his!

My Takeaways

The underlying mindset actually feels pretty analogous!

I find it super interesting that they also think a lot about motifs (weird patterns and phenomena that only occur in specific contexts), and that these are often the first hook into understanding something weird and that you can then work backwards. (Not to be confused with the SRE use of hooking)

Also interesting that they also often focus on the inputs and outputs of the software as the starting point, to get a hook in, and then move on from there.

It's very key to have a deep, gears-level model of the system you're working with (how does a CPU work, how are things represented in memory, the stack, registers, etc)

The distinction between "newbies get caught up trying to understand every detail, experts think in higher-level abstractions, make educated guesses, and only zoom in on the details that matter" felt super interesting and surprising to me.

My attempt to translate it into mechanistic interpretability is that (if it is analogous):

There are certain principles and patterns by which networks learn, that we can identify and understand.

We likely will understand these by deeply reverse engineering specific parts of systems (especially toy systems) and digging into the details. But the goal here is to build intuitions and mental models, and a sense for how models work as a whole.

Once we have these intuitions and some solid grounding in what we do understand well, the best mindset for reverse engineering an unfamiliar system is to be less rigorous and more intuitive. Make educated guesses, look for partial evidence for or against hypotheses, think at a high-level and somewhat abstract mode about the system, and only zoom in on a specific part of the system to deeply reverse engineer once you've identified what to prioritise.

I have no idea if it is analogous, but that mindset aligns a fair bit with my intuitions about MI (though I consider the field to be much more in the "building intuitions by deeply engineering things" phase lol)

As Lawrence Chan notes, this is likely an example of a general pattern, where newbies (1) reason high-level in a very ungrounded way, (2) dig really rigorously into the details constantly and (3) build intuition that's actually grounded, and return to reasoning on a high-level. And that if you want to get to 3, you need to do a lot of digging through details in stage 2 first, while experts can make the mistake of recommending skipping to 3 (which in practice puts people at 1)

I'm surprised at the emphasis on prioritisation, and identifying which part of the software you care about. My mental picture was that the goal was to fully de-compile things to source code, but it sounds like that's rarely the goal and is extremely hard. But this aligns with my intuitions that a lot of what I want to do with a network is to localise the parts that are relevant to a specific task.

One approach to MI research that seems natural from a SRE perspective: Do extensive work reverse engineering toy models, and try to deeply understand the circuits there. Then, try to distill out motifs and find (ideally automated) tools to detec...

Getup Kubicast
#111 - Controversies with Gabriel de Biasi

Dec 22, 2022 · 52:11


In the spirit of stirring up controversy, our host interviews Gabriel de Biasi, an SRE with a master's degree in Computer Science, about whether you need a university degree to work in DevOps; seeking knowledge TikTok-style; life as an SRE at large companies vs. startups; the need (or not) to set limits on pods; and how much devs should know about Kubernetes to build their applications.

LINKS to the topics discussed on the show follow:
https://getup.io/kubicast/:
Kubicast #58 - Faculdade pra quê? (Why go to college?)
Kubicast #96 - Back to basic com Mateus Prado
Kubicast #93 - Por dentro do Tsuru (Inside Tsuru)
Twitter thread on CPU LIMITS: https://twitter.com/todaywasawesome/status/1575131715604860929?s=46&t=O4rHamFW0FjKKMI0L9Z0qQ

The participants' RECOMMENDATIONS:
Engenharia de Confiabilidade do Google (book by Jennifer Petoff)
A História do Kubernetes (The History of Kubernetes)
Part 1: https://youtu.be/BE77h7dmoQU
Part 2: https://youtu.be/318elIq37PE
Toc Toc (a film on Netflix)

Kubicast is produced by Getup, the only Brazilian company 100% focused on and specialized in Kubernetes. All episodes of the podcast are available on the Getup website and on the major digital audio platforms. Some of them are also on YouTube.

The Cloudcast
2022 Year in Review & 2023 Predictions

The Cloudcast

Play Episode Listen Later Dec 21, 2022 62:07


Aaron and Brian discuss the 2022 Year in Review, highlighting the biggest trends, as well as making 2023 predictions.
SHOW: 679
CLOUD NEWS OF THE WEEK - http://bit.ly/cloudcast-cnotw
SHOW SPONSORS:
- Eaton Homepage
- Eaton and Tripp Lite have joined forces to bring more sanity to IT pros' days, every day. Visit www.eaton.com/audio to learn more!
- FujiFilm. Your archival and backup data strategy, built on tape. Fujifilm tape is helping businesses get a handle on their vast amounts of data in the most secure, scalable and efficient way. Find out more at builtontape.fujifilmusa.com
- AWS Insiders is an edgy, entertaining podcast about the services and future of cloud computing at AWS. Listen to AWS Insiders in your favorite podcast player.
- Cloudfix Homepage
SHOW NOTES:
THE BASICS:
- The show grew nearly 20% YoY (2nd year in a row), with our first 2M listen year.
- The Cloudcast named to Top 20 Kubernetes resources of 2022
- The Cloudcast hosts named to “Who's Who of Cloud (2022)” list
- Thank you to all our sponsors throughout the year (Datadog, CloudZero, JumpCloud, Mergify, BMC, Teleport, NewRelic, StrongDM, Polyscale, LoadForge, NetApp, Revelo, Lightstep, Granulate, CDN77, Jetbrains, Eaton, Cloudfix)
THE BIG NEWS AREAS:
- Tech layoffs in 2HCY22
- VMware got acquired by Broadcom (will be part of CA+others)
- The US made a big investment in CHIPS
- NVIDIA's acquisition of ARM fell through
- AWS - $60B>$85B (+28%), Azure - $35B>$50B (+42%), GCP - $15B>$27B (+38%)
- Microsoft is now 50/50 in Software and Cloud revenues
- AWS re:Invent is different under Adam Selipsky
- Between Texts and Images, AI seemed to make a big leap
- WebAssembly (WASM) is starting to make noise in new ways (PaaS 2.0?)
- Is Platform Engineering replacing DevOps and SRE?
- Docker 2.0 is making money
2023 PREDICTIONS:
- Our 2020 Predictions
- Our 2021 Predictions
- Our 2022 Predictions
Aaron's Predictions:
- We'll see a Twitter clone founded by folks that left
- Azure will become #1 public cloud (pulled from 2022 predictions)
- Docker will become a unicorn again and prove everyone wrong
- 2023 will be the year of the down rounds: Worldwide: 450 unicorn and 24 decacorn
- A unicorn will go under
- Apple will finally give everyone a peek at their EV car in development, just to mess with Elon a bit.
Brian's Predictions:
- We'll start seeing some of the 2020-2022 unicorns acquired at sub-unicorn prices
- Serverless makes a comeback as a cheaper compute alternative
- FinOps conferences become a must-attend event
- GCP makes a huge hail-mary acquisition
FEEDBACK?
- Email: show at the cloudcast dot net
- Twitter: @thecloudcastnet

Ciro Gómez Leyva por la Mañana
What Is the Political Future Between Mexico and Peru?: Maximiliano Reyes, SRE

Ciro Gómez Leyva por la Mañana

Play Episode Listen Later Dec 21, 2022 16:23


Maximiliano Reyes, undersecretary for Latin America and the Caribbean at the SRE (Secretaría de Relaciones Exteriores, Mexico's foreign ministry), reports on the state of the bilateral relationship between the two countries and on what to expect from the stay of Pedro Castillo's family in Mexico.

The New Stack Podcast
Automation for Cloud Optimization

The New Stack Podcast

Play Episode Listen Later Dec 20, 2022 22:47


During the pandemic, many organizations sped up their move to the cloud, without fully understanding the costs, both human and financial, they would pay for the convenience and scalability of a digital transformation.
“They really didn't have a baseline,” said Mekka Williams, principal engineer at Spot by NetApp, in this episode of The New Stack Makers podcast. “And so those first cloud bills, I'm sure, were shocking, because you don't get a cloud bill when you run on your on-premises environment, or even your private cloud, where you've already paid the cost for the infrastructure that you're using.”
What's especially worrisome is that many of those costs are simply wasted, Williams said. “Most of the containerized applications running in Kubernetes clusters are running underutilized,” she said. “And anything that's underutilized in the cloud equates to waste. And if we want to be really lean and clean and use resources in a very efficient manner, we have to have really good cloud strategy in order to do that.”
This episode of The New Stack Makers, hosted by Heather Joslyn, TNS features editor, focused on CloudOps, which in this case stands for “cloud operations.” (It can also stand for “cloud optimization,” but more about that later.) The conversation was sponsored by Spot by NetApp.
Automation for Cloud Optimization
Many organizations that moved quickly to the cloud during the dog days of the pandemic have begun to revisit the decisions they made and update their strategies, Williams said. “We see some organizations that are trying to modernize their applications further, to make better use of the services that are available in the cloud,” she said. “The cloud is getting more complex as they grow and mature in their journey.
“And so they're looking for ways to simplify their operations. And, as always, keep their costs down. Keep things simple for their DevOps and SRE, to not incur additional technical debt, but still make the best use out of their cloud, wherever they are.”
Automation holds the key to CloudOps (both definitions), according to Williams. For starters, it makes teams more efficient. “The less tasks that your workforce have to perform manually, the more time they have to spend focused on business logic and being innovative,” Williams said. “Automation also helps you with repeatability. And it's less error-prone, and it helps you standardize. Really good automation simplifies your environment greatly.” Automating repetitive tasks can also help prevent your site reliability engineers (SREs) from burnout, she said.
Practicing “good data hygiene,” Williams said, also helps contain costs and reduce toil: “Making sure you're using the right tier of data, making sure you're not over-provisioned. And the type of storage you need, you don't need to pay top dollar for high-performing storage, if it's just backup data that doesn't get accessed that often.” Such practices are “good to know on-premises, but these are imperative to know when you're in the cloud,” she said, in order to reduce waste.
During this episode, Williams pointed to solutions in the Spot by NetApp portfolio that use automation to help make the most of cloud infrastructure, such as its flagship product, Elastigroup, which takes advantage of excess capacity to scale workloads. In June, Spot by NetApp acquired Instaclustr, a solution for managing open source database and streaming technologies. The company recognizes the growing importance of open source for enterprises. “We're paying attention to trends for cloud applications,” Williams said, “and we're growing the portfolio to address the needs that are top of mind for those customers.”
Check out the entire episode to learn more about CloudOps.
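The claim that underutilized capacity "equates to waste" can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: the `Workload` type, the core counts, and the per-core price are all hypothetical and are not taken from the episode or from any Spot by NetApp product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Workload:
    """One containerized workload; all figures used with it are hypothetical."""
    name: str
    requested_cores: float  # CPU reserved for the workload
    used_cores: float       # average CPU actually consumed


def idle_share(w: Workload) -> float:
    """Fraction of the reserved CPU that sits unused (0.0 to 1.0)."""
    if w.requested_cores <= 0:
        return 0.0
    return max(0.0, 1.0 - w.used_cores / w.requested_cores)


def wasted_cost(workloads: Iterable[Workload],
                price_per_core_hour: float,
                hours: float) -> float:
    """Cost of reserved-but-idle cores over a billing period."""
    idle_cores = sum(max(0.0, w.requested_cores - w.used_cores) for w in workloads)
    return idle_cores * price_per_core_hour * hours


if __name__ == "__main__":
    # Made-up fleet: one over-provisioned service, one fully used batch job.
    fleet = [Workload("api", 4.0, 1.0), Workload("batch", 2.0, 2.0)]
    for w in fleet:
        print(f"{w.name}: {idle_share(w):.0%} idle")
    print(f"wasted: ${wasted_cost(fleet, price_per_core_hour=0.25, hours=8):.2f}")
```

Real cost tooling would also account for memory, storage tiers, and pricing models; this only shows why a reserved-versus-used baseline is the first number to collect.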

Noticentro
-Argentina's national team was crowned champion of the Qatar 2022 World Cup

Noticentro

Play Episode Listen Later Dec 18, 2022 1:42


-Argentina's national team was crowned champion of the Qatar 2022 World Cup
-AMLO congratulated the Argentine team on winning the World Cup
-The SRE reported that 12 Mexican tourists stranded in Peru will be evacuated in the coming hours
-More information in our podcast

Les Cast Codeurs Podcast
LCC 289 - Revenge of the Dinosaurs

Les Cast Codeurs Podcast

Play Episode Listen Later Dec 10, 2022 91:30


Guillaume and Emmanuel discuss the news from November and December: Spring Boot 3, AWS SnapStart, GitHub Copilot facing a lawsuit… And also in-depth articles: performance, SRE and self-healing, how to use Git, DevOps for decision-makers, age and tech, and other topics besides.
Summary
Recorded on December 9, 2022
Episode download: LesCastCodeurs-Episode–289.mp3
News
Languages
How do you choose your collections? Among the different lists, maps, queues, etc. https://www.baeldung.com/java-choose-list-set-queue-map
- a good reminder of the features of the different collections (discuss the decision diagram) and of the orders of magnitude for insertion, reads, etc.
- Careful: O(n) does not mean slower than O(1), it means that it scales linearly
- Test with the expected data volume
- A good old Object[] traversed every time can be far more efficient (less memory-hungry structure, fewer memory jumps, etc.)
Libraries
Spring Boot 3 is out https://spring.io/blog/2022/11/24/spring-boot–3–0-goes-ga
- Java 17 as the baseline
- Support for GraalVM Native Image (replacing the former Spring Native experiment)
- Improved traceability with Micrometer and Micrometer Tracing
- Jakarta EE 9 minimum and support for Jakarta EE 10
Quarkus is 600 times slower than a competitor, or is it? https://t.co/1c2sFSY9sE
- discusses the link between the results and the environment
- An initial coding error, then a system limit error or two, to arrive at the result
- A good write-up on the methodological approach
Spring Vault 3 https://spring.io/blog/2022/11/28/spring-vault–3–0-goes-ga
- Java 17
- More supported clients, such as the JDK's reactive jtm client
- Support for password versioning in key-value vaults
Cloud
But why is Twitter still running despite all the people who were let go? https://matthewtejo.substack.com/p/why-twitter-didnt-go-down-from-a
- Thanks to the long-running work of SREs: self-healing, caching, monitoring, over-provisioning. So a lot of automation to make the whole thing run “almost” by itself without much human intervention.
- The article was written by one of the SREs who worked in particular on Twitter's cache.
GitHub moves to date-based versioning of its REST API https://github.blog/2022–11–28-to-infinity-and-beyond-enabling-the-future-of-githubs-rest-api-with-api-versioning/
- instead of continuing with a v4, v5, etc., it uses dates such as 2022–12–25
- each of these versions would be supported for at least 2 years
- the version can be specified with a special HTTP header
- no change for the GraphQL API, however
- Stripe goes even further by remembering the version used on the first call and pinning it by default. Calls without an explicit version use that one, and it can be upgraded
Amazon SnapStart for Lambda https://aws.amazon.com/blogs/aws/new-accelerate-your-lambda-functions-with-lambda-snapstart/
- start Lambda functions faster
- Has an Init phase that is executed to prepare the Lambda
- Firecracker VM snapshot, not just CRaC
- Replaces the seeds, the network, and the disk
- reduces startup times
- Tested with Quarkus https://quarkus.io/blog/quarkus-support-for-aws-lambda-snapstart/
- Tested with Micronaut https://twitter.com/sdelamo/status/1597535515758452736?s=46&t=iQ7IEvuv4e4eD1oM-Hi1IA
- And with Spring Boot
Tooling
A little Git tip from Minko Gechev (Mr. Angular) https://twitter.com/mgechev/status/1594758205237706752
- You can run git clone [repo] --depth 1 to say that you only want the latest revision, not the whole history of the repo
- This is handy in CI in particular, to save time when you have a big repo with many revisions
If you are still struggling with Git, this very detailed guide can help. https://github.com/k88hudson/git-flight-rules
- It is a huge “how do I…?” that has even been translated into several languages, including French: https://github.com/k88hudson/git-flight-rules/blob/master/README_fr.md
Run your GitHub Actions locally with the open source project Act https://github.com/nektos/act
- Handy for checking locally that your pipeline works before pushing it to GitHub in production
- Uses Docker under the hood to run each step
- may work with Podman, but this is not guaranteed for now
How to turn any website or web app into a standalone application https://glaforge.appspot.com/article/turning-a-website-into-a-desktop-application
- uses a Chrome feature: creating a shortcut that opens in a “chromeless” window
- works on all OSes
- uses the favicon as the application icon
- the website ends up in your taskbar like a normal application, and you can ALT/CMD-Tab to switch to it, etc.
Architecture
Six patterns for event-driven architectures https://medium.com/wix-engineering/6-event-driven-architecture-patterns-part–1–93758b253f47 from Wix
- Three patterns in this article
- Consume and project: a materialized view, a copy of hot data consumed by many. These views are focused on one consumer. Kafka and CDC in the middle to decouple
- Event-driven end to end: a websocket is used to send the requests. The websocket server copies them into Kafka. The consumers do the job, and a message is sent back via the websocket server. Resilience, decoupling
- K/V store: combined with Kafka, which allows it to be consumed both as a low-latency K/V store and as an event stream
- Potentially interesting, but the use cases are not that easy to understand from this article
Methodologies
An article on the what and why of DevOps (in fact covering the good practices of the moment) https://enix.io/fr/blog/devops-benefices-difficultes/
- techies will not learn much, but it is an article for managers, or rather execs, to help them see the value
- it is often easier to demonstrate value with content from outside the company, which is perceived as neutral
- other articles on Kubernetes for execs https://enix.io/fr/blog/kubernetes-benefices-difficultes/
Amazon and the “working backwards” method for a product https://www.productplan.com/glossary/working-backward-amazon-method/
- imagine the product ready to be released
- write the press release
- evaluate the opportunity (should we build it?)
- discover the solutions for building it and get approval from the decision-makers
- build the roadmap
- build the backlog
Security
1.5 million lines of code in Android are now in Rust https://security.googleblog.com/2022/12/memory-safe-languages-in-android–13.html?m=1
- more and more memory-safe code (Java, Kotlin, Rust)
- But the majority of new code is still Java and C++
- And a correlated drop in memory-safety vulnerabilities (less memory-unsafe code)
- Or is it the code maturing, with fewer vulnerabilities?
- Other efforts: memory-hardening tools for C/C++, fuzzing
- Zero memory vulnerabilities in the Rust code in 2 years, versus an average of 1 / kLOC in the legacy code
- Java -> JNI, Rust - unsafe {} for resource access
Law, society and organization
Do tech dinosaurs start at 40? https://www.linkedin.com/pulse/non-nous-ne-sommes-pas-des-dinosaures-de-la-tech-pass%C3%A9-ramade/
- An interesting comment from Benjamin Marron, who explains “having restricted himself to the technologies at the core of his work, because too much heterogeneous technology watch had exhausted him and reinforced his feeling of being completely obsolete and outdated” https://twitter.com/bmarron/status/1596136098828148736
- median age of devs between 28 and 31 at Google, MS, Facebook
- But 50-year-olds account for 30% of the workforce
- Advantages of seniors: experience; mentoring (communication, interpersonal interaction); Atlassian has a 40-year-old on every team; retention; fewer job changes every 3 years; flexibility: older people's children have left home; they help build products for people of the same age
- not often part of DE&I policies
GitHub Copilot threatened by a lawsuit https://www.infoq.com/news/2022/11/lawsuit-github-copilot/?utm_source=twitter&utm_medium=link&utm_campaign=calendar
- in the United States
- Class action against GitHub Copilot, MS and OpenAI
- Copyright violation, notably of open source licenses
- The premise is that, human or AI, the same responsibility toward the license applies
- Discussion around fair use vs breach of contract, DMCA, etc.
- Software piracy on an unprecedented scale
- Could have strong consequences for AI and its use of open sources to build content
- And Antonio will have to go back to coding by hand
Beginners' corner
Different methods of string interpolation in Java https://www.baeldung.com/java-string-interpolation
- concatenation with +
- the format() method, often intimidating but more optimized and safer
- StringBuilder, the most flexible, notably in cases with ifs and other variations, and faster, but less safe than format
- MessageFormat for user-facing strings (multi-language)
- Apache Commons (not sure it gets much use with modern JDKs)
Conferences
The list of conferences from Developers Conferences Agenda/List by Aurélie Vache and contributors:
- December 1, 2022: Devops DDay #7 - Marseille (France)
- December 2, 2022: BDX I/O - Bordeaux (France)
- December 2, 2022: DevFest Dijon 2022 - Dijon (France)
- December 14–16, 2022: API Days Paris - Paris (France) & Online
- December 15–16, 2022: Agile Tour Rennes - Rennes (France)
- January 19, 2023: Archilocus - Bordeaux (France)
- January 19–20, 2023: Touraine Tech - Tours (France)
- January 25–28, 2023: SnowCamp - Grenoble (France)
- February 2, 2023: Very Tech Trip - Paris (France)
- February 2, 2023: AgiLeMans - Le Mans (France)
- February 9–11, 2023: World AI Cannes - Cannes (France)
- February 16–19, 2023: PyConFR - Bordeaux (France)
- March 7, 2023: Kubernetes Community Days France - Paris (France)
- March 23–24, 2023: SymfonyLive Paris - Paris (France)
- March 23–24, 2023: Agile Niort - Niort (France)
- April 1–2, 2023: JdLL - Lyon 3e (France)
- April 5–7, 2023: FIC - Lille Grand Palais (France)
- April 12–14, 2023: Devoxx France - Paris (France)
- May 10–12, 2023: Devoxx UK - London (UK)
- May 12, 2023: AFUP Day Lille & Lyon (France)
- May 25–26, 2023: Newcrafts Paris - Paris (France)
- June 29–30, 2023: Sunny Tech - Montpellier (France)
- October 12–13, 2023: Volcamp 2023 - Clermont Ferrand (France)
Contact us
To react to this episode, come and discuss on the Google group https://groups.google.com/group/lescastcodeurs
Contact us via Twitter https://twitter.com/lescastcodeurs
Do a crowdcast or a crowdquestion
Support Les Cast Codeurs on Patreon https://www.patreon.com/LesCastCodeurs
All the episodes and all the info at https://lescastcodeurs.com/
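As an illustration of the date-based REST API versioning mentioned in the Cloud notes above, here is a minimal sketch of pinning a request to a dated version through an HTTP header. The header name `X-GitHub-Api-Version` and the date `2022-11-28` follow GitHub's announcement as best I understand it; the helper name `build_versioned_request` and the endpoint are illustrative assumptions, and the request is only built, never sent.

```python
from urllib.request import Request

# Header used to select a dated REST API version (assumed from GitHub's announcement).
API_VERSION_HEADER = "X-GitHub-Api-Version"


def build_versioned_request(url: str, api_version: str) -> Request:
    """Build (but do not send) a GET request pinned to a dated API version."""
    return Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            API_VERSION_HEADER: api_version,
        },
        method="GET",
    )


if __name__ == "__main__":
    # Illustrative endpoint only; any REST endpoint would be pinned the same way.
    req = build_versioned_request("https://api.github.com/octocat", "2022-11-28")
    # urllib stores header names via str.capitalize(), hence the lookup key.
    print(req.get_header(API_VERSION_HEADER.capitalize()))
```

Pinning the version explicitly in the client, rather than relying on a server-side default, keeps behavior stable when a newer dated version is released.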

Ship It! DevOps, Infra, Cloud Native
Red Hat's approach to SRE

Ship It! DevOps, Infra, Cloud Native

Play Episode Listen Later Dec 8, 2022 67:46 Transcription Available


Narayanan Raghavan leads the global SRE organization that runs Red Hat managed cloud services, including OpenShift Dedicated, Azure Red Hat OpenShift, Red Hat OpenShift Service on AWS, and Red Hat OpenShift Data Science, among others, across the three major cloud providers: AWS, GCP & Azure. We start with a high-level discussion about DevOps, SRE & platform engineering, and then we dig into SRE specifics, including what it takes to safely roll out updates across many tens of thousands of OpenShift clusters.

Changelog Master Feed
Red Hat's approach to SRE (Ship It! #82)

Changelog Master Feed

Play Episode Listen Later Dec 8, 2022 67:46 Transcription Available


Narayanan Raghavan leads the global SRE organization that runs Red Hat managed cloud services, including OpenShift Dedicated, Azure Red Hat OpenShift, Red Hat OpenShift Service on AWS, and Red Hat OpenShift Data Science, among others, across the three major cloud providers: AWS, GCP & Azure. We start with a high-level discussion about DevOps, SRE & platform engineering, and then we dig into SRE specifics, including what it takes to safely roll out updates across many tens of thousands of OpenShift clusters.

Getup Kubicast
#110 - Getting Beaten Up by Kubernetes

Getup Kubicast

Play Episode Listen Later Dec 8, 2022 56:44


So that you don't end up thinking only Google understands Kubernetes and give up at the first fall, follow this recommendation: don't skip steps! Foundational knowledge is your bedrock for continuing to grow with the platform.
In this Kubicast, you'll hear more on the subject and about the experience of Leonardo D. Lourenço, a really great guy who traded an administrative career for one as an SRE.
The show's RECOMMENDATIONS:
- Cyberpunk: Mercenários (anime available on Netflix)
- Narco-Santos (series available on Netflix)
- Toc Toc (film available on Netflix)
- Kubicast #96 - Back to basic com Mateus Prado
Kubicast is produced by Getup, the only Brazilian company 100% focused on and specialized in Kubernetes. All episodes of the podcast are available on the Getup website and on the major digital audio platforms. Some of them are also on YouTube.

Hipsters Ponto Tech
Observability at Itaú – Hipsters Ponto Tech #334

Hipsters Ponto Tech

Play Episode Listen Later Dec 6, 2022 44:08


Today's topic on Hipsters.Tech is observability beyond logs, and for this conversation we invited the team from Itaú, one of the largest companies in Brazil and the world! In this chat we'll look at how the bank handles its large volume of logs, and which approaches, metrics, and resources it uses to analyze performance, protect data, and act preventively through observability. Come see who's joining us!

Screaming in the Cloud
Multi-Cloud in Sanity with Simen Svale Skogsrud

Screaming in the Cloud

Play Episode Listen Later Dec 6, 2022 34:34


About Simen
Ever since he started programming simple games on his 8-bit computer back in the day, Simen has been passionate about how software can deliver powerful experiences. Throughout his career he has been a sought-after creator and collaborator for companies seeking to push the envelope with their digital end-user experiences.
He co-founded Sanity because the state-of-the-art content tools were consistently holding him, his team and his customers back in delivering on their vision. He is now serving as the CTO of Sanity.
Simen loves mountain biking and rock climbing with child-like passion and unwarranted enthusiasm. Over the years he has gotten remarkably good at going over the bars without taking serious damage.
Links Referenced:
Sanity: https://www.sanity.io/
Simen's Twitter: https://twitter.com/svale/
Slack community for Sanity: https://slack.sanity.io/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is brought to us by our friends at Pinecone. They believe that all anyone really wants is to be understood, and that includes your users. AI models combined with the Pinecone vector database let your applications understand and act on what your users want… without making them spell it out. Make your search application find results by meaning instead of just keywords, your personalization system make picks based on relevance instead of just tags, and your security applications match threats by resemblance instead of just regular expressions. Pinecone provides the cloud infrastructure that makes this easy, fast, and scalable. Thanks to my friends at Pinecone for sponsoring this episode. 
Visit Pinecone.io to understand more.
Corey: This episode is brought to you in part by our friends at Veeam. Do you care about backups? Of course you don't. Nobody cares about backups. Stop lying to yourselves! You care about restores, usually right after you didn't care enough about backups. If you're tired of the vulnerabilities, costs, and slow recoveries when using snapshots to restore your data, assuming you even have them at all living in AWS-land, there is an alternative for you. Check out Veeam, that's V-E-E-A-M, for secure, zero-fuss AWS backup that won't leave you high and dry when it's time to restore. Stop taking chances with your data. Talk to Veeam. My thanks to them for sponsoring this ridiculous podcast.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's guest is here to tell a story that I have been actively searching for, for years, and I have picked countless fights in pursuit of it. And until I met today's guest, I was unconvinced that it actually exists. Simen Svale is the co-founder and CTO of a company called Sanity. Simen, thank you for joining me. What is Sanity? What do you folks do over there?
Simen: Thank you, Corey. Thank you. So, we used to be this creative agency that came in as, kind of... we would, kind of, Black Hawk Down into a company and help them innovate, and that would be our thing. And these were usually content projects, like media companies, corporate communication, these kinds of companies; we would be coming in and we would develop some ideas with them. And they would love those ideas and then invariably, we wouldn't ever be able to do those ideas because we couldn't change the workflows in their CMS, we couldn't extend their content models, we couldn't really do anything meaningful.
So, then we would end up setting up separate tools next to those content tools and they would invariably get lost and never be used after a while. So, we were like, we need to solve this problem, we need to solve it at the source. 
So, we decided we wanted a new kind of content platform. It would be a content platform consisting of two parts. There will be the, kind of, workspace where you create the content and do the workflows and all that, that will be like an open-source project that you can really customize and build the exact workspace that you need for your company.
And then on the other side, you would have this, kind of, content cloud, we call it the content lake. And the point with this is that very often you bring in several different sources: you have your content that you create specifically for a project, but very often you have content from an ERP system, availability of products, time schedules. Let's say you're a real estate agent; you have data about your properties that come from other systems. So, this is a system to bring all that together. And then another thing that kind of really frustrated me was that content systems had content APIs, and content APIs are really particularly, and specifically, about a certain way of using content, whereas we thought content is just data.
It should be data, and the API should be a database query language. So, these are, kind of, the components of Sanity: it's a very customizable workspace for working with content and running your content workflows. And it's this content lake, which is this, kind of, cloud for your content.
Corey: The idea of a content lake is fascinating, on some level, where it goes beyond the data lake story, which I've always found to be a little on the weird side when cloud companies get up and talk about this. I remember this distinctly a few years ago at a re:Invent keynote, that Andy Jassy, then the CEO of AWS, got up and talked about customers' data lakes, and here's tools for using that. 
And I mentioned it to one of my clients, and they looked at me like I was a very small, very simple child and said, “Yeah, that would be great, genius, if we had a data lake, but we don't.” It's like, “You… you have many petabytes of data hanging out in S3. What do you think that is?” “Oh, that's just the logs and the assets and stuff.” It's… yeah.
Simen: [laugh].
Corey: So, it turns out that people don't think about what they have in the same terms, and meeting customers with their terms is challenging. Do you find that people have an idea of what a content cloud or a content lake is before you talk to them about it?
Simen: I mean, that's why it took us some time to come up with the word content lake. But we realized, like, our thinking was, the content lake is where you bring all your content to make it queryable and to make it deliverable. So that's, like... you should think, like, as long as I need to present this to end-users, I need to bring it into the content lake. And it's kind of analogous to a data lake. Of course, if you can't curate your data in the data lake, it isn't a data lake, even if you have all the data there. You have to be able to analyze it and deliver it in the format you need it.
So, it's kind of an analogy for the same kind of thinking. And a crux of a content lake is it gives you one, kind of, single API that works for all of your content sources. It kind of brings them all in together in one umbrella, which is, kind of, the key here, that teams can then leverage that without learning new APIs and without ordering up new APIs from the other teams.
Corey: The story that really got me pointed in your direction is when a mutual friend of ours looked at me and said, “Oh, you haven't talked to them yet?” Because it was in response to a story I've told repeatedly, at length, at anyone who will listen, and by that I include whoever happens to be unfortunate enough to share an elevator ride with me. I'll talk to strangers about this, it doesn't matter. 
And my argument has been for a long time that multi-cloud, in the sense of, “Oh yeah, we have this one workload and we can just seamlessly deploy it anywhere,” is something that is like cow tipping, as Ben Kehoe once put it, in that it doesn't exist and you know it doesn't exist because there are no videos of it happening on YouTube. There are no keynote stories where someone walks out on stage and says, “Oh, yeah, thanks for this company's great product, I had my thing that I built entirely on AWS, and I can seamlessly flip a switch, and now it's running on Google Cloud, and flip the switch again, and now it's running on Azure.”
And the idea is compelling, and there are very rarely individual workloads that are built from the beginning to be able to run like that, but it takes significant engineering work. And in practice, no one ever takes advantage of that optionality in most cases. It is vanishingly rare. And our mutual friend said, “Oh, yeah. You should talk to Simen. He's done it.”
Simen: [laugh]. Yeah.
Corey: Okay, shenanigans on that, but why not? I'm game. So, let me be very direct. What the hell have you done?
Simen: [laugh]. So, we didn't know it was hard until I saw his face when I told him. That helps, right? Like, ignorance is bliss. What we wanted was, we were blessed with getting very, very big enterprise customers very early in our startup journey, which is fantastic, but also very demanding.
And one thing we saw was, either for compliance reasons or for, kind of, strategic partnership reasons, there were reasons that big, big companies wanted to be on specific service providers. And in a sense, we don't care. Like, we don't want to care. We want to support whatever makes sense. 
And we are very, let's call it, principled architects, so actually, like, the lower levels of Sanity don't know they are part of Sanity, they don't even know about customers.
Like, we had already the, kind of, separation of concerns that makes the lower... the, kind of, workload-specific systems of Sanity not know a lot of what they are doing. They are basically just, kind of, processing content, CDN requests, and just doing that, no idea about billing or anything like that. So, when we saw the need for that, we thought, okay, that means we have the, what we call the color charts, which is, kind of, the light bulbs, the ones we can have... we have hundreds and hundreds of them and we can just switch them off and the service still works. And then there's the control plane that is, kind of, the admin interface that users use to administer the resources. We wanted customers to just be able to then say, “I want this workload, this kind of content store, to run on Azure, and I want this one on Google Cloud.” I wanted that to feel the same way regions do. Like, you just choose that and we'll migrate it to wherever you want it. And of course, charge you for that privilege.
Corey: Even that is hard to do because when companies say, “Oh, yeah, we didn't have a multi-cloud strategy here,” it's, okay, if your multi-cloud strategy involves “we have to have this thing on multiple clouds,” okay, first, as a step one, if you're on AWS (which is where this conversation usually takes place when I'm having this conversation with people, given the nature of what I do for a living), it's, great, first, deploy it to a second AWS region and go active-active between those two. You should, theoretically, have full service and API compatibility between them, which removes a whole bunch of problems. Just go ahead and do that and show us how easy it is. And then for step two, then talk about other cloud providers. 
And spoiler, there's never a step two because that stuff is way more difficult than people who have not done it give it credit for being.

How did you build your application in such a way that you aren't taking individual dependencies on things that only exist in one particular cloud, either in terms of the technology itself or the behaviors? For example, load balancers come up with different inrush times, RDS instances provision databases at different speeds with different guarantees around certain areas across different cloud providers. At some point, it feels like you have to go back to the building blocks of just rolling everything yourself in containers and taking only internal dependencies. How do you square that circle?

Simen: Yeah, I think it's a good point. Like, I guess we had a fear of—my biggest fear in terms of single cloud was just that leverage you provide your cloud provider if you use too many of those kinds of super-specific services, the ones that only they run. Like, so it was, our initial architecture was based on the fact that we would be able to migrate, like, not necessarily multi-cloud, just, if someone really ups the price or behaves terribly, we can say, “Oh, yeah. Then we'll leave for another cloud provider.” So, we only use super generic services, like queue services, blob services, these are pretty generic across the providers.

And then we use generic databases like Postgres or Elastic, and we run them pretty generically. So, anyone who can provide, like, a Postgres-style API, we can run on that. We don't use any exotic features. Let's say, picking boring technologies was the most, kind of, important choice. And then this also goes into our business model because we are a highly integrated database provider.

Like in one sense, Sanity is a content database with this weird go-to-market. Like, people think of us as a CMS, but it is actually the database we charge for.
So also, we can't use these very highly integrated services because that's our margin. Like, we want that money, right [laugh]? So, we create that value and then we build that on very simple, very basic building blocks if that makes sense.

So, when we wanted to move to a different cloud, everything we needed access to, we could basically build a platform inside Azure that looks exactly like the one we built inside Google, to the applications.

Corey: There is something to be said for the approach of using boring technologies. Of course, there's also the story of, “Yeah, I use boring technologies.” “Like what?” “Oh, like, Kubernetes,” is one of the things that people love to say. It's like, “Oh, yes.”

My opinion on Kubernetes historically has not been great. Basically, I look at it as if you want to cosplay working at Google but can't pass their technical screen, then Kubernetes is the answer for you. And that's more than a little unfair. And starting early next year, I'm going to be running a production workload myself in Kubernetes, just so I can make fun of it with greater accuracy, honestly, but I'm going to learn things as I go. It is sort of the exact opposite of boring.

Even my early experiments with it so far have been, I guess we'll call it unsettling as far as some of the non-deterministic behaviors that have emerged and the rest. How did you go about deciding to build on top of Kubernetes in your situation? Or was it one of those things that just sort of happened to you?

Simen: Well, we had been building microservice-based products for a long time internal to our agency, so we kind of knew about all the pains of coordinating, orchestrating, scaling those—

Corey: “We want to go with microservices because we're tired of being able to find the problem. We want this to be much more of an exciting murder mystery when something goes down.”

Simen: Oh, I've heard that. But I think if you carve up the services the right way, every service becomes simple.
It's just so much easier to develop, to reason about. And I've been involved in so many monoliths before that, and then every refactor is, like, guts on the table, it's, like, a month-long, kind of, ordeal, super high risk. With the microservices, everything becomes a simple, manageable affair.

And you can basically rebuild your whole stack service by service. And you can do—like, it's a realistic thing. Like, you—because all of them are pretty simple. But it's kind of complicated when they are all running inside instances, there's crosstalk with configuration, like, you change the library, and everything kind of breaks. So, Docker was obvious.

Like, Docker, that kind of isolation, being able to have different images but sharing the machine resources was amazing. And then, of course, Kubernetes being about orchestrating that made a lot of sense. But that was also compatible with a few things that we had already discovered. Because workloads in Kubernetes need to be incredibly boring. We talk about boring stuff, like, if you, for example—in the beginning, we had services that start up, they do some, kind of, sanity check, they validate their environment and then they go into action.

That in itself breaks the whole experience because what you want a Kubernetes-based service to do is basically just do one thing all the time in the same way, use the same amount of memory, the same amount of resources, and just do that one thing at that rate, always. So, we broke apart those things, even the same service runs in different containers, depending on their state. Like, this is the state for doing the sanity check, this is the state for [unintelligible 00:13:05], this is the state for doing mutations. Same service. So, there's ways about that.

I absolutely adore the whole thing. It saved—like, I haven't heard about those pains we used to have in the past ever again. But also, it wasn't an easy choice for me because my single SRE at the time said, like, it was either Kubernetes or he'd quit.
So, it was a very simple decision.

Corey: Exactly. The resume-driven development is very much a thing. I'm not one to turn up my nose at that; that's functionally what I've done my entire career. How long had your product been running in an environment like that before, “Well, we're going multi-cloud,” was on the table?

Simen: So, that would be three-and-a-half years, I think, yeah. And then we started building it out in Azure.

Corey: That's a sizable period of time in the context of trying to understand how something works. If I built something two months ago, and now I have to pick it up and move it somewhere else, that is generally a much easier task as far as migrations go than if the thing has been sitting there for ten years. Because whenever you leave something in an environment like that, it tends to grow roots and takes a number of dependencies, both explicit and implicit, on the environment in which it runs. Like, in the early days of AWS, you sort of knew that local disks on the instances were ephemeral because in the early days, that was the only option you had. So, every application had to be written in such a way that it did not presume that there was going to be local disk persistence forever.

Docker containers take that a significant step further. Where when that container is gone, it's gone. There is no persistent disk there without some extra steps. And in the early days of Docker, that wasn't really a thing either. Did you discover that you'd taken a bunch of implicit dependencies like that on the original cloud that you were building on?

Simen: I'm an old-school developer. I hail all the way back to C. And in C, you need to be incredibly, incredibly careful with your dependencies because you basically—your whole dependency mapping is happening inside of your mind. The language doesn't help you at all.
So, I'm always thinking about my kind of project as, kind of, layers of abstraction.

If someone talks to Postgres during a request when requests are supposed to be handled in the index, then I'm [laugh] pretty angry. Like, that breaks the whole point. Like, the whole point is that this service doesn't need to know about Postgres. So, we have been pretty hardcore on, like, not having any crosstalk, making sure every service just knows about—like, we had a clear idea which services were allowed to talk to which services. And we were using JWT tokens internally to make sure that authentication and the rights management was just handled on the ingress point and just passed along with records.

So, no one was able to talk to user stores or authentication services. That always all happens on the ingress. So, in essence, it was a very pure, kind of, layered platform already. And then, like I said, also then built on super boring technologies. So, it wasn't really a dramatic thing.

The drama was more that we didn't maybe, like [laugh] like these sort of cloud services that much. But as you grow older in this industry, you kind of realize that you just hate the technologies differently. And some of the time, you hate a little bit less than others. And that's just how it goes. That's fine. So, that was the pain. We didn't have a lot of pain with our own platform because of these things.

Corey: It's so nice watching people who have been around in the ecosystem for long enough to have made all the classic mistakes and realized, oh, that's why common wisdom is what common wisdom is because generally speaking, that shit works, and you learn it yourself from first principles when you decide—poorly, in most cases—to go and reimplement things. Like oh, DNS goes down a lot, so we're just going to rsync around an /etc/hosts file on all of our Linux servers. Yeah, we tried that collectively back in the '70s. It didn't work so well then, either.
But every once in a while, some startup founder feels the need to speed-run learning those exact same lessons.

What I'm picking up from you is a distinct lack of the traditional startup founder vibe of, “Oh well, the reason that most people don't do things this way is because most people are idiots. I'm smarter than they are. I know best.” I'm getting the exact opposite of that from you where you seemed to wind up wanting to stick to things that are tried and true and, as you said earlier, not exciting.

Simen: Yeah, at least for these kinds of [unintelligible 00:17:15]. Like, so we had a similar platform for our customers that we, kind of, used internally before we created Sanity, and when we decided to basically redo the whole thing, but for kind of a self-serve thing and make a product, I went around the developer team and I just asked them, like, “In your experience, what systems that we use are you not thinking about, like, or not having any problems with?” And, like, just make a list of those. And there was a short list that are pretty well known. And some of them have turned out, at the scale we're running now, pretty problematic still.

So, it's not like it's all roses. We picked Elasticsearch for some things and that can be pretty painful. I'm on the market for a better indexing service, for example. And then sometimes you get—let's talk about some mistakes. Like, sometimes you—I still am totally on the microservices train, and if you make sure you design your workloads clearly and have a clear idea about the abstractions and who gets to talk to who, it works.

But then if you make a wrong split—so we had a split between a billing service and a, kind of, user and resource management service that now keep talking back and forth all the time. Like, they have to know about what each other is. And it's like, if two services need to know about each other reciprocally, like, then you're in trouble, then those should be the same service, in my opinion.
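[The rule of thumb above, that two services which need to know about each other are a sign of a wrong split, can be checked mechanically. The Go sketch below assumes a hypothetical map from each service to the services it is allowed to call; the service names are illustrative and do not describe Sanity's real topology.]

```go
package main

import (
	"fmt"
	"sort"
)

// reciprocalPairs returns every pair of services that call each other.
// deps maps a service name to the services it is allowed to call.
func reciprocalPairs(deps map[string][]string) [][2]string {
	// Build a set-based view so lookups are constant time.
	calls := make(map[string]map[string]bool, len(deps))
	for from, tos := range deps {
		calls[from] = make(map[string]bool, len(tos))
		for _, to := range tos {
			calls[from][to] = true
		}
	}

	var pairs [][2]string
	for from, tos := range calls {
		for to := range tos {
			// from < to reports each pair once and skips self-calls.
			if from < to && calls[to][from] {
				pairs = append(pairs, [2]string{from, to})
			}
		}
	}

	// Map iteration order is random, so sort for stable output.
	sort.Slice(pairs, func(i, j int) bool {
		if pairs[i][0] != pairs[j][0] {
			return pairs[i][0] < pairs[j][0]
		}
		return pairs[i][1] < pairs[j][1]
	})
	return pairs
}

func main() {
	// Hypothetical topology echoing the billing/resource example above.
	deps := map[string][]string{
		"ingress":   {"billing", "resources"},
		"billing":   {"resources"},
		"resources": {"billing"},
	}
	for _, p := range reciprocalPairs(deps) {
		fmt.Printf("%s <-> %s\n", p[0], p[1])
	}
}
```

[A check like this only catches direct two-way dependencies; longer cycles would need a proper graph traversal.]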
Or you can split it some other way. So, this is stuff that we've been struggling with.

But you're right. My last, kind of, rah-rah thing was Rails and Ruby, and then when I weaned off of that, I was like, these technologies work for me. For example, I use Golang a lot. It's a very ugly language. It's very, very useful. You can't argue against the productivity you have in Go, but also the syntax is kind of ugly. And then I realized, like, yeah, I kind of hate everything now, but also, I love the productivity of this.

Corey: This episode is sponsored in part by our friends at Uptycs, because they believe that many of you are looking to bolster your security posture with CNAPP and XDR solutions. They offer both cloud and endpoint security in a single UI and data model. Listeners can get Uptycs for up to 1,000 assets through the end of 2023 (that is next year) for $1. But this offer is only available for a limited time on UptycsSecretMenu.com. That's U-P-T-Y-C-S Secret Menu dot com.

Corey: There's something to be said for having been in the industry long enough to watch today's exciting new thing become tomorrow's legacy garbage that you've got to maintain and support. And I think after a few cycles of that, you wind up becoming almost cynical and burned out on a lot of things that arise that leave everyone breathless. I am generally one of the last adopters of something. I was very slow to get on virtualization. I was a doomsayer on cloud itself for many years.

I turned my nose up at Docker. I mostly skipped the whole Kubernetes thing and decided to be early to serverless, which does not seem to be taking off the way that I wanted it to, so great. It's one of those areas where just having been in the operation side particularly, having to run things and fix them at two in the morning when they inevitably break when some cron job in the middle of the night fires off because no one will be around then to bother. Yeah, great plan.
It really, at least in my case, makes me cynical and tired to the point where I got out of running things in anger.

You seem to have gone a different direction where oh, you're still going to build and run things. You're just going to do it in ways that are a lot more well-understood. I think there's a lot of value to that and I don't think that we give enough credit as an industry to people making those decisions.

Simen: You know, I was big into Drum and Bass back in the '90s. I just loved that thing. And then it went away, and then something came that was called dubstep. It's the same thing. And it's just better. It's a better Drum and Bass.

Corey: Oh yeah, the part where it goes doof, doof, doof, doof, doof, doof, doof—

Simen: [laugh]. Exactly.

Corey: Has always been—it's yeah, we call it different things, but the doof, doof, doof, doof, doof music is always there. Yeah.

Simen: Yeah, yeah, yeah. And I think the thing to recognize, you could either be cynical and say, like, you kids, you're just making the same music we did like 20 years ago, or you can recognize that actually it—

Corey: Kids love that, being told that. It's their favorite thing, telling them, “Oh yeah, back when I was your age…” that's how you—that's a signifier of a story that they're going to be riveted to and be really interested in hearing.

Simen: [laugh]. Exactly. And I don't think like that because I think you need to recognize that this thing came back and it came back better and stronger. And I think Mark Twain probably didn't say that history doesn't repeat itself, it rhymes. And this is a similar thing.

Right now I have to contend with the fact that server-based rendering is coming back as a completely new thing, which was like, the thing, always, but also it comes back with new abstractions and new ways of thinking about that and comes back better with better tooling.
And kind of—I think the one thing you can take away from that kind of journey is that you can be stronger by not being excited by shiny new things and not being, kind of, a champion for one specific thing over every other thing. You can just, kind of, see the utility of that. And then when these things come back and they pretend to be new, you can see both the, kind of, tradition of it and maybe see it clearer than most of the people, but also, it's like you said, don't bore the kids because also you should see how it is new, how it is solving new things, and how these kids coming back with the same old thing as a new thing, they saw it differently, they framed it slightly differently, and we are better for it.

Corey: There's so much in this industry that we take from others. We all stand on the shoulders of giants, and I think that is something that is part of what makes this industry so fantastic in different ways. Some of the original computer scientists who built some of the things that everyone takes for granted these days are still alive. It's not like the world of physics, for example, where some of the greats wound up discovering these things hundreds of years ago. No, it's all evolved within living memory.

That means that we can talk to people, we can humanize them, on some level. It's not some lofty great sitting around and who knows what they would have wanted or how they would have intended this. Now, you have people who helped build the TCP stack stand up and say, “Oh yeah, that was a dumb. We did a dumb. We should not have done it that way.” Oh, great.

It's a constant humbling experience watching people evolve things. You mentioned that Go was a really neat language. Back when I wound up failing out of school, before I did that, I took a few classes in C and it was challenging and obnoxious. About like you would expect.
And at the beginning of this year, I did a deep-dive into learning Go over the course of a couple days, enough to build a binary that winds up controlling my internet camera in my home office.

And I've learned an awful lot about how to do things and got a lot of things wrong, and it was a really fun language. It was harder to do a lot of the ill-considered things that get people into trouble with C.

Simen: Hmm.

Corey: The idea that people are getting nice things in a way that we didn't have them back when we were building things the first time around is great. If you're listening to this, it is imperative—listen to me—it is imperative. Do not email me about Rust. I don't want to hear it.

Simen: [laugh].

Corey: But I love the fact that our tools are now stuff that we can use in sensible ways. These days, as you look at using sensible tools—which in this iteration, I will absolutely say that using a hyperscale public cloud provider is the right move; that's the way to go—do you find that, given that you started over hanging out on Google Cloud, and now you're running workloads everywhere, do you have an affinity for one as your primary cloud, or does everything you've built wind up seamlessly flowing back and forth?

Simen: So, of course, we have a management interface that our end-users, kind of, use to monitor, and it has to be—at least has to have a home somewhere, even though the data can be replicated everywhere. So, that's in Google Cloud because that's where we started. And also, I think GCP is what our team likes the most. They think it's the most solid platform.

Corey: Its developer experience is far and away the best of all the major cloud providers. Bar none. I've been saying that for a while. When I first started using it, I thought I was going to just be making fun of it, but “this is actually really good” was my initial impression, and that impression has never faded.

Simen: Yeah. No, it's like it's terrible, as well, but it's the least terrible platform of them all.
But I think we would not make any decisions based on that. As long as it's solid, as long as it's stable, and as long as, kind of, price is reasonable and business practices are, kind of, sound, we would work with any provider. And hopefully, we would also work with less… let's call it less famous, more niche providers in the future to provide, let's say, specific organizations that need very, very specific policies or practices, we will be happy to support. I want to go there in the future. And that might require some exotic integrations and ways of building things.

Corey: A multi-cloud story that I used to tell—in the broader sense—used PagerDuty as an example because that is the service that does one thing really well, and that is wake you up when something sends the right kind of alert. And they have multiple cloud providers historically that they use. And the story that came out of it was, yeah, as I did some more digging into what they've done and how they talked about this, it's clear that the thing that wakes you up in the middle of the night absolutely has to work across a whole bunch of different providers because if it's on one, what happens when that's the one that goes down? We learned that when AWS took an outage in 2011 or 2012, and PagerDuty went down as a result of that. So, the thing that wakes you up absolutely lives in a bunch of different places on a bunch of different providers.

But their marketing site doesn't have to. Their user control panel doesn't have to. If there's an outage in their primary cloud that is sufficiently gruesome, okay, they can have a degraded mode where you're not able to update and set up new alerts and add new users into your account because everything's on fire in those moments anyway, that's an acceptable trade-off. But the thing that wakes you up absolutely must work all the time.
So, it's the idea of this workload has got to live in a bunch of places, but not every workload looks like that.

As you look across the various services and things you have built that comprise a company, do you find that you're biasing for running most things in a single provider or do you take that default everywhere approach?

Simen: No, I think that to us, it is—and we're not—that's something we haven't—work we haven't done yet, but architecturally, it will work fine. Because as long as we serve queries, like, we have to—like components, like, people write stuff, they create new content, and that needs to be up as much as possible. But of course, when that goes down, if we still serve queries, their properties are still up, right? Their websites or whatever is still serving content.

So, if we were to make things kind of cross-cloud redundant, it would be the CDN, like, indexes and the Varnish caches and have those [unintelligible 00:27:23]. But it is a challenge in terms of how you do routing. And let's say the routing provider is down. How do you deal with that? Like, there's been a number of DNS outages and I would love to figure out how to get around that. We just, right now, people would have to manually, kind of, change their—we have backup ingress points with the—yeah, that's a challenge.

Corey: One of the areas where people get into trouble with multi-cloud as well, that I've found, has been that people do it with that idea of getting rid of single points of failure, which makes a lot of sense. But in practice, what so many of them have done is inadvertently added multiple points of failure, all of which are single-tracked. So okay, now we're across two cloud providers, so we get exposure to everyone's outages, is how that winds up looking. I've seen companies that have been intentionally avoiding AWS because great, when they go down and the internet breaks, we still want our store to be up.
Great, but they take a dependency on Stripe who is primarily in AWS, so depending on the outage, people may very well not be able to check out of their store, so what did they gain by going to another provider? Because now when that provider goes down, their site is down then too.

Simen: Mmm. Yeah. It's interesting that anything works at all, actually, like, seeing how intertwined everything is. But I think that is, to me, the amazing part, like you said, someone's marketing site doesn't have to be moved to the cloud, or maybe some of it does. And I find it interesting that, like, in the serverless space, even if we provide a very—like, we have super advanced engineers and we do complex orchestration over cloud services, we don't run anything else, right?

Like, all of our, kind of, web properties are run with highly integrated, basically on Vercel, mostly, right? Like we don't want to know about—like, we don't even know which cloud that's running on, right? And I think that's how it should be because most things, like you said, most things are best outsourced to another company and have them worry, like, have them worry when things are going down. And that's how I feel about these things that, yes, you cannot be totally protected, but at least you can outsource some of that worry to someone who really knows what—like, if Stripe goes down, most people don't have the resources to worry at the level that Stripe would worry, right? So, at least you have that.

Corey: Exactly. Yeah, if you ignore the underlying cloud provider stuff, they do a lot of things I don't want to have to become an expert in. Effectively, you wind up getting your payment boundary through them; you don't have to worry about PCI yourself at all; you can hand it off to them. That's value.

Simen: Exactly. Yeah.

Corey: Like, the infrastructure stuff is just table stakes compared to a lot of the higher up the stack value that companies in that position enjoy.
Yeah, I'm not sitting here saying don't use Stripe. I want to be very clear on that.

Simen: No, no, no. No, I got you. I got you. I just remember, like, so we talked about maybe you hailing all the way back to Seattle, so hail all the way back to having your own servers in a, kind of, place somewhere that you had to drive to, to replace a security card because when the hard drive was down. Or like, oh, you had to scale up and now you have to buy five servers, you have to set them up and drive them to the—and put them into the slots.

Like, yes, you can fix any problem yourself. Perfect. But also, you had to fix every problem yourself. I'm so happy to be able to pay Google or AWS or Azure to have that worry for me, to have that kind of redundancy on hand. And clearly, we are down less time now that we have less control [laugh] if that makes sense.

Corey: I really want to thank you for being so generous with your time. If people want to learn more, where's the best place for them to find you?

Simen: So, I'm at @svale—at Svale—on Twitter, and my DMs are open. And also we have a Slack community for Sanity, so if you want to kind of engage with Sanity, you can join our Slack community, and that will be on there as well. And you find it in the footer on all of the sanity.io webpages.

Corey: And we will put links to that in the show notes.

Simen: Perfect.

Corey: Thank you so much for being so generous with your time. I really appreciate it.

Simen: Thank you. This was fun.

Corey: Simen Svale, CTO and co-founder at Sanity. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting comment, and make sure you put that insulting comment on all of the different podcast platforms that are out there because you have to run everything on every cloud provider.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

What the Dev?
The evolution of the SRE - Episode 191

What the Dev?

Play Episode Listen Later Dec 6, 2022 19:16


In this week's episode of the SD Times "What the Dev?" podcast, editor-in-chief David Rubinstein discusses the evolving role of the site reliability engineer (SRE) within organizations. His guest is Narayanan Raghavan, senior director of site reliability engineering of OpenShift at Red Hat.

PurePerformance
SRE for the non-unicorns (aka Enterprises) with James Brookbank

PurePerformance

Play Episode Listen Later Dec 5, 2022 52:50


You have a CISO (Chief Information Security Officer) but no CRO (Chief Reliability Officer)? You blame people if systems crash? You scale your people at the rate you scale your infrastructure? If you answer any of those questions with YES, then you should tune in to this podcast, as you are probably struggling to adopt Site Reliability Engineering (SRE) in your organization.

James Brookbank, Cloud Solutions Architect, dealt with resiliency topics in a large enterprise prior to joining Google. In our conversation he shares the advice he gives enterprises to convert the excitement about SRE into actual implementation. James gives some good guidance on which projects are good, and not so good, to start with. He gives practical examples of what it means to change your company culture and why there doesn't have to be an SRE for every service.

In our call we discussed the SRE in Enterprise talk at DevOpsDays Boston and SRECon EMEA as well as their recent book. Here are all the relevant links:

James Brookbank on LinkedIn: https://www.linkedin.com/in/jamesbrookbank/
SRECon EMEA Slides: https://www.usenix.org/system/files/srecon22_slides_mcghee.pdf
DevOpsDays Boston 2022 Session Recording: https://www.youtube.com/watch?v=__e7b25QOHc
Enterprise Roadmap to SRE Book: https://sre.google/resources/practices-and-processes/enterprise-roadmap-to-sre/

Changelog Master Feed
Let's deploy straight to production! (Ship It! #81)

Changelog Master Feed

Play Episode Listen Later Dec 1, 2022 67:25


In today's episode, we have the pleasure of two guests: Whitney Lee, Staff Technical Advocate at VMware, the one behind the ⚡️ Enlightning episodes, and Mauricio Salatino, whom you already know from...

Ship It! DevOps, Infra, Cloud Native
Let's deploy straight to production!

Ship It! DevOps, Infra, Cloud Native

Play Episode Listen Later Dec 1, 2022 67:25 Transcription Available


In today's episode, we have the pleasure of two guests: Whitney Lee, Staff Technical Advocate at VMware, the one behind the ⚡️ Enlightning episodes, and Mauricio Salatino, whom you already know from...

TestGuild Performance Testing and Site Reliability Podcast
Performance/SRE Awesomeness in Latin America

TestGuild Performance Testing and Site Reliability Podcast

Play Episode Listen Later Nov 29, 2022 23:28


In this episode, I'll speak with 12 speakers (Federico Toledo, Mark Tomlinson, Leandro Melendez, Henrik Rexed, Anisbert Suárez, Roger Abelenda, Laura Gayo, Nicolás Paez, Lucía Lavagna, Gerencia, Vera Babat, Mercedes Quintero Martínez & Andy Hohenner) from the upcoming Quality Sense Conference in Uruguay on Dec 9th. Many of the sessions taking place cover Performance and SRE-related topics. I had planned on being there myself, but I have other commitments I need to honor, BUT it's not too late for you. After listening to this episode to the end, go to QualitySenseConf.com and register for free to attend this conference in person or watch it online. Discover more about OpenTelemetry, performance testing, measure and metrics, Chaos Engineering, and more.

TestGuild Performance Testing and Site Reliability Podcast
What is Platform Engineering? with Evan Niedojadlo

TestGuild Performance Testing and Site Reliability Podcast

Play Episode Listen Later Nov 22, 2022 29:55


What is Platform Engineering? In this episode, Evan Niedojadlo, an engineer at Peddle, shares his real-world experience with Platform Engineering. Discover if Platform Engineering is yet another extension of DevOps practices, the difference from SRE, Platform Metrics, the future of platform engineering, and more.

Let's Talk New York
#132 An English journey to a tech position

Let's Talk New York

Play Episode Listen Later Nov 20, 2022 51:15


Thiago Ghisi has over 15 years of experience in the software industry, working for large corporations and startups in Brazil and the US as a Software Engineer, QA, Project Manager, SRE, Agile Consultant, and lately as an Engineering Manager and Director. In this special episode, 100% in English, he shares the story of how he got a job in the tech industry in the United States, especially how early on he realized English would play a huge role in his career and which steps he took to learn and improve his English. Plus, he also shares some career advice.

EPISODE LINKS:
Thiago's Twitter: @thiagoghisi
Get US$30 off the TOEFL test: LAURA30 - terms and conditions: https://cirql.me/b/uvffq

Southpaws Podcast
Episode 529 - 28 Gallons

Nov 20, 2022 89:51


Ajax did the math. Twitter continues to burn, Savrin had some fun, BLFC got put into an odd position, Qatar bans beer at the World Cup, and how the heck is it nearly US Thanksgiving? Patreon: Southpaws is creating and promoting The Queer Agenda. LINKS: Mosquito Capital on Twitter: "I've seen a lot of people asking "why does everyone think Twitter is doomed?" As an SRE and sysadmin with 10+ years of industry experience, I wanted to write up a few scenarios that are real threats to the integrity of the bird site over the coming weeks."; Qatar Bans All Beer at 2022 FIFA World Cup (thetakeout.com); Telegram: https://t.me/+Ma4PTE0IsWVmMDQ5

Noticentro
SRE denies that a Mexican fan was detained in Qatar for bringing in alcohol

Nov 19, 2022 1:52


SRE denies that a Mexican fan was detained in Qatar for bringing in alcohol. In a unanimous vote, the Chamber of Deputies approved amendments to the Ley Olimpia. After eight months of Russian occupation, the first train from Kyiv arrived in Kherson. More information in our podcast.

Software Engineering Daily
Collaborative Notebooks for DevOps and SRE with Micha Hernandez

Nov 15, 2022 38:55


The complexity of software infrastructure has been increasing as companies have migrated towards Kubernetes, containers, microservices and other distributed systems. However, the tools around observability and monitoring have not seen much improvement. These tools are usually managed by teams distributed across different locations and time zones, which results in siloing of knowledge of your...

TestGuild Performance Testing and Site Reliability Podcast
Collaborative Notebooks for Debugging Your Infrastructure with Micha Hernandez Van Leuffen

Nov 15, 2022 25:42


Want an easy way to query, visualize, and understand metrics and logs in your infrastructure? In this episode, Micha Hernandez van Leuffen, the founder of Fiberplane, shares a new way to enhance your SRE collaboration. Discover all about collaborative notebooks for resolving incidents, coordinating work related to downtime, how to build up a structured knowledge base, and more. Try Fiberplane yourself: https://studio.fiberplane.com/ Learn more about Fiberplane's features: http://docs.fiberplane.com/

The Changelog
Sonic search, building software like an SRE, leaving the cloud, an HTTP crash course & breaking up with CSS-in-JS

Oct 24, 2022 8:09


Valerian Saliou's Sonic search backend, Brandon Willett on how to build software like an SRE, DHH on why they're leaving the cloud, Amos' HTTP crash course nobody asked for & Sam Magura tells why he and the Spot team are breaking up with CSS-in-JS.