Podcasts about Nagios

Computer system and network monitoring application software

  • 42 PODCASTS
  • 91 EPISODES
  • 44m AVG DURATION
  • INFREQUENT EPISODES
  • LATEST: Feb 11, 2025
Nagios

POPULARITY (chart, 2017 to 2024)


Latest podcast episodes about Nagios

XenTegra XenCast
XenServer 8 Unveiled: Navigating Hypervisor Upgrades and New Features

Feb 11, 2025 (26:52)


Join hosts Bill Sutton, Todd Smith, and Geremy Meyers for episode 175 of the Citrix Session Podcast as they dive into the critical aspects of upgrading to XenServer 8. In this episode, they discuss the impending end-of-life status of Citrix Hypervisor 8.2 CU1, scheduled for June 25th, 2025, highlighting the necessity for users to transition to the newer XenServer 8. The team outlines the enhanced features, including support for Windows 11 with vTPM, the integrated PVS-Accelerator, added support for Nagios and SNMP, and the Xen Conversion Manager, which eases VM conversions. They also share operational tips, such as leveraging a content delivery network for updates and the benefits of infrastructure as code with Terraform support. Whether you're planning a migration or looking to optimize your Citrix workloads, this episode offers practical insights into making the most of XenServer 8's capabilities.

Engineering Kiosk
#132 Prometheus: A Revolution in Monitoring, with Co-Founder Julius Volz

Jul 16, 2024 (76:51)


Monitoring applications in the age of dynamic infrastructure. Cloud here, serverless there, container schedulers over there. With dynamic infrastructure, you no longer know exactly which server and port your application is actually running on. That raises the big question: how do I monitor my application well enough to be sure it works the way I originally intended? The answer often revolves around the de facto standard in cloud-native monitoring: Prometheus. In this episode we talk with Julius Volz, one of the two original authors of Prometheus. We discuss the origin story of Prometheus at SoundCloud, how the system differs from traditional monitoring systems, why PromQL was created as a query language of its own, and which flaws he would like to fix after 12 years of development. Bonus: who still remembers Nagios, Ganglia, or Graphite?

PING
The check is in the (e)Mail(s)

Jun 12, 2024 (38:02)


This time on PING, Philip Paeps from the FreeBSD Cluster Administrators and Security teams discusses their approach to systems monitoring and measurement. It's email. “Short podcast,” you say, but no, there's a wealth of war stories and “why” to explore in this episode. We caught up at the APNIC57/APRICOT meeting held in Bangkok in February of 2024. Philip has a wealth of experience in systems management and security and a long history of participation in the free software movement. So his ongoing support of email as a fundamental measure of system health isn't a random decision; it's based on experience. Mail may not seem like the obvious go-to for a measurement podcast, but Philip makes a strong case that it's one of the best tools available for a high-trust measure of how systems are performing. The first- and second-order derivatives of mail flows capture their velocity and rate of change, which indicate whether an underlying systems issue is continuing or changing. Philip has good examples of how mail from the FreeBSD cluster systems indicates different aspects of systems health, such as network delays and disk issues. He's realistic that there are other tools in the armoury, especially the Nagios and Zabbix systems which are deployed in parallel. But from time to time, the first and best indication of trouble emerges from a review of the behaviour of email. A delightfully simple and robust approach to systems monitoring can emerge from use of the fundamental tools which are part of your core distribution.
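
The idea of reading mail flow through its first- and second-order derivatives can be illustrated with a minimal sketch. This is not taken from the episode or from the FreeBSD cluster's tooling: the hourly counts, the threshold, and the function names `differences` and `first_alert_hour` are all hypothetical, chosen only to show how successive differences expose a change in rate before the raw totals look alarming.

```python
from typing import List, Optional


def differences(series: List[int]) -> List[int]:
    """Return successive differences: series[i + 1] - series[i]."""
    return [b - a for a, b in zip(series, series[1:])]


def first_alert_hour(counts: List[int], threshold: int) -> Optional[int]:
    """Return the index of the first hour whose increase over the previous
    hour is at least `threshold`, or None if no hour qualifies."""
    for hour, change in enumerate(differences(counts), start=1):
        if change >= threshold:
            return hour
    return None


# Hypothetical hourly counts of system mails received from a cluster.
hourly_mail_counts = [40, 42, 41, 43, 60, 95, 150]

# First-order difference: how fast the mail volume is changing per hour.
velocity = differences(hourly_mail_counts)

# Second-order difference: whether that change is itself speeding up.
acceleration = differences(velocity)

print("velocity:    ", velocity)
print("acceleration:", acceleration)
print("first alert hour:", first_alert_hour(hourly_mail_counts, threshold=15))
```

In this made-up series the volume is flat for the first four hours, so the differences hover around zero; once the differences turn large and keep growing, something upstream has probably changed, even if no single mail looks unusual on its own.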

ScanNetSecurity 最新セキュリティ情報
SQL Injection Vulnerability in monitoringwizard.php of Nagios XI (Scan Tech Report)

May 22, 2024 (0:09)


Exploit code that attempts to abuse a Nagios XI vulnerability disclosed in February 2024 has been published.

Giant Robots Smashing Into Other Giant Robots
525: Tech, Public Service, and Serendipity

May 16, 2024 (37:24)


Victoria Guido hosts Robbie Holmes, the founder and CEO of Holmes Consulting Group. The conversation kicks off with Robbie recounting his initial foray into the tech world at a small web hosting company named A1 Terabit.net, chosen for its alphabetical advantage in the white pages. This job was a stepping stone to a more significant role at Unisys, working for the state of New York's Department of Social Services, where Robbie inadvertently ventured into civic tech and public interest technology. Robbie shares his career progression from supporting welfare systems in New York to becoming a technological liaison between the city and state, leading to a deeper involvement in open-source solutions. His journey through tech spaces includes developing websites, diving into the Drupal community, and eventually establishing his consulting business. Robbie emphasizes the serendipitous nature of his career path, influenced significantly by community involvement and networking rather than a planned trajectory. Additionally, Robbie gives insights on the impact of technology in public services and his stint with the U.S. Digital Service (USDS), where he contributed to significant projects like vets.gov. Robbie promotes the value of community engagement in shaping one's career, stressing how connections and being in the right place at the right time can lead to unexpected opportunities and career pivots. Follow Robbie Holmes on LinkedIn (https://www.linkedin.com/in/robbiethegeek/), X (https://twitter.com/RobbieTheGeek), Facebook (https://www.facebook.com/robbiethegeek), Instagram (https://www.instagram.com/robbiethegeek), or GitHub (https://github.com/robbiethegeek). Check out his website at robbiethegeek (https://about.me/robbiethegeek). Follow thoughtbot on X (https://twitter.com/thoughtbot) or LinkedIn (https://www.linkedin.com/company/150727/). 
Transcript: VICTORIA: This is the Giant Robots Smashing Into Other Giant Robots podcast, where we explore the design, development, and business of great products. I'm your host, Victoria Guido. And with me today is Robbie Holmes, Founder and CEO of Holmes Consulting Group. Robbie, thank you for joining me. ROBBIE: I'm so happy to be here. It's great to talk to you, Victoria. VICTORIA: Yes. I have known you for a long time now, but I don't know everything about you. So, I thought I would start with the question: What was your first job that you ever had? ROBBIE: My first technical job, I ended up working for an internet web hosting company called A1 Terabit.net. And note the A1 because it came first in the white pages. It was a really small web hosting company run by a man named [SP] Maxim Avrutsky. I worked there for about six months before I submitted my resume to an online job forum. That's how old I am. And it ended up in the hands of Unisys, where I eventually worked for the state of New York. VICTORIA: Wow [laughs]. So, what a journey that you've been on to get from starting there, and what a marketing ploy back in the day with the white pages. So, tell me a little bit more about how you went from that first job to where you are today with having your own business in consulting. ROBBIE: Yeah, I wasn't even aware that I was jumping into the sort of civic tech space and public interest technology because the job I ended up with was working for New York State in the Department of Social Services. And welfare is federally funded and distributed to states and then states to localities. And New York City and New York State have a weird parasymbiotic relationship because over 50% of the welfare in New York State goes to the five boroughs in New York City. So, so much of my job was supporting the welfare system within the city, which was run by the human resources administration. 
So, that just led to this cascade of me, like, getting invested in supporting that, and then eventually jumping over to the other side where I worked for the City of New York. And at that point, I ended up becoming sort of a technology project manager and almost a tech liaison between the city and state. And I was out in the welfare centers, helping get the job centers up to a new application called the Paperless Office System, which was a client-server app that was a wrapper around welfare. All of that ended up leading to me finally making it to the network operation center for the City of New York, where I started replacing expensive solutions like HP OpenView with open-source solutions like Nagios and another open-source solution that provided an interface. And it really opened my eyes to the idea of open source. And I had really paid attention to a lot of open-source operating systems. So, I was kind of just a general tech nerd. And eventually, I started building websites, and that led me to the Drupal community in New York City, which was sort of this cascade that led me to communities. And I think that's sort of a through line for my entire career is I don't really think I ever had a plan. I think my entire career has been this sort of a lucky happenstance of being prepared when an opportunity arose and sometimes being in the right place because of my connections and community. VICTORIA: That's interesting about being involved with the people around you and seeing what problems are out there to solve and letting that lead you to where your interests lie. And then, following that, naturally led you to, like, this really long career and these really interesting, big projects and problems that you get to solve. ROBBIE: Yeah. And I think one interesting aspect is like, I feel I spent a lot of time worried about what I was going to do and where I was going to do it. I don't have a bachelor's degree. I don't have an advanced degree. 
I have a high school diploma and a couple of years in college. Well, 137 credits, not the right 125 or 124 to have a bachelor's degree. I have enough credits for a couple of minors though, definitely Greek art history, I think mathematics, maybe one more. I just never got it together and actually got my degree. But that was so interesting because it was limiting to what jobs I could find. So, I was in the tech space as an IT person and specifically doing networking. So, I was running the network operation center. I helped, like, create a whole process for how we track tickets, and how we created tickets, and how things were moved along. And, in the process, I started building websites for family and friends. And I built a website for our network operation center, so that way we could have photos to go with our diagrams of the network. So that way, when we were troubleshooting remotely, we could actually pull up images and say, "The cable that's in port six goes off to the router. I think that port is dead. Can we move it to the port two to its right, and I'll activate it?" And that made a really interesting solution for something we weren't even aware we had, which was lack of visibility. So many of the people in the fields were newer or were trying to figure it out. And some of us had really deep knowledge of what was going on in those network rooms and hubs. It led me to this solution of like, well, why don't we just start documenting it and making it easier for us to help when they're in the field? That led me to, like, the Drupal community because I started building sites in the Drupal CMS. And I went to, like, my first Drupal meetup in 2007, and there was, like, five of us around the table. That led to eventually me working for Sony Music and all these other things. But the year before I found my way to the Drupal community, I probably sent out, like, 400 resumes for jobs in the tech space, didn't really get any callbacks. 
And then, I met the community, and I started attending events, and then eventually, I started organizing events. And then, Sony I interviewed and talked to them a couple of times. And then, a friend of mine became the boss. And she contacted me and was like, "Hey, are you in the market?" And I was like, "I don't know. Why? What's up?" And she's like, "I became Doug." And I was like, "What?" And she was like, "I'm now replacing Doug at Sony. I'm running the team." And I was like, "Yeah, I'm happy to talk." And that was the big transition in my career from IT to sort of development and to delivery, right? Like, when it comes right down to it, is I became the manager of interactive media at Sony Music, which was really a job I landed because I was connected to the community, and running events, and getting to know everyone. VICTORIA: Yeah. And I think it's really cool that you had this exposure early on to what you called civic tech, which we'll get into a little bit, and then you went from the community into a commercial technology space and really getting into engineering with Drupal. ROBBIE: Yeah, it was an interesting transition because what they needed at Sony was sort of somebody who could ride the line between systems engineer, database administrator, and Drupal engineer, and also probably pre-DevOps DevOps person. So, I was responsible for all deployments and all tickets that came in. I was sort of both the technical arm of the help desk. When I joined, there was 24 websites on the Drupal platform, and when I left, there was over 200. And we upgraded it from Drupal 5 to Drupal 6 to Drupal 7 while I was there. So, I was heavily involved in all of those updates, and all those upgrades, and all of the deployments of all the new themes, and all the changes to all these sites. 
So, what was great was they, I believe, if I understand it correctly, they actually created a role for me out of, like, two or three jobs because they needed a me, and they didn't have a role that existed. So, all of a sudden, they made a manager of interactive media role. And I was able to work there for two years, sort of being what I jokingly say, like, a digital janitor. I used to say that I had, like, an eight-bit key ring in a push broom. And I was always mad at your kids for trying to break my stuff. VICTORIA: [laughs] That's so good. A digital service janitor [laughs]? The connection for me between that and where I met you in the U.S. digital service space [laughs] I feel like there's a lot of parallels between that and where your career evolved later on in life. ROBBIE: Yeah. What's amazing is I did all this early work in my career in civic tech and didn't realize it was civic tech at the time. I just realized what I was doing was providing this huge impact and was value. You know, I spent a couple of years in the welfare centers, and I used to say all the time that the two hardest jobs in the welfare center are the person applying for welfare and the person deciding whether or not that person gets welfare. So, being a technologist and trying to help make that as simple as possible or easier and smooth the edges off of that process was really important. And it really taught me how important technology is to delivering service. And I really never thought about it before. And then, when I was working for Phase2 technologies, I was a director of Digital Services. And I read in a blog post, I believe that was written by Mikey Dickerson, who was the original administrator for USDS, and he talked about HealthCare.gov. And he walked in the door, and he said, "How do you know HealthCare.gov is down?" And I think there was some allusion to the fact that we were like, we turn on the television and if they're yelling at us, we know it's down. 
And Mikey was like, "We know how to monitor things." So, like, if you don't know Mikey Dickerson, he's the person who sort of created the web application hierarchy of needs in Google. He was an SRE. And his pyramid, like Maslow's hierarchy of needs, was all over Google when I was there. I was so impressed with the idea that, like, we aren't talking about how do we solve this problem? We're talking about knowing when there's a problem. And then, if we know there's a problem, we can put some messaging around that. We can say, like, "We're aware," right? Like if the president calls the secretary, the secretary can say, "We know it's down. We're working on it," which is building up political capital. It's a really amazing process that I kept reading this blog post, and I was like, God, that's how I would approach it. And then, I was like, wow, I wonder if I could use my skills to help America, and very shortly submitted an application and was like, well, we'll see what happens. And about six months later, I walked in the door at the VA and was the eighth employee of the Digital Service team at the VA. That was a franchise team of the USDS model. VICTORIA: And can you say a little bit more about what is the U.S. Digital Service and expand upon your early experience there? ROBBIE: Yeah. So, the United States Digital Service was created after HealthCare.gov had its issues. Todd Park had convinced President Obama to reach out to get support from the private industry. And the few of the people who were there, Todd convinced to stick around and start creating a team that could support if there was this kind of issue in the future. I believe the team that was there on the ground was Mikey Dickerson, Erie Meyer, Haley Van Dyck, and Todd. And there was a few other people who came back or were very close at the beginning, including the current administrator of USDS. She has been around a long time and really helped with HealthCare.gov. 
It's amazing that Mina is back in government. We're very lucky to have her. But what came out of that was what if we were able to stand up a team that was here in case agencies needed support or could vet solutions before these types of problems could exist? So, USDS was what they called the startup inside the White House that was created during the 2014 administration of President Obama. The team started that year, and I joined in May of 2016. So, I would be, like, sort of the beginning of the second team of the VA U.S. Digital Service team. So, USDS supported this idea of tours of duty, where you're a schedule A employee, which meant you were a full-time government employee, but you were term-limited. You could do up to two years of duty and work, and then you could theoretically stick around and do two more years. That was how these roles were envisioned. I think there's lots of reasons why that was the case. But what's nice is it meant that you would come in with fresh eyes and would never become part of the entrenched IT ecosystem. There are people that transition from USDS into government, and I think that's a huge value prop nowadays. It's something that I don't know they were thinking about when the original United States Digital Service was stood up, but it was hugely impactful. Like, I was part of the team at the VA that helped digitize the first form on vets.gov and all the work that was done. When the VA team started, there was a team that was helping with veteran benefits, and they worked on the appeals process for veteran benefits. And I joined. And there was a team that was...eventually, it became dubbed the veteran-facing tools team. And we worked on vets.gov, which was a new front door to expose and let veterans interact with the VA digitally. And over time, all the work that went into the tools and the solutions that were built there, everything was user-researched. 
And all of that work eventually got brought into VA.gov in what they called a brand merger. So, we took, like, the sixth most trafficked front door of the VA and took all the modern solutioning that that was and brought it into VA.gov, the main front door. So, all of a sudden, there was an identity, a login provided on VA.gov for the first time. So big, impactful work that many people were a part of and is still ongoing today. Surprisingly, so much of this work has now fallen under OCTO, which is the Office of the CTO in the VA. And the CTO is Charles Worthington, who was a USDSer who's the epitome of a person who goes where the work is. Charles was a Presidential Innovation Fellow who helped out in the times of HealthCare.gov and, joined USDS and did anything and everything that was necessary. He interviewed engineers. He was a product person. Charles is one of the most unique technologists and civic tech people I've ever met in my life. But Charles, at the end of the Obama administration and in this transition, realized that the VA was in need of someone to fill the CTO role. So, he came over to become the interim CTO because one of the values of USDS is to go where the work is. And he realized, with the transition, that Marina Nitze, who was the CTO who was transitioning out, there was going to be a need for continuity. So, he came in to provide that continuity and eventually became the full-time CTO and has been there ever since. So, he has helped shape the vision of what the VA is working towards digitally and is now...he was just named the Chief AI Officer for the agency. Charles is a great person. He has successfully, you know, shepherded the work that was being done early by some of us into what is now becoming a sort of enterprise-wide solution, and it's really impressive. VICTORIA: I appreciate you sharing that. 
And, you know, I think there's a perception about working for public service or for government, state or federal agencies, that they are bureaucratic, difficult to work with, very slow. And I think that the USDS was a great example of trying to really create a massive change. And there's been this ripple effect of how the government acquires products and services to support public needs, right? ROBBIE: Yeah, I would say there's a couple of arms of the government that were sort of modernization approaches, so you have the Presidential Innovation Fellows, which are the equivalent of, like, entrepreneurs in residence in government. And they run out of...I think they're out of the TTS, the Technology Transformation Service over at GSA, which is the General Services Administration. But the PIFs are this really interesting group of people that get a chance to go in and try to dig in and use their entrepreneurial mindset and approach to try to solve problems in government. And a lot of PIFs work in offices. Like, Charles' early team when he first became the CTO included a lot of Presidential Innovation Fellows. It was basically like, "Hey, the VA could use some support," and these people were available and were able to be convinced to come and do this work. And then, you have the Presidential Management Fellows, which I think is a little bit more on the administration side. And then, we have 18F and USDS. The United States Digital Service is a funded agency within OMB. And we were created as a way to provide the government with support either by detailing people over or dropping in when there was a problem. And then, 18F is an organization that is named because the offices of GSA and TTS (Technology Transformation Service), where it's housed, are on the corner of 18th Street and F in DC. And 18F is sort of like having a technology or a digital agency for hire within the government. 
So, they are full-time employees of the government, sort of like USDS, except government agencies can procure the support of that 18F team, just like they would procure the support of your company. And it was a really interesting play. They are fully cost-recoupable subcomponent of TTS, which means they have to basically make back all the money that they spend, whereas USDS is different. It's congressionally funded for what it does. But they're all similar sibling organizations that are all trying to change how government works or to bring a more modern idea or parlance into the government. I used to say to people all the time that at USDS, you know, we would set a broken bone say, and then we would come back around and say, like, "Hey, does your arm hurt anymore?" The idea being like, no. Be like, "Cool, cool. Maybe you should go to the gym, and you should eat better." And that would be, like, procurement change. That would be, like, changing for the long term. So, all the work I was doing was building political capital so we could do better work in changing how procurement was done and then changing how the government delivered these things. So, what was awesome was, like, we used to have these fights at USDS about whether or not we were a culture change or we were firefighters. And I think the reality is once we're involved, culture changes happen. The bigger question is, are we going to be there for the long haul, or are we only there for a shorter period of time? And I think there are reasons why USDS teams had both plays. And I think it really is just two different plays for the same outcome. VICTORIA: Yeah, that makes sense. 
And to pivot a little bit, I think, you know, our audience, we have clients and listeners who are founders of products that are aimed at making these, like, public service needs, or to give some examples, like, maybe they're trying to track Congressional voting patterns or contact information for different state representatives, and they're trying to navigate this space [laughs]. So, maybe you can give some advice for founders interested in selling their products to government agencies. What can they do to make it more appealing and less painful for themselves? ROBBIE: I wouldn't consider myself a procurement expert, but at USDS, the procurement team called themselves the [SP] procurementati. And I was a secret member of the procurementati. I often was the engineer they would call to evaluate statements of work or sometimes be on technology evaluation panels. And it was fun to be a part of that. Things that most companies don't realize is government agencies will put out things like request for information or sources sought in the government space. And this is a way for industry to influence how government tries to solve problems. If you are trying to go after government work and you're only responding to an RFP, you're probably behind in your influence that you could have on the type of work. So, you'll see if a procurement seems to be, like, specifically focused on an approach, or a technology, or a framework, it's probably because some companies have come through and said, "I think this would probably solve your problem," and they gave examples. So, that's one way to be more connected to what's going on is to follow those types of requests. Another is to follow the money. My wife is this amazing woman who helped write The Data Act and get it passed through government. And The Data Act is the Data and Transparency Act. And that led to her heading over to treasury and leading up a team that built USAspending.gov. 
So, there is a website that tracks every dollar, with some exceptions, of the funding that comes out of Congress every year. And what's great is you can track it down to where it's spent, and how it's spent, and things like that. For education purposes, I think that is a really good thing that business and growth people can focus on is try to see and target where competitors or where solutions that you've looked at have gone in the past. It's just a good set of data for you to take a look at. The other piece is if you're creating a solution that is a delivery or a deliverable, like a SaaS solution, in order for something to be utilized in the government, it probably needs to be FedRAMP-approved, which is a process by which security approvals have been given so that government agencies have the green light to utilize your solution. So, there's tons of documentation out there about FedRAMP and the FedRAMP approval process. But that is one of those things that becomes a very big stopping point for product companies that are trying to work in the government. The easiest way to work your way through that is to read up on it a bunch, but also find an agency that was probably willing to sponsor you getting FedRAMP approval. Most companies start working with a government agency, get an exemption for them to utilize your product, and then you get to shape what that FedRAMP process looks like. You start applying for it, and then you have to have some sort of person who's helping shepherd it for you internally in the government and accepting any issues that come along in the process. So, I guess FedRAMP approval is one that's a little complicated but would be worth looking into if you were planning on delivering a product in government. VICTORIA: Right. And does that apply to state governments as well? ROBBIE: So, lots of state-related and city and locality-related governments will actually adopt federal solutions or federal paradigms. 
So, I think in the state of California, I think FedRAMP as one of the guiding principles for accepting work into the state of California, so it's not consistent. There's not a one-to-one that every state, or every city, or every locality will pull this in. But if you are already approved to be a federal contractor, or a federal business, or a federal product, it's probably going to be easier to make your way into the local spaces also. VICTORIA: Right. And as you said, there's plenty of resources, and tools, and everything to help you go along that journey if that's the group you're going for [laughs]. Mid-Roll Ad: When starting a new project, we understand that you want to make the right choices in technology, features, and investment but that you don't have all year to do extended research. In just a few weeks, thoughtbot's Discovery Sprints deliver a user-centered product journey, a clickable prototype or Proof of Concept, and key market insights from focused user research. We'll help you to identify the primary user flow, decide which framework should be used to bring it to life, and set a firm estimate on future development efforts. Maximize impact and minimize risk with a validated roadmap for your new product. Get started at: tbot.io/sprint. VICTORIA: So, kind of bringing it back to you, like, you're saying you want those partnerships within the government. You want someone advocating for you or for your product or your service. Drawing that back to what you said earlier about community, like, how do you form a community with this group of people who are in the state, or federal government, or civic tech spaces? ROBBIE: Yeah, I think it's an interesting problem because so much of it feels impenetrable from the outside. Most people don't even know where to start. There are organizations out there that are pretty good community connections, an example I would give is ACT-IAC. 
It is a public-private partnership where people from within the government, experts in their fields, and people in the private industry who are experts in their fields will be together on community boards and engaging in panels. And so, it's a really nice way to start connecting those dots. I have no direct affiliation with ACT-IAC. But if they'd like to give me my own account, that would be great. But it is one of those organizations I've seen be successful for people trying to find their way into a community that is a little harder to find. I think, also, so much of the community engagement happens at conferences and around...so, like, if you're in the healthcare space, this last month, you've had multiple conferences that I think were really great for people to get to know one another, you know, an example is ViVE. It just happened out in LA, which is a little more on the private sector health space, but still, government agencies were there. I know that the Department of Veterans Affairs had people there and were on panels. And then, HIMSS is another conference that takes place, and that just took place down in Miami. And in Miami, HIMSS happened and a whole bunch of other social community events took place. So, I'm close with a thing called the Digital Services Coalition, which is 47 companies that all try to deliver good government based on the Digital Services Playbook that was created by USDS that lives at playbook.cio.gov and the way that they try to accomplish this work. And that organization, while they were in Miami, hosted a happy hour. So, there's a lot of connections that can be made once you start seeing the players and getting to know who's around. So, it's a little bit about trying to find your way to that first event, and I think that will really open up everything for you. Within a week or two, I was at an International Women's Day event at MetroStar, which is a really great company that I've gotten a chance to spend some time with. 
And then, I was at an event for the Digital Services Coalition talking about open source in government. So, there's a lot of stuff out there for you to be a part of that isn't super cost-prohibitive and also doesn't take a lot once you start to open the door. You know, once you peek around that corner and you find some people, there's a lot more to be done. VICTORIA: Yeah. And you touched on something at the end there that makes me want to bring up some of the advantages you can have being a small business, a minority-owned business, or woman-owned business, or veteran-owned business, so thinking about how you can form those connections, especially if you have one of those socioeconomic set-asides that you might want to consider if you're looking to work with the government as well. ROBBIE: Yeah. Those socioeconomic set-asides include small businesses, woman-owned small business. I think it's Native and Alaskan 8(a), which is historically underrepresented and service-disabled veteran-owned. So, there are also sub-communities of associations, like there's the Digital WOSB, the digital Women-Owned Small Business alliance that was founded by Jess Morris from Pluribus Digital, and a bunch of other companies in the Digital Services Coalition. I believe she's the president of the Digital WOSB right now. That is a sub-community of women-owned small businesses that are trying to connect and create a community where they can support one another. And that's just one example of the type of connection you can make through those types of socioeconomic set-asides. But once you have those official socioeconomic set-asides, it will allow you to get specific contract engagements in the government that are not allowed or available for others. So, the government procurement process will have some amount of these specific socioeconomic set-asides that need to be hit. Like, 8% of all procurements need to go to this and 10% of all procurements need to go to this. 
So, I think the VA is probably one of the most effective at hitting any of the socioeconomic set-asides, specifically related to service-disabled veteran-owned small businesses. So, if you happen to be a person of color and you found a business and you are female presenting, right? You may have 8(a) woman-owned small business. If you also happen to be a veteran and you're service-disabled, all of those things stack. You don't just get to have one of them. And they can be really effective in helping a business get a leg up and starting out and trying to help even the playing field for those communities. VICTORIA: Yeah. What I really appreciated about my experience working with Pluribus Digital, and you, and people who had had that experience in the U.S. Digital Service is that there is this community and desire to help each other out and that you can have access to people who know how to move your product forward, get you the connections that you need to be competitive, and to go after the work. So, I love encouraging people to consider civic tech options. And maybe even say more about just how impactful some of it can be. And what kind of missions are you solving in these spaces? ROBBIE: Yeah, I often try to remind people, especially those who are heading towards or considering civic tech, there are very few places in this world that you can work on something that can impact millions of people. Sure, I was lucky enough. I have tons of privilege. I worked at a lot of wacky places that have given me the access to do the type of work that I think is impactful, but very little has the kind of impact. Like, when I was interviewed by Marina Nitze as my last interview when I joined USDS, she sent me an email at the end of it and said, like, "Everything was great. I look forward to working with you. And remember, every time you commit code into our GitHub, you'll be helping 8 million veterans." And then, she cc'd Todd Park. 
And Todd was the CTO of the U.S., and he responded back within a minute. Todd is one of the most busy people. It was amazing to me how fast he responded. But he was like, "Lemme tell you, as somebody who can talk on behalf of our president, our country needs you." And those kinds of things they're hard to comprehend. And then, I joined the Digital Service team at the VA. And one of the first things that I got to support was the 10-10EZ. It's the healthcare application for veterans. And before I got there, it was a hosted PDF that we were trying to replace. And the team had been working for months to try to build a new, modern solution. What it was is it was, like, less than six submissions were happening a day because it only worked in Adobe Acrobat, I think it was 6.5 and below, and Internet Explorer 8.5 and below. And if you think about the people that could submit utilizing that limited set of technologies, it was slowly becoming homeless veterans who were using library computers that had not been upgraded. So, there was a diminishing amount of value that it was providing. And then, on top of it, it was sort of lying to veterans. If the version of the Adobe Acrobat was out of date, or wrong, or too new, it would tell them to upgrade their browser. So, like, it was effectively not providing any value over time. We were able to create a new version of that and that was already well on its way when I joined, but we were able to get it out the door. And it was a React frontend using a Node backend to talk to that SOAP API endpoint. Within the first week, we went from 6 submissions a day to 60 submissions a day. It's a joke, right? We were all 10x developers. We were like, "Look at us. We're killing it." But about three years later, Matt Cutts came to a staff meeting of USDS, and he was the second administrator of the USDS. And he brought the cake that had the actual 10-10EZ form on it, and it said, "500,000." 
And he had checked with the analytics team, and there were over 500,000 submissions of that form, which means there are 500,000 possible veterans that now may or may not have access to healthcare benefits. Those are big problems. All of that was done by changing out one form. It opened up the world. It opened up to a group of veterans that no one else was able to do. They would have had to go into a veteran's office, and they would have had to fill it out in paper. And some veterans just don't have the ability to do that, or don't have an address, or don't have a...so, there are so many reasons why having a digital form that worked for veterans was so important. But this one form that we digitized and we helped make modern has been submitted so many times and has helped so many veterans and their families. And that's just one example. That's just one form that we helped digitize. But now the team, I mean, I'm back in the VA ecosystem. There's, like, 2,500 people in the general channel in the office of the CTO Slack organization. That's amazing. There's people there that are working all day, every day, trying to solve the same problems that I was trying to solve when I got here. And there's so much work being done to help veterans. But that's just one example, right? Like, at USDS, I know that the digital filing for the free version of your tax form, the IRS e-file Direct, just went live. That was something that USDS had been working on for a very long time behind the scenes. And that's going to impact everybody who submits their taxes. These are the kind of problems that you get to work on or the scope of some of the problems if you work in these types of organizations, and that's really powerful. It's the thing that keeps drawing me back. I'm back supporting the VA again through some contracts in my business. But it's funny, like, I was working for another agency. I was over working at DHS on an asylum project. 
And a friend of mine kept telling people to tell me, "Man, veterans need you. If only there was another one of you to help us over here, that would be great." And eventually, it led to me being like, well, veterans need me. I'm going to go back to the VA. And that was my second tour at USDS at the Department of Veterans Affairs. And now I'm back there again. So, it's a very impactful place to work. There's tons of value you can provide to veterans. And, to me, it's the kind of work that keeps bringing me back. I didn't realize just how much I was a, like, impact junkie until I joined USDS, and then it really came to a head. I cannot believe how much work I've gotten to be a part of that has affected and supported those who get benefits and services from the federal government. VICTORIA: [inaudible 33:47] impact junkie. That's funny. But yeah, no, thank you for sharing that. That's really interesting. Let me see. If you could go back in time to when you first started on this journey, if you could give yourself any advice, what would you say? ROBBIE: Yeah, I think I spent so much time being nervous about not having my degree that I was worried it was going to hinder me forever. And it's pretty amazing the career I've been able to thread together, right? Like, you know, I've hit on a few of them already. But, like, I started with a small web hosting company, and then New York State in the Department of Social Services, then New York City in the Human Resources Administration, Sony Music, Zagat Survey, Google, Johnson & Johnson, IDT telecommunications, Phase2 technologies, where I got to work on an awful lot of problems in lots of awesome places like NBA.com, and Major League Soccer, and Bassmaster. And then, the United States Digital Service where I got to work on things supporting the Department of Veterans Affairs and the Department of Homeland Security. Over at ADA.gov in DOJ, I helped them out. And I worked over at USDA helping get Farmers.gov off the ground. 
So, everything on my left leg, tattoo-wise, is something that changed my life from my perspective. And I have a Drupal tattoo on the back of my leg. I have a DrupalCon. So, anytime somebody said "Drupal" anywhere near a USDS person, I would magically appear because they would just be like, "Oh, Robbie has that Drupal tattoo." So, I got to work on a lot of dot govs that eventually landed or were being built in Drupal. So, I got to support a lot of work. And it meant that I got to, like, float around in government and do a lot of things that others didn't get to do. When CISA stood up, which is the office of security inside of DHS, it's one of the newest sub-communities or subcomponents, they built DHS SchoolSafety.gov, which is a cross-MOU'd site. And I got to sit in and help at the beginning of that because of my Drupal background. But it was really fun to be the person who helped them work with the vendors and make sure that they understood what they were trying to accomplish and be a sort of voice of reason in the room. So, I did all of that work, and then after that, I went and worked at Pluribus Digital, where I got a chance to work side by side with you. And then, that led to other things, like, I was able to apply and become the director of Digital Services and software engineering for my local county. So, I worked for Prince William County, where I bought a house during the pandemic. And then, after that contract ended, I had already started my own business. So, that's led to me having multiple individual contracts with companies and so many people. I've gotten to work on so many different things. And I feel very lucky. If I could go back and tell myself one thing, it's just, take a breath. Everything's going to be okay. And focus on the things that matter. Focus on the things that are going to help you. Focus on community. Focus on delivering value. Everything else will work itself out. You know, I joke all the time that I'm really good at providing impact. 
If you can measure my life in impact and value, I would be a very rich man. If you can measure it in money, I'm doing all right, but I'm never going to be yacht Robbie, you know, but I'm going to do okay. VICTORIA: Oh my god, yacht Robbie. That's great. So, just to recap, everything's going to be okay. You never know where it's going to take you. And don't be limited by the things that you think, you know, make you not enough. Like, there's a lot of things that you can do out there. I really like that advice. ROBBIE: And I think one last piece is, like, community matters, if you are a part of communities and you do it genuinely, how much that will impact your career. I gave a talk from Drupal NYC to the White House and beyond. And I talked through my entire resume and how everything changed when I started doing community engagement. When I went to the Drupal community in New York City and how that led to Sony, and that led to Zagat, which led to me getting acquired by Google, like, these things all cascaded. And then, when I moved to the DMV, I was able to join here and continue supporting communities, which allowed me to bring people into the local civic tech community from the local DC tech community. So, so many of the best USDS engineers, and designers, and product people I was able to help influence to come to government were people I met in the community or the communities I helped support. You know, I was an early revivalist of Alexandria Code and Coffee. It was a community that was started and then wavered. And then, Sean McBeth reached out to the community and said, "Do we want to help and support getting it off the ground again?" And I immediately said, "Yes." And then, that led to my friends at BLACK CODE COLLECTIVE wanting to create a community where they could feel safe and connected and create a community of their own. And then DC Code & Coffee started. And from there, Baltimore Code & Coffee kicked off. 
And it's just really nice that, like, it doesn't matter where I've been. All these things keep coming back to be a part of community and help support others. And you will be surprised at how much you get back in return. I wouldn't be the person I am today in my community. I wouldn't have my career if it wasn't for the people who started and helped shepherd me when I was starting out. And I feel like I've been trying to do the same for people for a really long time. VICTORIA: I love that. That's what I say, too, when people ask me for advice on careers and how to grow. And my biggest piece is always to go out and meet people. And go to your community, like, look and see what's happening. Like, find people you like hanging out with and learning from. And just that should be the majority of your time probably if you're trying to figure out where to go with your career or even just, like, expand as a person sometimes [laughs]. Robbie, I was going to ask; you mentioned that you had bought a house in Virginia. One of my other warmup questions was going to be, what's your favorite thing to put on the grill? ROBBIE: My house in Virginia definitely gets a lot of use, especially in the spring and the fall. I'm a big fan of team no extreme when it comes to temperature. But during those time periods, my grill is often fired up. My favorite is probably to make skirt steak on the grill. I'm a huge fan of tacos, especially made out of skirt steak. I'm in all day. That's one of my favorites. I also love to smoke. I have a smoker because I'm a caricature-esque suburban dude. I'm going to live into all of the possible things I could have. But I've had a smoker for a long time, and I love making sort of poor man's burnt ends. It's one of my favorite things to make. But you got to have some time. That's the kind of thing that takes, you know, 14 hours or 16 hours, but it's really fun to take advantage of it. A quick thing I love to make is actually smoked salmon. 
It takes longer to brine it than it usually does to smoke it. But it is one of the nicest things I've made on my smoker, you know, fresh pesto on a piece of salmon is pretty awesome, or everything bagel. Everything with the bagel seasoning is a pretty fun way to smoke some salmon. VICTORIA: Wow, that sounds so good. I'm going to have to stop in next time I'm in Virginia and get some [laughs] and hang out. Do you have any questions for me? ROBBIE: I'm excited to see where you've gone and how you've gotten here. I think this is such a cool job for you. Knowing who you are as a person and seeing you land in a company like this is really exciting. And I think you getting to be a part of this podcast, which we were joking about earlier, is I've been listening to probably since it started. I've been a big fan for a long time. So, it's cool to be here on this podcast. But it's also cool that my friend is a part of this and gets to be a part of this legacy. I'm really excited to see where you go over time. I know my career has been changing, right? I worked in government. Before that, I did all kinds of other stuff. Nowadays I have my own business where I often joke I have sort of, like, three things I offer, which is, like, consigliere services. Wouldn't it be nice to have a Robbie on your executive team without having to pay them an executive salary? You know, another one is like, you know, strategy and mentoring, but these are all things I know you do also, which I think is cool. But I've been working on contracts where I support companies trying to figure out how they modernize, or how their CTO can be more hands-off keyboard, or how their new director of business development can be more of a technical leader and taking on their first direct reports. So, I just enjoy all those aspects, and I just think it's something that I've watched you do in the company where we worked together. 
And it's always fun to see what you're working on and getting a chance to catch up with you. I feel like you're one of those people that does a really great job of staying connected. Every once in a while, I'll get a random text message like, "Hey, how you doing?" It always makes me smile. I'm like, Victoria is a really good connector, and I feel like I am, but you're even better at it on the being proactive side. That's how this all came about, right? We caught up, and you were like, "Why don't you come on the podcast?" So, that's really exciting. VICTORIA: Well, thank you, Robbie. Yeah, I think that's one of the great things about community is you meet people. You're like, "Oh, you're really cool. And you're doing cool stuff all the time. Like, how can I support you in your journey [laughs]? Like, what's up?" Yeah, for me, it was hard to actually leave DC. I didn't, you know, really think about the impact of leaving behind my tech community, like, that network of people. It was pretty emotional for me, actually, especially when we finally, like, stopped doing the digital version [laughs]. And I, like, kind of gave up managing it from California, which was kind of funny anyways [laughs]. Yeah, so no, I'm grateful that we stayed in touch and that you made time to be here with us today. Is there anything else that you would like to promote? ROBBIE: You know, just to remind you, you've done a great job of transitioning into where you are today, but anybody can do that, right? Like, before I moved to the DC area, I was in New York, and I was helping to organize JavaScript events. And I started looking at the DC area before I moved down here. And I found the DC Tech Community. And I found the Node School DC GitHub organization and reached out to the person who had ownership of it and said I wanted to help and support. I looked at this the other day. I think I moved on May 8th, and then, like, May 11th, when I walked in the door, somebody was like, "Are you new?" 
And I was like, "Yeah, I just moved here." And they were like, "Oh, from where?" And I was like, "New York." And they were like, "Are you that guy who's been bugging Josh about running Node school events?" I was like, "Yeah." And like, they were planning an unconference at the end of the month. And they were like, "Would you like to run a Node school at that unconference?" Like, 27 days later. So, it was amazing that, like, I immediately, like, fell from the New York Community where I was super connected, but I went out of my way to try to, like, see what the community looked like before I got there. And I was lucky enough to find the right people, and immediately I joked...I think I wrote a blog post that said like, "I found my new friends. By, like, going from one community to another, gave a person who was in his 40s a chance to meet new people very quickly." And it was pretty amazing, and I felt very lucky. But I did spend a little bit of energy and capital to try to figure it out because I knew it was going to be important to me. So, I think you've done a really good job. You've helped launch and relaunch things that were going on in San Diego and becoming a part of this connection to more people. I think you and I have a very similar spirit, which is like, let's find a way to connect with humans, and we do it pretty effectively. VICTORIA: Well, thank you. That really boosts my confidence, Robbie [laughs]. Sometimes, you show up to an event you've never been to before by yourself, and it's like a deer in headlights kind of moment. Like, oh God, what have I done [laughs]? ROBBIE: Oh, and the last thing I need to mention is I also have a podcast. I have my podcast about film. It's called Geek on Film. I used to record it with my friend, Jon. He's a little busy right now. But I used to pitch it as a conversation show about the current films that were going on. Now, it's one lone geek's ramblings about what he just saw. 
It's a great podcast for me because it gives me an opportunity to think a little more critically about film, which is one of the things that I probably have almost enough credits to get a minor in. But I absolutely love cinema and film in general. And it's given me an opportunity to connect with a lot more people about this subject and also to scratch the itch of me being able to create something around a community and around a thing I really love. VICTORIA: That's super cool. Yeah. You're top of mind because I also like films. I'm like, what's Robbie up to? Like, what's the recommendations, you know [laughs]? Do you have a top film recommendation from the Oscars? Is that too big of a question? ROBBIE: So, the one I will say that didn't get enough spotlight shined on it was Nimona. So, I'm a huge fan of the Spider-Man movies. I think Spider-Man Into the Spider-Verse and Across the Spider-Verse are both masterpieces. But Nimona is an animated film that was picked up by Netflix, and it is amazing. I don't know that I laughed or cried or was more moved by a film last year. And I don't know that it gets enough credit for what it was. But it did get nominated for best-animated film, but I don't know that enough people paid attention to it. Like it may have gotten lost in the algorithm. So, if you get a chance, check out Nimona. It's one of those beautiful, little gems that, if you travel down its story, there's all these twists and turns. It was based on a webcomic that became a graphic novel. One of the production companies picked it up, and it wasn't going to see the light of day. And then, Netflix bought its distribution rights. There's going to be a great documentary someday about, like, Inside Nimona. But I think the movie itself is really charming and moving, and I was really impressed with it. So, that was the one that got me, like, just before the Oscars this year, where I was like, this is the little animated movie that could, in my opinion. It's so charming. 
VICTORIA: I will definitely have to check that out. Thank you for giving us that recommendation. ROBBIE: Totally. VICTORIA: Final question. I just wanted to see if you had anything to share about being an advisory board member for Gray and for Hutch Studio. Could you tell us a little bit more about that? ROBBIE: Yeah. So, Gray Digital was founded by a friend of mine. We met through United States Digital Service. And his organization...I had been supporting him for a while and just being behind the scenes, talking to him and talking through business-related issues. And it was really nice. He offered to make me an official advisory board member. It was a great acknowledgment, and I really felt moved. There's some great people that are supporting him and have supported him. They've done really great work. Gray is out there delivering digital services in this space. And I think I was really lucky to be a part of it and to support my friend, Randall. Hutch is different. Hutch is an organization that's kind of like if you think about it, it almost is a way to support entrepreneurs of color who are trying to make their way into the digital service delivery space. Being an advisory board member there has been really interesting because it's shaping how Hutch provides services and what their approach is to how to support these companies. But over the last year, I've convinced the person who's running it, Stephanie, with a couple of other people, to open the door up or crack the door so we could talk directly and support the individual companies. So, it's been really great to be a Hutch advisory member to help shape how Hutch is approaching things. But I've also been a part of, like, many interview processes. I've reviewed a lot of, like, [inaudible 48:01] who want to join the organization. And I've also created personal relationships with many of the people who are part of Hutch. And, you know, like, you know me personally, so you know I run a Day of the Dead party. 
We'll just party at my house every year. I have a huge amount of affection for Mexican culture and, in general, the approach of how to remember people who are a part of your life. So, this is, like, the perfect way for me to bring people together at my house is to say, like, "Hey, my dad was awesome. What about your family? Who are your people?" What's really nice is that has given me an opportunity to host people at my house. And I've had Hutch company owners at my house the last couple of years and the person who runs Hutch. So, it's a really great community that I look at that is trying to shape the next emergent companies that are helping deliver digital services across the government. And it's really fun to be early on in their career and help them grow. Again, it seems silly, but it's the thing I care a lot about. How do I connect with people and provide the most value that I can? And this is a way I can provide that value to companies that may also go off and provide that value. It's a little bit of an amplifier. So, I'm a huge fan of what we've been able to accomplish and being a part of it in any way, shape, or form. VICTORIA: Well, I think that's a really beautiful way to wrap it up. ROBBIE: Really glad to catch up with you and be a part of this amazing podcast. VICTORIA: Yeah, so much fun. Thank you again so much. It was great to be here with you today. You can subscribe to the show and find notes along with a complete transcript for this episode at giantrobots.fm. If you have questions or comments, email us at hosts@giantrobots.fm. You can find me on thoughtbotsocial@vguido. This podcast is brought to you by thoughtbot and produced and edited by Mandy Moore. Thanks for listening. See you next time. AD: Did you know thoughtbot has a referral program? If you introduce us to someone looking for a design or development partner, we will compensate you if they decide to work with us. More info on our website at: tbot.io/referral. 
Or you can email us at: referrals@thoughtbot.com with any questions.

LINUX Unplugged
549: Will it Nixcloud?

LINUX Unplugged

Play Episode Listen Later Feb 12, 2024 94:10


Deploying Nextcloud the Nix way promises a paradise of reproducibility and simplicity. But is it just a painful trek through configuration hell? We built the dream Nextcloud using Nix and faced reality. Special Guest: Alex Kretzschmar.

2.5 Admins
2.5 Admins 177: Don't Pay the Dane

2.5 Admins

Play Episode Listen Later Jan 11, 2024 29:16


Why the problems with open source licenses aren't quite as easy to fix as some people think, the reasons you should never pay ransomware gangs, and running a Nagios distro on a Raspberry Pi.   Plug Support us on patreon and get an ad-free RSS feed with early episodes sometimes   News/discussion What comes after […]

Late Night Linux All Episodes
2.5 Admins 177: Don't Pay the Dane

Late Night Linux All Episodes

Play Episode Listen Later Jan 11, 2024 29:16


Why the problems with open source licenses aren't quite as easy to fix as some people think, the reasons you should never pay ransomware gangs, and running a Nagios distro on a Raspberry Pi.   Plug Support us on patreon and get an ad-free RSS feed with early episodes sometimes   News/discussion What comes after...

Paul's Security Weekly TV
Nagios and Abandoned Projects, Hacking Trains (to Fix Them), OAuth Threats, 5Ghoul - ASW #267

Paul's Security Weekly TV

Play Episode Listen Later Dec 20, 2023 40:22


Nagios gets a review from NCC Group, hackers hack some anti-fixing code to fix trains in Poland, abusing OAuth post-compromise, 5Ghoul flaws in 5G networks, MITRE teases a new threat model for embedded systems, a conversation on vuln scoring systems, and more! Show Notes: https://securityweekly.com/asw-267

Paul's Security Weekly
Making Service Meshes Work for People - Idit Levine - ASW #267

Paul's Security Weekly

Play Episode Listen Later Dec 19, 2023 77:40


Service meshes create the opportunity to make security a team sport. They can improve observability and service identity. Turning monoliths into micro services sounds appealing, but maybe not every monolith needs to be broken up. We'll also talk about the maturity and design choices that go into service meshes and when a monolith should just remain a monolith. Segment Resources: https://www.solo.io/blog/kubernetes-security-cloud-native-applications/ https://www.solo.io/blog/apis-data-breach-zero-trust/ https://www.solo.io/blog/api-gateways-productivity-resilience-security-cloud-applications/ In the news, Nagios gets a review from NCC Group, hackers hack some anti-fixing code to fix trains in Poland, abusing OAuth post-compromise, 5Ghoul flaws in 5G networks, MITRE teases a new threat model for embedded systems, a conversation on vuln scoring systems, and more! Visit https://securityweekly.com/asw for all the latest episodes! Follow us on Instagram: https://www.instagram.com/secweekly/ Like us on Facebook: https://www.facebook.com/secweekly Show Notes: https://securityweekly.com/asw-267

Application Security Weekly (Audio)
Making Service Meshes Work for People - Idit Levine - ASW #267

Application Security Weekly (Audio)

Play Episode Listen Later Dec 19, 2023 77:40


Service meshes create the opportunity to make security a team sport. They can improve observability and service identity. Turning monoliths into micro services sounds appealing, but maybe not every monolith needs to be broken up. We'll also talk about the maturity and design choices that go into service meshes and when a monolith should just remain a monolith. Segment Resources: https://www.solo.io/blog/kubernetes-security-cloud-native-applications/ https://www.solo.io/blog/apis-data-breach-zero-trust/ https://www.solo.io/blog/api-gateways-productivity-resilience-security-cloud-applications/ In the news, Nagios gets a review from NCC Group, hackers hack some anti-fixing code to fix trains in Poland, abusing OAuth post-compromise, 5Ghoul flaws in 5G networks, MITRE teases a new threat model for embedded systems, a conversation on vuln scoring systems, and more! Visit https://securityweekly.com/asw for all the latest episodes! Follow us on Instagram: https://www.instagram.com/secweekly/ Like us on Facebook: https://www.facebook.com/secweekly Show Notes: https://securityweekly.com/asw-267

Application Security Weekly (Video)
Nagios and Abandoned Projects, Hacking Trains (to Fix Them), OAuth Threats, 5Ghoul - ASW #267

Dec 19, 2023 (40:22)


Nagios gets a review from NCC Group, hackers hack some anti-fixing code to fix trains in Poland, abusing OAuth post-compromise, 5Ghoul flaws in 5G networks, MITRE teases a new threat model for embedded systems, a conversation on vuln scoring systems, and more! Show Notes: https://securityweekly.com/asw-267

Screaming in the Cloud
Chronosphere on Crafting a Cloud-Native Observability Strategy with Rachel Dines

Nov 28, 2023 (29:41)


Rachel Dines, Head of Product and Technical Marketing at Chronosphere, joins Corey on Screaming in the Cloud to discuss why creating a cloud-native observability strategy is so critical, and the challenges that come with both defining and accomplishing that strategy to fit your current and future observability needs. Rachel explains how Chronosphere is taking an open-source approach to observability, and why it's more important than ever to acknowledge that the stakes and costs are much higher when it comes to observability in the cloud.

About Rachel

Rachel leads product and technical marketing for Chronosphere. Previously, Rachel wore lots of marketing hats at CloudHealth (acquired by VMware), and before that, she led product marketing for cloud-integrated storage at NetApp. She also spent many years as an analyst at Forrester Research. Outside of work, Rachel tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston where she's based.

Links Referenced:
Chronosphere: https://chronosphere.io/
LinkedIn: https://www.linkedin.com/in/rdines/

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's featured guest episode is brought to us by our friends at Chronosphere, and they have also brought us Rachel Dines, their Head of Product and Solutions Marketing. Rachel, great to talk to you again.

Rachel: Hi, Corey.
Yeah, great to talk to you, too.

Corey: Watching your trajectory has been really interesting, just because starting off, when we first started, I guess, learning who each other were, you were working at CloudHealth, which has since become VMware. And I was trying to figure out, huh, the cloud runs on money. How about that? It feels like it was a thousand years ago, but neither one of us is quite that old.

Rachel: It does feel like several lifetimes ago. You were just this snarky guy with a few followers on Twitter, and I was trying to figure out what you were doing mucking around with my customers [laugh]. Then [laugh] we kind of both figured out what we're doing, right?

Corey: So, speaking of that iterative process, today, you are at Chronosphere, which is an observability company. We would have called it a monitoring company five years ago, but now that's become an insult after the observability war dust has settled. So, I want to talk to you about something that I've been kicking around for a while because I feel like there's a gap somewhere. Let's say that I build a crappy web app—because all of my web apps inherently are crappy—and it makes money through some mystical form of alchemy. And I have a bunch of users, and I eventually realize, huh, I should probably have a better observability story than waiting for the phone to ring and a customer telling me it's broken.

So, I start instrumenting various aspects of it that seem to make sense. Maybe I go too low level, like looking at all the disks on every server to tell me if they're getting full or not, like they're ancient servers. Maybe I just have a Pingdom equivalent of is the website up enough to respond to a packet? And as I wind up experiencing different failure modes and getting yelled at by different constituencies—in my own career trajectory, my own boss—you start instrumenting for all those different kinds of breakages, you start aggregating the logs somewhere and the volume gets bigger and bigger with time.
But it feels like it's sort of a reactive process as you stumble through that entire environment.

And I know it's not just me because I've seen this unfold in similar ways in a bunch of different companies. It feels to me, very strongly, like it is something that happens to you, rather than something you set about from day one with a strategy in mind. What's your take on an effective way to think about strategy when it comes to observability?

Rachel: You just nailed it. That's exactly the kind of progression that we so often see. And that's what I really was excited to talk with you about today—

Corey: Oh, thank God. I was worried for a minute there that you'd be like, “What the hell are you talking about? Are you just, like, some sort of crap engineer?” And, “Yes, but it's mean of you to say it.” But yeah, what I'm trying to figure out is, is there some magic that I just was never connecting? Because it always feels like you're in trouble because the site's always broken, and oh, like, if the disk fills up, yeah, oh, now we're going to start monitoring to make sure the disk doesn't fill up. Then you wind up getting barraged with alerts, and no one wins, and it's an uncomfortable period of time.

Rachel: Uncomfortable period of time. That is one very polite way to put it. I mean, I will say, it is very rare to find a company that actually sits down and thinks, “This is our observability strategy. This is what we want to get out of observability.” Like, you can think about a strategy in, like, the old-school sense, and you know, as an industry analyst, so I'm going to have to go back to, like, my roots at Forrester with thinking about, like, the people, and the process, and the technology.

But really what the bigger component here is like, what's the business impact? What do you want to get out of your observability platform? What are you trying to achieve? And a lot of the time, people have thought, “Oh, observability strategy. Great, I'm just going to buy a tool.
That's it. Like, that's my strategy.”

And I hate to break it to you, but buying tools is not a strategy. I'm not going to say, like, buy this tool. I'm not even going to say, “Buy Chronosphere.” That's not a strategy. Well, you should buy Chronosphere. But that's not a strategy.

Corey: Of course. I'm going to throw the money by the wheelbarrow at various observability vendors, and hope it solves my problem. But if that solved the problem—I've got to be direct—I've never spoken to those customers.

Rachel: Exactly. I mean, that's why this space is such a great one to come in and be very disruptive in. And I think, back in the days when we were running in data centers, maybe even before virtual machines, you could probably get away with not having a monitoring strategy—I'm not going to call it observability; it's not what we called it back then—you could get away with not having a strategy because what was the worst that was going to happen, right? It was like there was a finite amount that your monitoring bill could be, and a finite amount that your customer impact could be. Like, you're playing the penny slots, right?

We're not on the penny slots anymore. We're at the $50 craps table, and it's Las Vegas, and if you lose the game, you're going to have to run down the street without your shirt. Like, the game and the stakes have changed, and we're still pretending like we're playing penny slots, and we're not anymore.

Corey: That's a good way of framing it. I mean, I still remember some of my biggest observability challenges were building highly available rsyslog clusters so that you could bounce a member and not lose any log data because some of that was transactionally important. And we've gone beyond that to a stupendous degree, but it still feels like you don't wind up building this into the application from day one. More's the pity because if you did, and did that intelligently, that opens up a whole world of possibilities.
I dream of that changing where one day, whenever you start to build an app, oh, and we just push the button and automatically instrument with OTel, so you instrument the thing once everywhere it makes sense to do it, and then you can do your vendor selection and what you said were decisions later in time. But these days, we're not there.

Rachel: Well, I mean, and there's also the question of just the legacy environment and the tech debt. Even if you wanted to, the—actually I was having a beer yesterday with a friend who's a VP of Engineering, and he's got his new environment that they're building with observability instrumented from the start. How beautiful. They've got OTel, they're going to have tracing. And then he's got his legacy environment, which is a hot mess.

So, you know, there's always going to be this bridge of the old and the new. But this was where it comes back to no matter where you're at, you can stop and think, like, “What are we doing and why?” What is the cost of this? And not just cost in dollars, which I know you and I could talk about very deeply for a long period of time, but like, the opportunity costs. Developers are working on stuff when they could be working on something that's more valuable.

Or like the cost of making people work round the clock, trying to troubleshoot issues when there could be an easier way. So, I think it's like stepping back and thinking about cost in terms of dollars and cents, time, opportunity, and then also impact, and starting to make some decisions about what you're going to do in the future that's different. Once again, you might be stuck with some legacy stuff that you can't really change that much, but [laugh] you got to be realistic about where you're at.

Corey: I think that that is a… it's a hard lesson, to be very direct, in that companies need to learn it the hard way, for better or worse.
Honestly, this is one of the things that I always noticed in startup land, where you had a whole bunch of, frankly, relatively early-career engineers in their early-20s, if not younger. But then the ops person was always significantly older because the thing you actually want to hear from your ops person, regardless of how you slice it, is, “Oh, yeah, I've seen this kind of problem before. Here's how we fixed it.” Or even better, “Here's the thing we're doing, and I know how that's going to become a problem. Let's fix it before it does.” It's the, “What are you buying by bringing that person in?” “Experience, mostly.”

Rachel: Yeah, that's an interesting point you make, and it kind of leads me down this little bit of a side note, but a really interesting antipattern that I've been seeing in a lot of companies is that more seasoned ops person, they're the one who everyone calls when something goes wrong. Like, they're the one who, like, “Oh, my God, I don't know how to fix it. This is a big hairy problem,” I call that one ops person, or I call that very experienced person. That experienced person then becomes this huge bottleneck into solving problems that people don't really—they might even be the only one who knows how to use the observability tool. So, if we can't find a way to democratize our observability tooling a little bit more so, like, just day-to-day engineers, like, more junior engineers, newer ones, people who are still ramping, can actually use the tool and be successful, we're going to have a big problem when these ops people walk out the door, maybe they retire, maybe they just get sick of it. We have these massive bottlenecks in organizations, whether it's ops or DevOps or whatever, that I see often exacerbated by observability tools. Just a side note.

Corey: Yeah. On some level, it feels like a lot of these things can be fixed with tooling. And I'm not going to say that tools aren't important. You ever tried to implement observability by hand?
It doesn't work. There have to be computers somewhere in the loop, if nothing else.

And then it just seems to devolve into a giant swamp of different companies, doing different things, taking different approaches. And, on some level, whenever you read the marketing or hear the stories any of these companies tell, you also have to normalize it, translating from whatever marketing language they've got into something that comports with the reality of your own environment and seeing if they align. And that feels like it is so much easier said than done.

Rachel: This is a noisy space, that is for sure. And you know, I think we could go out to ten people right now and ask those ten people to define observability, and we would come back with ten different definitions. And then if you throw a marketing person in the mix, right—guilty as charged, and I know you're a marketing person, too, Corey, so you got to take some of the blame—it gets mucky, right? But like I said a minute ago, the answer is not tools. Tools can be part of the strategy, but if you're just thinking, “I'm going to buy a tool and that's going to solve my problem,” you're going to end up like this company I was talking to recently that has 25 different observability tools.

And not only do they have 25 different observability tools, what's worse is they have 25 different definitions for their SLOs and 25 different names for the same metric. And to be honest, it's just a mess. I'm not saying, like, go be Draconian and, you know, tell all the engineers, like, “You can only use this tool [unintelligible 00:10:34] use that tool,” you got to figure out this kind of balance of, like, hands-on, hands-off, you know? How much do you centralize, how much do you push and standardize?
Otherwise, you end up with just a huge mess.

Corey: On some level, it feels like it was easier back in the days of building it yourself with Nagios because there's only one answer, and it sucks, unless you want to start going down the world of HP OpenView. Which step one: hire a 50-person team to manage OpenView. Okay, that's not going to solve my problem either. So, let's get a little more specific. How does Chronosphere approach this?

Because historically, when I've spoken to folks at Chronosphere, there isn't that much of a day one story, in that, “I'm going to build a crappy web app. Let's instrument it for Chronosphere.” There's a certain, “You must be at least this tall to ride,” implicit expectation built into the product just based upon its origins. And I'm not saying that doesn't make sense, but it also means there's really no such thing as a greenfield build out for you either.

Rachel: Well, yes and no. I mean, I think there's no green fields out there because everyone's doing something for observability, or monitoring, or whatever you want to call it, right? Whether they've got Nagios, whether they've got the Dog, whether they've got something else in there, they have some way of introspecting their systems, right? So, one of the things that Chronosphere is built on, that I actually think this is part of something—a way you might think about building out an observability strategy as well, is this concept of control and open-source compatibility. So, we only can collect data via open-source standards. You have to send this data via Prometheus, via OpenTelemetry, it could be older standards, like, you know, StatsD, Graphite, but we don't have any proprietary instrumentation.

And if I was making a recommendation to somebody building out their observability strategy right now, I would say open, open, open, all day long because that gives you a huge amount of flexibility in the future. Because guess what?
You know, you might put together an observability strategy that seems like it makes sense for right now—actually, I was talking to a B2B SaaS company that told me that they made a choice a couple of years ago on an observability tool. It seemed like the right choice at the time. They were growing so fast, they very quickly realized it was a terrible choice.

But now, it's going to be really hard for them to migrate because it's all based on proprietary standards. Now, of course, a few years ago, they didn't have the luxury of OpenTelemetry and all of these, but now that we have this, we can use these to kind of future-proof our mistakes. So, that's one big area that is, once again, both my recommendation and happens to be our approach at Chronosphere.

Corey: I think that that's a fair way of viewing it. It's a constant challenge, too, just because increasingly—you mentioned the Dog earlier, for example—I will say that for years, I have been asked whether or not at The Duckbill Group, we look at Azure bills or GCP bills. Nope, we are pure AWS. Recently, we started to hear that same inquiry specifically around Datadog, to the point where it has become a board-level concern at very large companies. And that is a challenge, on some level.

I don't deviate from my typical path of I fix AWS bills, and that's enough impossible problems for one lifetime, but there is a strong sense of you want to record as much as possible for a variety of excellent reasons, but there's an implicit cost to doing that, and in many cases, the cost of observability becomes a massive contributor to the overall cost. Netflix has said in talks before that they're effectively an observability company that also happens to stream movies, just because it takes so much effort, engineering, and raw computing resources in order to get that data and do something actionable with it. It's a hard problem.

Rachel: It's a huge problem, and it's a big part of why I work at Chronosphere, to be honest.
Because when I was—you know, towards the tail end at my previous company in cloud cost management, I had a lot of customers coming to me saying, “Hey, when are you going to tackle our Dog or our New Relic or whatever?” Similar to the experience you're having now, Corey, this was happening to me three, four years ago. And I noticed that there is definitely a correlation between people who are having these really big challenges with their observability bills and people that were adopting, like, Kubernetes, and microservices and cloud-native. And it was around that time that I met the Chronosphere team, which is exactly what we do, right? We focus on observability for these cloud-native environments where observability data just goes, like, wild.

We see 10X, 20X as much observability data and that's what's driving up these costs. And yeah, it is becoming a board-level concern. I mean, and coming back to the concept of strategy, like if observability is the second or third most expensive item in your engineering bill—like, obviously, cloud infrastructure, number one—number two and number three is probably observability. How can you not have a strategy for that? How can this be something the board asks you about, and you're like, “What are we trying to get out of this? What's our purpose?” “Uhhhh… troubleshooting?”

Corey: Right because it turns into business metrics as well. It's not just about is the site up or not. There's a—like, one of the things that always drove me nuts not just in the observability space, but even in cloud costing is where, okay, your costs have gone up this week so you get a frowny face, or it's in red, like traffic light coloring. Cool, but for a lot of architectures and a lot of customers, that's because you're doing a lot more volume. That translates directly into increased revenues, increased things you care about. You don't have the position or the context to say, “That's good,” or, “That's bad.” It simply is.
And you can start deriving business insight from that. And I think that is the real observability story that I think has largely gone untold at tech conferences, at least.

Rachel: It's so right. I mean, spending more on something is not inherently bad if you're getting more value out of it. And it's definitely a challenge on the cloud cost management side. “My costs are going up, but my revenue is going up a lot faster, so I'm okay.” And I think some of the plays, like you know, we put observability in this box of, like, it's for low-level troubleshooting, but really, if you step back and think about it, there's a lot of larger, bigger picture initiatives that observability can contribute to in an org, like digital transformation. I know that's a buzzword, but, like that is a legit thing that a lot of CTOs are out there thinking about. Like, how do we, you know, get out of the tech debt world, and how do we get into cloud-native?

Maybe it's developer efficiency. God, there's a lot of people talking about developer efficiency. Last week at KubeCon, that was one of the big, big topics. I mean, and yeah, what [laugh] what about cost savings? To me, we've put observability in a smaller box, and it needs to bust out.

And I see this also in our customer base, you know? Customers like DoorDash use observability, not just to look at their infrastructure and their applications, but also look at their business. At any given minute, they know how many Dashers are on the road, how many orders are being placed, cut by geos, down to the—actually down to the second, and they can use that to make decisions.

Corey: This is one of those things that I always found a little strange coming from the world of running systems in large [unintelligible 00:17:28] environments to fixing AWS bills. There's nothing that even resembles a fast, reactive response in the world of AWS billing. You wind up with a runaway bill, they're going to resolve that over a period of weeks, on Seattle business hours.
If you wind up spinning something up that creates a whole bunch of very expensive drivers behind your bill, it's going to take three days, in most cases, before that starts showing up anywhere that you can reasonably expect to get at it. The idea of near real time is a lie unless you want to start instrumenting everything that you're doing to trap the calls and then run cost extrapolation from there. That's hard to do.

Observability is a very different story, where latencies start to matter, where being able to get leading indicators of certain events—be they technical or business—start to be very important. But it seems like it's so hard to wind up getting there from where most people are. Because I know we like to talk dismissively about the past, but let's face it, conference-ware is the stuff we're the proudest of. The reality is the burning dumpster of regret in our data centers that still also drives giant piles of revenue, so you can't turn it off, nor would you want to, but you feel bad about it as a result. It just feels like it's such a big leap.

Rachel: It is a big leap. And I think the very first step I would say is trying to get to this point of clarity and being honest with yourself about where you're at and where you want to be. And sometimes not making a choice is a choice, right, as well. So, sticking with the status quo is making a choice. And so, like, as we get into things like the holiday season right now, and I know there's going to be people that are on-call 24/7 during the holidays, potentially, to keep something that's just duct-taped together barely up and running, I'm making a choice; you're making a choice to do that.
So, I think that's like the first step is the kind of… at least acknowledging where you're at, where you want to be, and if you're not going to make a change, just understanding the cost and being realistic about it.

Corey: Yeah, being realistic, I think, is one of the hardest challenges because it's easy to wind up going for the aspirational story of, “In the future when everything's great.” Like, “Okay, cool. I appreciate the need to plant that flag on the hill somewhere. What's the next step? What can we get done by the end of this week that materially improves us from where we started the week?” And I think that with the aspirational conference-ware stories, it's hard to break that down into things that are actionable, that don't feel like they're going to be an interminable slog across your entire existing environment.

Rachel: No, I get it. And for things like, you know, instrumenting and adding tracing and adding OTel, a lot of the time, the return that you get on that investment is… it's not quite like, “I put a dollar in, I get a dollar out,” I mean, something like tracing, you can't get to 60% instrumentation and get 60% of the value. You need to be able to get to, like, 80, 90%, and then you'll get a huge amount of value. So, it's sort of like you're trudging up this hill, you're charging up this hill, and then finally you get to the plateau, and it's beautiful. But that hill is steep, and it's long, and it's not pretty. And I don't know what to say other than there's a plateau near the top. And those companies that do this well really get a ton of value out of it. And that's the dream, that we want to help customers get up that hill. But yeah, I'm not going to lie, the hill can be steep.

Corey: One thing that I find interesting is there's almost a bimodal distribution in companies that I talk to. On the one side, you have companies like, I don't know, a Chronosphere is a good example of this.
Presumably you have a cloud bill somewhere and the majority of your cloud spend will be on what amounts to a single application, probably in your case called, I don't know, Chronosphere. It shares the name of the company. The other side of that distribution is the large enterprise conglomerates where they're spending, I don't know, $400 million a year on cloud, but their largest workload is 3 million bucks, and it's just a very long tail of a whole bunch of different workloads, applications, teams, et cetera.

So, what I'm curious about from the Chronosphere perspective—or the product you have, not the ‘you' in this metaphor, which gets confusing—is, it feels easier to instrument a Chronosphere-like company that has a primary workload that is the massive driver of most things and get that instrumented and start getting an observability story around that than it does to try and go to a giant company and, “Okay, 1500 teams need to all implement this thing that are all going in different directions.” How do you see it playing out among your customer base, if that bimodal distribution holds up in your world?

Rachel: It does and it doesn't. So, first of all, for a lot of our customers, we often start with metrics. And starting with metrics means Prometheus. And Prometheus has hundreds of exporters. It is basically built into Kubernetes. So, if you're running Kubernetes, getting Prometheus metrics out, actually not a very big lift. So, we find that we start with Prometheus, we start with getting metrics in, and we can get a lot—I mean, customers—we have a lot of customers that use us just for metrics, and they get a massive amount of value.

But then once they're ready, they can start instrumenting for OTel and start getting traces in as well. And yeah, in large organizations, it does tend to be one team, one application, one service, one department that kind of goes at it and gets all that instrumented.
But I've even seen very large organizations, when they get their act together and decide, like, “No, we're doing this,” they can get OTel instrumented fairly quickly. So, I guess it's, like, a lining up. It's more of a people issue than a technical issue a lot of the time.

Like, getting everyone lined up and making sure that like, yes, we all agree. We're on board. We're going to do this. But it's usually, like, it's a start small, and it doesn't have to be all or nothing. We also just recently added the ability to ingest events, which is actually a really beautiful thing, and it's very, very straightforward.

It basically just—we connect to your existing other DevOps tools, so whether it's, like, a Buildkite, or a GitHub, or, like, a LaunchDarkly, and then anytime something happens in one of those tools, that gets registered as an event in Chronosphere. And then we overlay those events over your alerts. So, when an alert fires, then first thing I do is I go look at the alert page, and it says, “Hey, someone did a deploy five minutes ago,” or, “There was a feature flag flipped three minutes ago,” I solved the problem right then. I don't think of this as—there's not an all or nothing nature to any of this stuff. Yes, tracing is a little bit of a—you know, like I said, it's one of those things where you have to make a lot of investment before you get a big reward, but that's not the case in all areas of observability.

Corey: Yeah. I would agree. Do you find that there's a significant easy, early win when customers start adopting Chronosphere? Because one of the problems that I've found, especially with things that are holistic, and as you talk about tracing, well, you need to get to a certain point of coverage before you see value. But human psychology being what it is, you kind of want to be able to demonstrate, oh, see, the Mean Time To Dopamine needs to come down, to borrow an old phrase.
Do you find that there are some easy wins that start to help people to see the light? Because otherwise, it just feels like a whole bunch of work for no discernible benefit to them.

Rachel: Yeah, at least for the Chronosphere customer base, one of the areas where we're seeing a lot of traction this year is in optimizing the costs, like, coming back to the cost story of their overall observability bill. So, we have this concept of the control plane in our product where all the data that we ingest hits the control plane. At that point, that customer can look at the data, analyze it, and decide this is useful, this is not useful. And actually, not just decide that, but we show them what's useful, what's not useful. What's being used, what's high cardinality, but—and high cost, but maybe no one's touched it.

And then we can make decisions around aggregating it, dropping it, combining it, doing all sorts of fancy things, changing the—you know, downsampling it. We can do this, on the trace side, we can do it both head based and tail based. On the metrics side, it's as it hits the control plane and then streams out. And then they only pay for the data that we store. So typically, customers are—they come on board and immediately reduce their observability dataset by 60%. Like, that's just straight up, that's the average.

And we've seen some customers get really aggressive, get up to, like, in the 90s, where they realize we're only using 10% of this data. Let's get rid of the rest of it. We're not going to pay for it. So, paying a lot less helps in a lot of ways. It also helps companies get more coverage of their observability. It also helps customers get more coverage of their overall stack. So, I was talking recently with an autonomous vehicle driving company that recently came to us from the Dog, and they had made some really tough choices and were no longer monitoring their pre-prod environments at all because they just couldn't afford to do it anymore.
It's like, well, now they can, and we're still saving the money.

Corey: I think that there's also the downstream effect of the money saving to that, for example, I don't fix observability bills directly. But, “Huh, why is your CloudWatch bill through the roof?” Or data egress charges in some cases? It's oh because your observability vendor is pounding the crap out of those endpoints and pulling all your log data across the internet, et cetera. And that tends to mean, oh, yeah, it's not just the first-order effect; it's the second and third and fourth-order effects this winds up having. It becomes almost a holistic challenge. I think that trying to put observability in its own bucket, on some level—when you're looking at it from a cost perspective—starts to be a, I guess, a structure that makes less and less sense in the fullness of time.

Rachel: Yeah, I would agree with that. I think that just looking at the bill from your vendor is one very small piece of the overall cost you're incurring. I mean, all of the things you mentioned, the egress, the CloudWatch, the other services it's impacting. What about the people?

Corey: Yeah, it sure is great that your team works for free.

Rachel: [laugh]. Exactly, right? I know, and it makes me think a little bit about that viral story about that particular company with a certain vendor that had a $65 million per year observability bill. And that impacted not just them, but, like, it showed up in both vendors' financial filings. Like, how did you get there? How did you get to that point? And I think this all comes back to the value in the ROI equation. Yes, we can all sit in our armchairs and be like, “Well, that was dumb,” but I know there are very smart people out there that just got into a bad situation by kicking the can down the road on not thinking about the strategy.

Corey: Absolutely.
I really want to thank you for taking the time to speak with me about, I guess, the bigger picture questions rather than the nuts and bolts of a product. I like understanding the overall view that drives a lot of these things. I don't feel I get to have enough of those conversations some weeks, so thank you for humoring me. If people want to learn more, where's the best place for them to go?

Rachel: So, they should definitely check out the Chronosphere website. Brand new beautiful spankin' new website: chronosphere.io. And you can also find me on LinkedIn. I'm not really on the Twitters so much anymore, but I'd love to chat with you on LinkedIn and hear what you have to say.

Corey: And we will, of course, put links to all of that in the [show notes 00:28:26]. Thank you so much for taking the time to speak with me. It's appreciated.

Rachel: Thank you, Corey. Always fun.

Corey: Rachel Dines, Head of Product and Solutions Marketing at Chronosphere. This has been a featured guest episode brought to us by our friends at Chronosphere, and I'm Corey Quinn. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry and insulting comment that I will one day read once I finished building my highly available rsyslog system to consume it with.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business, and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
The Evolution of OpenTelemetry with Austin Parker

Sep 5, 2023 (40:09)


Austin Parker, Community Maintainer at OpenTelemetry, joins Corey on Screaming in the Cloud to discuss OpenTelemetry's mission in the world of observability. Austin explains how the OpenTelemetry community was able to scale the OpenTelemetry project to a commercial offering, and the way OpenTelemetry is driving innovation in the data space. Corey and Austin also discuss why Austin decided to write a book on OpenTelemetry, and the book's focus on the evergreen applications of the tool.

About Austin

Austin Parker is the OpenTelemetry Community Maintainer, as well as an event organizer, public speaker, author, and general bon vivant. They've been a part of OpenTelemetry since its inception in 2019.

Links Referenced:
OpenTelemetry: https://opentelemetry.io/
Learning OpenTelemetry early release: https://www.oreilly.com/library/view/learning-opentelemetry/9781098147174/
Page with Austin's social links: https://social.ap2.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Look, I get it. Folks are being asked to do more and more. Most companies don't have a dedicated DBA because that person now has a full time job figuring out which one of AWS's multiple managed database offerings is right for every workload. Instead, developers and engineers are being asked to support, and heck, if time allows, optimize their databases. That's where OtterTune comes in. Their AI is your database co-pilot for MySQL and PostgreSQL on Amazon RDS or Aurora. It helps improve performance by up to four x or reduce costs by 50 percent – both of those are decent options. Go to ottertune dot com to learn more and start a free trial.
That's O-T-T-E-R-T-U-N-E dot com.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's been a few hundred episodes since I had Austin Parker on to talk about the things that Austin cares about. But it's time to rectify that. Austin is the community maintainer for OpenTelemetry, which is a CNCF project. If you're unfamiliar with it, we're probably going to fix that in short order. Austin, welcome back, it's been a month of Sundays.

Austin: It has been a month-and-a-half of Sundays. A whole pandemic-and-a-half.

Corey: So much has happened since then. I tried to instrument something with OpenTelemetry about a year-and-a-half ago, and in defense of the project, my use case is always very strange, but it felt like—a lot of things have sharp edges, but it felt like this had so many sharp edges that you just pivot to being a chainsaw, and I would have been at least a little bit more understanding of why it hurts so very much. But I have heard from people that I trust that the experience has gotten significantly better. Before we get into the nitty-gritty of me lobbing passive-aggressive bug reports at you for you to fix in a scenario in which you can't possibly refuse me, let's start with the beginning. What is OpenTelemetry?

Austin: That's a great question. Thank you for asking it. So, OpenTelemetry is an observability framework. It is run by the CNCF, you know, home of such wonderful award-winning technologies as Kubernetes, and you know, the second biggest source of YAML in the known universe [clear throat].

Corey: On some level, it feels like that is right there with hydrogen as far as unlimited resources in our universe.

Austin: It really is. And, you know, as we all know, there are two things that make, sort of, the DevOps and cloud world go around: one of them being, as you would probably know, AWS bills; and the second being YAML. But OpenTelemetry tries to kind of carve a path through this, right, because we're interested in observability.
And observability, for those that don't know or have been living under a rock or not reading blogs, it's a lot of things. It's a—but we can generally sort of describe it as, like, this is how you understand what your system is doing.

I like to describe it as, it's a way that we can model systems, especially complex, distributed, or decentralized software systems that are pretty commonly found in larg—you know, organizations of every shape and size, quite often running on Kubernetes, quite often running in public or private clouds. And the goal of observability is to help you, you know, model this system and understand what it's doing, which is something that I think we can all agree is a pretty important part of our job as software engineers. Where OpenTelemetry fits into this is as the framework that helps you get the telemetry data you need from those systems, put it into a universal format, and then ship it off to some observability back-end, you know, a Prometheus or a Datadog or whatever, in order to analyze that data and get answers to the questions you have.

Corey: From where I sit, the value of OTel—or OpenTelemetry; people in software engineering love abbreviations that are impenetrable from the outside, so of course, we're going to lean into that—but what I found for my own use case is the shining value prop was that I could instrument an application with OTel—in theory—and then send whatever I wanted that was emitted in terms of telemetry, be it events, be it logs, be it metrics, et cetera, and send that to any or all of a curation of vendors on a case-by-case basis, which meant that suddenly it was the first step in, I guess, an observability pipeline, which increasingly is starting to feel like a milit—like an industrial-observability complex, where there's so many different companies out there, it seems like a good approach to use, to start, I guess, racing vendors in different areas to see which performs better.
One of the challenges I've had with that when I started down that path is it felt like every vendor who was embracing OTel did it from a perspective of their implementation. Here's how to instrument it to—send it to us because we're the best, obviously. And you're a community maintainer, despite working at observability vendors yourself. You have always been one of those community-first types where you care more about the user experience than you do this quarter for any particular employer that you have, which to be very clear, is intended as a compliment, not a terrifying warning. It's why you have this authentic air to you and why you are one of those very few voices that I trust in a space where normally I need to approach it with significant skepticism. How do you see the relationship between vendors and OpenTelemetry?

Austin: I think the hard thing is that I know who signs my paychecks at the end of the day, right, and you always have, you know, some level of, you know, let's say bias, right? Because it is a bias to look after, you know, them who brought you to the dance. But I think you can be responsible with balancing, sort of, the needs of your employer, and the needs of the community. You know, the way I've always described this is that if you think about observability as, like, a—you know, as a market, what's the total addressable market there? It's literally everyone that uses software; it's literally every software company.

Which means there's plenty of room for people to make their numbers and to buy and sell and trade and do all this sort of stuff. And by taking that approach, by taking sort of the big picture approach and saying, “Well, look, you know, there's going to be—you know, of all these people, there are going to be some of them that are going to use our stuff and there are some of them that are going to use our competitor's stuff.” And that's fine.
Let's figure out where we can invest… in OpenTelemetry, in a way that makes sense for everyone and not just, you know, our people. So, let's build things like documentation, right?

You know, one of the things I'm most impressed with, with OpenTelemetry over the past, like, two years is we went from being, as a project, like, if you searched for OpenTelemetry, you would go and you would get five or six or ten different vendor pages coming up trying to tell you, like, “This is how you use it, this is how you use it.” And what we've done as a community is we've said, you know, “If you go looking for documentation, you should find our website. You should find our resources.” And we've managed to get the OpenTelemetry website to basically rank above almost everything else when people are searching for help with OpenTelemetry. And that's been really good because, one, it means that now, rather than vendors or whoever coming in and saying, like, “Well, we can do this better than you,” we can be like, “Well, look, just, you know, put your effort here, right? It's already the top result.
It's already where people are coming, and we can prove that.”

And two, it means that as people come in, they're going to be put into this process of community feedback, where they can go in, they can look at the docs, and they can say, “Oh, well, I had a bad experience here,” or, “How do I do this?” And we get that feedback and then we can improve the docs for everyone else by acting on that feedback, and the net result of this is that more people are using OpenTelemetry, which means there are more people kind of going into the tippy-tippy top of the funnel, right, that are able to become a customer of one of these myriad observability back ends.

Corey: You touched on something very important here, when I first was exploring this—you may have been looking over my shoulder as I went through this process—my impression initially was, oh, this is a ‘CNCF project' in quotes, where—this is not true universally, of course, but there are cases where it clearly is—where this is, effectively, a vendor-captured project, not necessarily by one vendor, but by an almost consortium of them. And that was my takeaway from OpenTelemetry. It was conversations with you, among others, that led me to believe no, no, this is not in that vein. This is clearly something that is a win. There are just a whole bunch of vendors more-or-less falling all over themselves, trying to stake out thought leadership and imply ownership, on some level, of where these things go. But I definitely left with a sense that this is bigger than any one vendor.

Austin: I would agree. I think, to even step back further, right, there's almost two different ways that I think vendors—or anyone—can approach OpenTelemetry, you know, from a market perspective, and one is to say, like, “Oh, this is socializing, kind of, the maintenance burden of instrumentation.” Which is a huge cost for commercial players, right?
Like, if you're a Datadog or a Splunk or whoever, you know, you have these agents that you go in and they rip telemetry out of your web servers, out of your gRPC libraries, whatever, and it costs a lot of money to pay engineers to maintain those instrumentation agents, right? And the cynical take is, oh, look at all these big companies that are kind of like pushing all that labor onto the open-source community, and you know, I'm not casting any aspersions here, like, I do think that there's an element of truth to it though because, yeah, that is a huge fixed cost.

And if you look at the actual lived reality of people and you look at back when SignalFx was still a going concern, right, and they had their APM agents open-sourced, you could go into the SignalFx repo and diff, like, their [Node Express 00:10:15] instrumentation against the Datadog Node Express instrumentation, and it's almost a hundred percent the same, right? Because it's truly a commodity. There's no—there's nothing interesting about how you get that telemetry out. The interesting stuff all happens after you have the telemetry and you've sent it to some back-end, and then you can, you know, analyze it and find interesting things. So, yeah, like, it doesn't make sense for there to be five or six or eight different companies all competing to rebuild the same wheels over and over and over and over when they don't have to.

I think the second thing that some people are starting to understand is that it's like, okay, let's take this a step beyond instrumentation, right? Because the goal of OpenTelemetry really is to make sure that this instrumentation is native so that you don't need a third-party agent, you don't need some other process or jar or whatever that you drop in and it instruments stuff for you. The JVM should provide this, your web framework should provide this, your RPC library should provide this right?
Like, this data should come from the code itself and be in a normalized fashion that can then be sent to any number of vendors or back ends or whatever. And that changes how—sort of, the competitive landscape a lot, I think, for observability vendors because rather than, kind of, what you have now, which is people competing on, like, well, how quickly can I throw this agent in and get set up and get a dashboard going, it really becomes more about, like, okay, how are you differentiating yourself against every other person that has access to the same data, right? And you get more interesting use cases and much more interesting analysis features, and that results in more innovation in, sort of, the industry than we've seen in a very long time.

Corey: For me, just from the customer side of the world, one of the biggest problems I had with observability in my career as an SRE-type for years was you would wind up building your observability pipeline around whatever vendor you had selected and that meant emphasizing the things they were good at and de-emphasizing the things that they weren't. And sometimes it's worked to your benefit; usually not. But then you always had this question when it got to things that touched on APM or whatnot—or Application Performance Monitoring—where oh, just embed our library into this. Okay, great. But a year-and-a-half ago, my exposure to this was on an application that I was running in distributed fashion on top of AWS Lambda.

So great, you can either use an extension for this or you can build in the library yourself, but then there's always a question of precedence where when you have multiple things that are looking at this from different points of view, which one gets done first? Which one is going to see the others? Which one is going to enmesh the other—enclose the others in its own perspective of the world? And it just got incredibly frustrating.
One of the—at least for me—bright lights of OTel was that it got away from that where all of the vendors receiving telemetry got the same view.

Austin: Yeah. They all get the same view, they all get the same data, and you know, there's a pretty rich collection of tools that we're starting to develop to help you build those pipelines yourselves and really own everything from the point of generation to intermediate collection to actually outputting it to wherever you want to go. For example, a lot of really interesting work has come out of the OpenTelemetry collector recently; one of them is this feature called Connectors. And Connectors let you take the output of certain pipelines and route them as inputs to another pipeline. And as part of that connection, you can transform stuff.

So, for example, let's say you have a bunch of [spans 00:14:05] or traces coming from your API endpoints, and you don't necessarily want to keep all those traces in their raw form because maybe they aren't interesting or maybe there's just too high of a volume. So, with Connectors, you can go and you can actually convert all of those spans into metrics and export them to a metrics database. You could continue to save that span data if you want, but you have options now, right? Like, you can take that span data and put it into cold storage or put it into, like, you know, some sort of slow blob storage thing where it's not actively indexed and it's slow lookups, and then keep a metric representation of it in your alerting pipeline, use metadata exemplars or whatever to kind of connect those things back.
And so, when you do suddenly see it's like, “Oh, well, there's some interesting p99 behavior,” or we're hitting an alert or violating an SLO or whatever, then you can go back and say, like, “Okay, well, let's go dig through the slow da—you know, let's look at the cold data to figure out what actually happened.”

And those are features that, historically, you would have needed to go to a big, important vendor and say, like, “Hey, here's a bunch of money,” right? Like, “Do this for me.” Now, you have the option to kind of do all that more interesting pipeline stuff yourself and then make choices about vendors based on, like, who is making a tool that can help me with the problem that I have? Because most of the time, I don't—I feel like we tend to treat observability tools as—it depends a lot on where you sit in the org—but you've certainly seen this movement towards, like, “Well, we don't want a tool; we want a platform. We want to go to Lowe's and we want to get the 48-in-one kit that has a bunch of things in it. And we're going to pay for the 48-in-one kit, even if we only need, like, two things or three things out of it.”

OpenTelemetry lets you kind of step back and say, like, “Well, what if we just got, like, really high-quality tools for the two or three things we need, and then for the rest of the stuff, we can use other cheaper options?” Which is, I think, really attractive, especially in today's macroeconomic conditions, let's say.

Corey: One thing I'm trying to wrap my head around because we all find when it comes to observability, in my experience, it's the parable of three blind people trying to describe an elephant by touch; depending on where you are on the elephant, you have a very different perspective. What I'm trying to wrap my head around is, what is the vision for OpenTelemetry?
Is it specifically envisioned to be the agent that runs wherever the workload is, whether it's an agent on a host or a layer in a Lambda function, or a sidecar or whatnot in a Kubernetes cluster that winds up gathering and sending data out? Or is the vision something different? Because part of what you're saying aligns with my perspective on it, but other parts of it seem to—that there's a misunderstanding somewhere, and it's almost certainly on my part.

Austin: I think the long-term vision is that you as a developer, you as an SRE, don't even have to think about OpenTelemetry, that when you are using your container orchestrator or you are using your API framework or you're using your Managed API Gateway, or any kind of software that you're building something with, that the telemetry data from that software is emitted in an OpenTelemetry format, right? And when you are writing your code, you know, and you're using gRPC, let's say, you could just natively expect that OpenTelemetry is kind of there in the background and it's integrated into the actual libraries themselves. And so, you can just call the OpenTelemetry API and it's part of the standard library almost, right? You add some additional metadata to a span and say, like, “Oh, this is the customer ID,” or, “This is some interesting attribute that I want to track for later on,” or, “I'm going to create a histogram here or counter,” whatever it is, and then all that data is just kind of there, right, invisible to you unless you need it. And then when you need it, it's there for you to kind of pick up and send off somewhere to any number of back-ends or databases or whatnot that you could then use to discover problems or better model your system.

That's the long-term vision, right, that it's just there, everyone uses it. It is a de facto and de jure standard.
I think in the medium term, it does look a little bit more like OpenTelemetry is kind of this Swiss army knife agent that's running on—in sidecars in Kubernetes or it's running on your EC2 instance. Until we get to the point of everyone just agrees that we're going to use OpenTelemetry protocol for the data and we're going to use all your stuff and we just natively emit it, then that's going to be how long we're in that midpoint. But that's sort of the medium and long-term vision I think. Does that track?

Corey: It does. And I'm trying to equate this to—like the evolution back in the Stone Age was back when I was first getting started, Nagios was the gold standard. It was kind of the original Call of Duty. And it was awful. There were a bunch of problems with it, but it also worked.

And I'm not trying to dunk on the people who built that. We all stand on the shoulders of giants. It was an open-source project that was awesome doing exactly what it did, but it was a product built for a very different time. It completely had the wheels fall off as soon as you got to things that were even slightly ephemeral because it required this idea of the server needed to know where all of the things it was monitoring lived, on an individual host basis, so there was this constant joy of, “Oh, we're going to add things to a cluster.” Its perspective was, “What's a cluster?” Or you'd have these problems with a core switch going down and suddenly everything else would explode as well.

And even setting up an on-call rotation for who got paged when was nightmarish. And a bunch of things have evolved since then, which is putting it mildly. Like, you could say that about fire, the invention of the wheel. Yeah, a lot of things have evolved since the invention of the wheel, and here we are tricking sand into thinking.
But we find ourselves just—now it seems that the outcome of all of this has been instead of one option that's the de facto standard that's kind of terrible in its own ways, now, we have an entire universe of different products, many of which are best-of-breed at one very specific thing, but nothing's great at everything.

It's the multifunction printer conundrum, where you find things that are great at one or two things at most, and then mediocre at best at the rest. I'm excited about the possibility for OpenTelemetry to really get to a point of best-of-breed for everything. But it also feels like the money folks are pushing for consolidation, if you believe a lot of the analyst reports around this of, “We already pay for seven different observability vendors. How about we knock it down to just one that does all of these things?” Because that would be terrible. Where do you land on that?

Austin: Well, as I intu—or alluded to this earlier, I think the consolidation in the observability space, in general, is very much driven by that force you just pointed out, right? The buyers want to consolidate more and more things into single tools. And I think there's a lot of… there are reasons for that that—you know, there are good reasons for that, but I also feel like a lot of those reasons are driven by fundamentally telemetry-side concerns, right? So like, one example of this is if you were Large Business X, and you see—you are an engineering director and you get a report, that's like, “We have eight different metrics products.” And you're like, “That seems like a lot. Let's just use Brand X.”

And Brand X will very, very happily tell you, like, “Oh, you just install our thing everywhere and you can get rid of all these other tools.” And usually, there's two reasons that people pick tools, right?
One reason is that they are forced to and then they are forced to do a bunch of integration work to get whatever the old stuff was working in the new way, but the other reason is because they tried a bunch of different things and they found the one tool that actually worked for them. And what happens invariably in these sort of consolidation stories is, you know, the new vendor comes in on a shining horse to consolidate, and you wind up instead of eight distinct metrics tools, now you have nine distinct metrics tools because there's never any bandwidth for people to go back and, you know—your Nagios example, right, Nag—people still use Nagios every day. What's the economic justification to take all those Nagios installs, if they're working, and put them into something else, right?

What's the economic justification to go and take a bunch of old software that hasn't been touched for ten years that still runs and still does what it needs to do, like, where's the incentive to go and re-instrument that with OpenTelemetry or anything else? It doesn't necessarily exist, right? And that's a pretty, I think, fundamental decision point in everyone's observability journey, which is what do you do about all the old stuff? Because most of the stuff is the old stuff and the worst part is, most of the stuff that you make money off of is the old stuff as well. So, you can't ignore it, and if you're spending, you know, millions and millions of dollars on the new stuff—like, there was a story that went around a while ago, I think, Coinbase spent something like, what, $60 million on Datadog… I hope they asked for it in real money and not Bitcoin. But—

Corey: Yeah, something I've noticed about all the vendors, and even Coinbase themselves, very few of them actually transact in cryptocurrency. It's always cash on the barrelhead, so to speak.

Austin: Yeah, smart. But still, like, that's an absurd amount of money [laugh] for any product or service, I would argue, right?
But that's just my perspective. I do think though, it goes to show you that you know, it's very easy to get into these sort of things where you're just spending over the barrel to, like, the newest vendor that's going to come in and solve all your problems for you. And just, it often doesn't work that way because most places aren't—especially large organizations—just aren't built in this sort of, like, “Oh, we can go through and we can just redo stuff,” right? “We can just roll out a new agent through… whatever.”

We have mainframes [unintelligible 00:25:09], mainframes to think about, you have… in many cases, you have an awful lot of business systems that most, kind of, cloud people don't, like, think about, right, like SAP or Salesforce or ServiceNow, or whatever. And those sort of business process systems are actually responsible for quite a few things that are interesting from an observability point of view. But you don't see—I mean, hell, you don't even see OpenTelemetry going out and saying, like, “Oh, well, here's the thing to let you, you know, observe Apex applications on Salesforce,” right? It's kind of an undiscovered country in a lot of ways and it's something that I think we will have to grapple with as we go forward. In the shorter term, there's a reason that OpenTelemetry mostly focuses on cloud-native applications because that's a little bit easier to actually do what we're trying to do on them and that's where the heat and light is. But once we get done with that, then the sky is the limit.

[midroll 00:26:11]

Corey: It still feels like OpenTelemetry is evolving rapidly. It's certainly not, I don't want to say it's not feature complete, which, again, what—software is never done. But it does seem like even quarter-to-quarter or month-to-month, its capabilities expand massively. Because you apparently enjoy pain, you're in the process of writing a book. I think it's in early release or early access that comes out next year, 2024.
Why would you do such a thing?

Austin: That's a great question. And if I ever figure out the answer I will tell you.

Corey: Remember, no one wants to write a book; they want to have written the book.

Austin: And the worst part is, is I have written the book and for some reason, I went back for another round. I—

Corey: It's like childbirth. No one remembers exactly how horrible it was.

Austin: Yeah, my partner could probably attest to that. Although I was in the room, and I don't think I'd want to do it either. So, I think the real, you know, the real reason that I decided to go and kind of write this book—and it's Learning OpenTelemetry; it's in early release right now on the O'Reilly learning platform and it'll be out in print and digital next year, I believe, we're targeting right now, early next year.

But the goal is, as you pointed out so eloquently, OpenTelemetry changes a lot. And it changes month to month sometimes. So, why would someone decide—say, “Hey, I'm going to write the book about learning this?” Well, there's a very good reason for that and it is that I've looked at a lot of the other books out there on OpenTelemetry, on observability in general, and they talk a lot about, like, here's how you use the API. Here's how you use the SDK. Here's how you make a trace or a span or a log statement or whatever. And it's very technical; it's very kind of in the weeds.

What I was interested in is saying, like, “Okay, let's put all that stuff aside because you don't necessarily…” I'm not saying any of that stuff's going to change. And I'm not saying that how to make a span is going to change tomorrow; it's not, but learning how to actually use something like OpenTelemetry isn't just knowing how to create a measurement or how to create a trace. It's, how do I actually use this in a production system? To my point earlier, how do I use this to get data about, you know, these quote-unquote, “Legacy systems?” How do I use this to monitor a Kubernetes cluster?
What are the important parts of building these observability pipelines? If I'm maintaining a library, how should I integrate OpenTelemetry into that library for my users? And so on, and so on, and so forth.

And the answers to those questions actually probably aren't going to change a ton over the next four or five years. Which is good because that makes it the perfect thing to write a book about. So, the goal of Learning OpenTelemetry is to help you learn not just how to use OpenTelemetry at an API or SDK level, but it's how to build an observability pipeline with OpenTelemetry, it's how to roll it out to an organization, it's how to convince your boss that this is what you should use, both for new and maybe picking up some legacy development. It's really meant to give you that sort of 10,000-foot view of what are the benefits of this, how does it bring value and how can you use it to build value for an observability practice in an organization?

Corey: I think that's fair. Looking at the more quote-unquote, “Evergreen,” style of content as opposed to—like, that's the reason for example, I never wind up doing tutorials on how to use an AWS service because one console change away and suddenly I have to redo the entire thing. That's a treadmill I never had much interest in getting on. One last topic I want to get into before we wind up wrapping the episode—because I almost feel obligated to sprinkle this all over everything because the analysts told me I have to—what's your take on generative AI, specifically with an eye toward observability?

Austin: [sigh], gosh, I've been thinking a lot about this. And—hot take alert—as a skeptic of many technological bubbles over the past five or so years, ten years, I'm actually pretty hot on AI—generative AI, large language models, things like that—but not for the reasons that people like to kind of hold them up, right?
Not so that we can all make our perfect, funny [sigh], deep dream, meme characters or whatever through Stable Fusion or whatever ChatGPT spits out at us when we ask for a joke. I think the real win here is that this to me is, like, the biggest advance in human-computer interaction since resistive touchscreens. Actually, probably since the mouse.Corey: I would agree with that.Austin: And I don't know if anyone has tried to get someone that is, you know, over the age of 70 to use a computer at any time in their life, but mapping human language to trying to do something on an operating system or do something on a computer on the web is honestly one of the most challenging things that faces interface design, face OS designers, faces anyone. And I think this also applies for dev tools in general, right? Like, if you think about observability, if you think about, like, well, what are the actual tasks involved in observability? It's like, well, you're making—you're asking questions. You're saying, like, “Hey, for this metric named HTTPrequestsByCode,” and there's four or five dimensions, and you say, like, “Okay, well break this down for me.” You know, you have to kind of know the magic words, right? You have to know the magic promQL sequence or whatever else to plug in and to get it to graph that for you.And you as an operator have to have this very, very well developed, like, depth of knowledge and math and statistics to really kind of get a lot of—Corey: You must be at least this smart to ride on this ride.Austin: Yeah. And I think that, like that, to me is the real—the short-term win for certainly generative AI around using, like, large language models, is the ability to create human language interfaces to observability tools, that—Corey: As opposed to learning your own custom SQL dialect, which I see a fair number of times.Austin: Right. 
And, you know, and it's actually very funny because there was a while for the—like, one of my kind of side projects for the past [sigh] a little bit [unintelligible 00:32:31] idea of, like, well, can we make, like, a universal query language or universal query layer that you could ship your dashboards or ship your alerts or whatever. And then it's like, generative AI kind of just, you know, completely leapfrogs that, right? It just says, like, well, why would you need a query language, if we can just—if you can just ask the computer and it works, right?Corey: The most common programming language is about to become English.Austin: Which I mean, there's an awful lot of externalities there—Corey: Which is great. I want to be clear. I'm not here to gatekeep.Austin: Yeah. I mean, I think there's a lot of externalities there, and there's a lot—and the kind of hype to provable benefit ratio is very skewed right now towards hype. That said, one of the things that is concerning to me as sort of an observability practitioner is the amount of people that are just, like, whole-hog, throwing themselves into, like, oh, we need to integrate generative AI, right? Like, we need to put AI chatbots and we need to have ChatGPT built into our products and da-da-da-da-da. And now you kind of have this perfect storm of people that really don't ha—because they're just using these APIs to integrate gen AI stuff with, they really don't understand what it's doing because a lot you know, it is very complex, and I'll be the first to admit that I really don't understand what a lot of it is doing, you know, on the deep, on the foundational math side.But if we're going to have trust in, kind of, any kind of system, we have to understand what it's doing, right? 
And so, the only way that we can understand what it's doing is through observability, which means it's incredibly important for organizations and companies that are building products on generative AI to, like, drop what—you know, walk—don't walk, run towards something that is going to give you observability into these language models.Corey: Yeah. “The computer said so,” is strangely dissatisfying.Austin: Yeah. You need to have that base, you know, sort of, performance [goals and signals 00:34:31], obviously, but you also need to really understand what are the questions being asked. As an example, let's say you have something that is tokenizing questions. You really probably do want to have some sort of observability on the hot path there that lets you kind of break down common tokens, especially if you were using, like, custom dialects or, like, vectors or whatever to modify the, you know, neural network model, like, you really want to see, like, well, what's the frequency of the certain tokens that I'm getting they're hitting the vectors versus not right? Like, where can I improve these sorts of things? Where am I getting, like, unexpected results?And maybe even have some sort of continuous feedback mechanism that it could be either analyzing the tone and tenor of end-user responses or you can have the little, like, frowny and happy face, whatever it is, like, something that is giving you that kind of constant feedback about, like, hey, this is how people are actually like interacting with it. Because I think there's way too many stories right now people just kind of like saying, like, “Oh, okay. Here's some AI-powered search,” and people just, like, hating it. Because people are already very primed to distrust AI, I think. And I can't blame anyone.Corey: Well, we've had an entire lifetime of movies telling us that's going to kill us all.Austin: Yeah.Corey: And now you have a bunch of, also, billionaire tech owners who are basically intent on making that reality. 
But that's neither here nor there.

Austin: It isn't, but like I said, it's difficult. It's actually one of the first times that I've found myself very conflicted.

Corey: Yeah, I'm a booster of this stuff; I love it, but at the same time, you have some of the ridiculous hype around it and the complete lack of attention to safety and humanity aspects of it. I like the technology and I think it has a lot of promise, but I don't want to get lumped in with that set.

Austin: Exactly. Like, the technology is great. The fan base is… ehh, maybe something a little different. But I do think that, not to be an inevitable-ist or whatever, there is a significant amount of, like, this is a genie you can't put back in the bottle and it is going to have, like, wide-ranging, transformative effects on the discipline of, like, software development, software engineering, and white collar work in general, right? Like, if your job involves, like, putting numbers into Excel and making pretty spreadsheets, then ooh, that doesn't seem like something that's going to do too hot when I can just have Excel do that for me.

And I think we do need to be aware of that, right? Like, we do need to have that sort of conversation about, like… what are we actually comfortable doing here in terms of displacing human labor? When we do displace human labor, are we doing it so that we can actually give people leisure time or so that we can just cram even more work down the throats of the humans that are left?

Corey: And unfortunately, I think we might know what that answer is, at least on our current path.

Austin: That's true. But you know, I'm an optimist.

Corey: I… don't do well with disappointment. Which the show has certainly not been. I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place for them to find you?

Austin: Welp, you can find me on most social media. 
Many, many social medias. I used to be on Twitter a lot, and we all know what happened there. The best place to figure out what's going on is check out my bio, social.ap2.io will give you all the links to where I am. And yeah, been great talking with you.Corey: Likewise. Thank you so much for taking the time out of your day. Austin Parker, community maintainer for OpenTelemetry. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment pointing out that actually, physicists say the vast majority of the universe's empty space, so that we can later correct you by saying ah, but it's empty whitespace. That's right. YAML wins again.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Software Defined Talk
Episode 416: Exchange your “Buddy Bucks” for Nagios

May 26, 2023 (47:02)


This week we discuss Private Equity buying New Relic and review the Gartner MQ for APM like no one else. Plus, some thoughts on yogurt, fruit and almonds… Watch the YouTube Live Recording of Episode 416 (https://www.youtube.com/watch?v=cxODnsIekJo) Runner-up Titles How big are your almonds? It's under consideration The Olympics of Monitoring Putting Hats on Cats Stanley Nickels for Monitoring The 3 Fines It's just crazy enough [that] it might work I was told the test was log management IDC is the thin balogna Rundown New Relic is getting bought so let's overanalyze the APM MQ WSJ News Exclusive | Software Company New Relic in Talks to Be Sold (https://www.wsj.com/articles/software-company-new-relic-in-talks-to-be-sold-8079db4e) 2022 Gartner® Magic Quadrant™ - For APM and Observability (https://www.dynatrace.com/monitoring/gartner-magic-quadrant-for-application-performance-monitoring-observability/) IDC APM marketshare from 2021 (https://www.idc.com/getdoc.jsp?containerId=US49596822&pageType=PRINTFRIENDLY) Relevant to your Interests Kustomer, the CRM startup acquired by Meta last year for $1B, spins out on a $250M valuation (https://techcrunch.com/2023/05/16/kustomer-meta-spin-out/) Harvard's Famous Study on Happiness Might Surprise You - Meetup Blog (https://www.meetup.com/blog/harvards-famous-study-on-happiness-might-surprise-you/) Historical software multiples at IPO (https://twitter.com/glennsolomon/status/1658873511233093632) Introducing the ChatGPT app for iOS (https://openai.com/blog/introducing-the-chatgpt-app-for-ios) Luxottica confirms 2021 data breach after info of 70M leaks online (https://www.bleepingcomputer.com/news/security/luxottica-confirms-2021-data-breach-after-info-of-70m-leaks-online/) Redpanda CEO on business source (BSL) over open source (https://youtu.be/Yp2KtNRAzxA) Australian stock exchange says software overhaul won't involve blockchain 
(https://www.reuters.com/markets/australian-stock-exchange-says-software-overhaul-wont-involve-blockchain-2023-05-19/) Shutterstock (NYSE: SSTK) agreed to buy Giphy (https://www.axios.com/newsletters/axios-pro-rata-fba83bc3-5b15-4374-8a3d-8d2829b1d001.html?chunk=1&utm_term=emshare#story1) The 5 biggest announcements from Microsoft Build 2023 (https://www.theverge.com/23734104/microsoft-build-2023-ai-bing-copilot) Microsoft launches Fabric, a new end-to-end data and analytics platform (https://techcrunch.com/2023/05/23/microsoft-launches-fabric-a-new-end-to-end-data-and-analytics-platform/) Alibaba misses revenue estimate, approves cloud unit spinoff (https://www.reuters.com/business/retail-consumer/alibaba-fourth-quarter-revenue-rises-2-2023-05-18/) Laid-off Big Tech workers are getting invited to return—on unfavorable terms (https://www.itbrew.com/stories/2023/05/12/laid-off-big-tech-workers-are-getting-invited-to-return-on-unfavorable-terms?cid=31524316.162675&mblid=bfd7ff44f422&mid=eb84400822928b8efbb6c2dcb0460be6&utm_campaign=etb&utm_medium=newsletter&utm_source=morning_brew) Nonsense A Texas A&M-Commerce Rodeo Instructor failed his entire class due to a basic misunderstanding of how "Chat GTP" works (https://softwaredefinedtalk.slack.com/archives/C04GSN7U5S9/p1684426841082349) The Star Wars: Galactic Starcruiser at Walt Disney World will take its final voyage (https://twitter.com/ScottGustin/status/1659276676889473050) Opening 100 pppts at once? Challenge accepted! (https://youtu.be/xRZM-EDsmuI) The Latest Sriracha Shortage Looks Worse Than The First One (https://www.delish.com/food-news/a43578818/sriracha-shortage-again/) Conferences June 1st VMUG Belgium in Brussels (https://vmug.be/) , free. June 7th State of Kubernetes overview (https://tanzu.vmware.com/content/webinars/jun-7-emea-state-of-kubernetes-2023-solving-kubernetes-challenges?utm_source=cote&utm_campaign=devrel&utm_content=newsletter), online. 
June 8th to 9th PlatformCon (https://platformcon.com/), online. June 22nd to 23rd DevOpsDays Amsterdam (https://devopsdays.org/events/2023-amsterdam/welcome/), attending. June 27th to 30th FinOps X (https://x.finops.org/) San Diego, attending. August 8th Kubernetes Community Day Australia (https://community.cncf.io/events/details/cncf-kcd-australia-presents-kubernetes-community-day-australia-2023/), attending. August 21st to 24th SpringOne (https://springone.io/) & VMware Explore US (https://www.vmware.com/explore/us.html), in Las Vegas. Explore EU CFP is open. Sep 6th to 7th DevOpsDays Des Moines (https://devopsdays.org/events/2023-des-moines/welcome/), speaking. Sep 18th to 19th SHIFT (https://shift.infobip.com/) in Zadar, speaking. If you want your conference mentioned, let's talk media sponsorships. SDT news & hype Join us in Slack (http://www.softwaredefinedtalk.com/slack). Get a SDT Sticker! Send your postal address to stickers@softwaredefinedtalk.com (mailto:stickers@softwaredefinedtalk.com) and we will send you free laptop stickers! Follow us on Twitch (https://www.twitch.tv/sdtpodcast), Twitter (https://twitter.com/softwaredeftalk), Instagram (https://www.instagram.com/softwaredefinedtalk/), Mastodon (https://hachyderm.io/@softwaredefinedtalk), LinkedIn (https://www.linkedin.com/company/software-defined-talk/), TikTok (https://www.tiktok.com/@softwaredefinedtalk) and YouTube (https://www.youtube.com/channel/UCi3OJPV6h9tp-hbsGBLGsDQ/featured). Use the code SDT to get $20 off Coté's book, Digital WTF (https://leanpub.com/digitalwtf/c/sdt), so $5 total. Become a sponsor of Software Defined Talk (https://www.softwaredefinedtalk.com/ads)! Recommendations Brandon: What's new in Mimestream 1.0? (https://mimestream.com/blog/whats-new-in-1.0) Matt: Have a picnic Coté: Hoodie and shorts for the spring; the “web browsing” plugin in ChatGPT does not work well. 
Photo Credits Header (https://unsplash.com/photos/FhroWi5pP20) Artwork (https://labs.openai.com/e/yvzRVBom4tqYBbLZKpY7sp02/O3MXdgOZ37BRYs2JsOhXFy44)

Software Defined Talk
Episode 415: You can buy a lot of Nagios with that

May 19, 2023 (53:48)


This week we discuss Coinbase's $65 million DataDog bill, the factors that drive developer experience, and Google Bard. Plus, some tips on London Airports and the ideal airport arrival time. Watch the YouTube Live Recording of Episode 415 (https://www.youtube.com/watch?v=ysNsrYSkYsY&t=4s) Runner-up Titles Two hours before boarding $65 million is a lot of nines You can buy a lot of Nagios with that Just write the check for $65 million Boxes Inciting chaos Steady state is stopped Developers are just like the rest of us Unsummarizable AI in the Streets Rundown Datadog's $65M/year customer mystery solved (https://newsletter.pragmaticengineer.com/p/datadogs-65myear-customer-mystery) Google drops waitlist for AI chatbot Bard and announces oodles of new features (https://www.theverge.com/2023/5/10/23718066/google-bard-ai-features-waitlist-dark-mode-visual-search-io) DevEx: What Actually Drives Productivity (https://queue.acm.org/detail.cfm?id=3595878) Relevant to your Interests Google launches a GitHub Copilot competitor (https://techcrunch.com/2023/05/10/google-launches-a-github-copilot-competitor/) IBM Watson missed the AI revolution, but Watsonx could become the heartbeat of the Generative Enterprise - Horses for Sources | No Boundaries (https://www.horsesforsources.com/ibm-watsonx_051023/) It's not just cloud costs that are out of control (https://world.hey.com/dhh/it-s-not-just-cloud-costs-that-are-out-of-control-efcd098c) Microsoft just made a huge, far-from-certain bet on nuclear fusion (https://www.theverge.com/2023/5/10/23717332/microsoft-nuclear-fusion-power-plant-helion-purchase-agreement?_hsmi=257843304) Testing a new encrypted messaging app's extraordinary claims (https://crnkovic.dev/testing-converso/) Adrian Cockcroft (@adrianco@mastodon.social) (https://mastodon.social/@adrianco/110327883488321584) Goldman Sachs downgrades Twilio, other software names on 'limited catalyst' path (TWLO) 
(https://seekingalpha.com/news/3970029-goldman-sachs-downgrades-twilio-other-software-names-on-limited-catalyst-path) Meta announces generative AI features for advertisers (https://techcrunch.com/2023/05/11/meta-announces-generative-ai-features-for-advertisers/?utm_source=newsletter&utm_medium=email&utm_campaign=newsletter_axioslogin&stream=top) The .zip TLD sucks and it needs to be immediately revoked. (https://financialstatement.zip/) Netflix Reportedly Cutting Spending By $300 Million After Delay On Password Sharing Crackdown (https://www.forbes.com/sites/katherinehamilton/2023/05/12/netflix-reportedly-cutting-spending-by-300-million-after-delay-on-password-sharing-crackdown/?sh=4c6e2a1eaee2) Thank you, Enterprisers! (https://enterprisersproject.com/article/2023/5/thank-you-enterprisers) Somehow OpenSearch has succeeded (https://www.infoworld.com/article/3695576/somehow-opensearch-has-succeeded.html) Docker makes comeback with over $50M in ARR two years into restructuring (https://techcrunch.com/2022/02/01/docker-makes-comeback-reaching-over-50m-in-arr/) Calendly Welcomes Former Salesforce Product Executive Stephen Hsu as CPO (https://www.businesswire.com/news/home/20230509005168/en/Calendly-Welcomes-Former-Salesforce-Product-Executive-Stephen-Hsu-as-CPO?utm_source=newsletter&utm_medium=email&utm_campaign=newsletter_axioslogin&stream=top) Farewell to the Era of Cheap EC2 Spot Instances | Eric Pauley (https://pauley.me/post/2023/spot-price-trends/) The .zip TLD sucks and it needs to be immediately revoked. (https://financialstatement.zip/) The Dangers of Google's .zip TLD (https://medium.com/@bobbyrsec/the-dangers-of-googles-zip-tld-5e1e675e59a5) Conferences June 1st VMUG Belgium in Brussels (https://vmug.be/) , free. June 7th State of Kubernetes overview (https://tanzu.vmware.com/content/webinars/jun-7-emea-state-of-kubernetes-2023-solving-kubernetes-challenges?utm_source=cote&utm_campaign=devrel&utm_content=newsletter), online. 
June 8th to 9th PlatformCon (https://platformcon.com/), online. June 22nd to 23rd DevOpsDays Amsterdam (https://devopsdays.org/events/2023-amsterdam/welcome/), attending. June 27th to 30th FinOps X (https://x.finops.org/) San Diego, attending. August 21st to 24th SpringOne (https://springone.io/) & VMware Explore US (https://www.vmware.com/explore/us.html), in Las Vegas. Explore EU CFP is open. Sep 6th to 7th DevOpsDays Des Moines (https://devopsdays.org/events/2023-des-moines/welcome/), speaking. Sep 18th to 19th SHIFT (https://shift.infobip.com/) in Zadar, speaking. If you want your conference mentioned, let's talk media sponsorships. SDT news & hype Join us in Slack (http://www.softwaredefinedtalk.com/slack). Get a SDT Sticker! Send your postal address to stickers@softwaredefinedtalk.com (mailto:stickers@softwaredefinedtalk.com) and we will send you free laptop stickers! Follow us on Twitch (https://www.twitch.tv/sdtpodcast), Twitter (https://twitter.com/softwaredeftalk), Instagram (https://www.instagram.com/softwaredefinedtalk/), Mastodon (https://hachyderm.io/@softwaredefinedtalk), LinkedIn (https://www.linkedin.com/company/software-defined-talk/), TikTok (https://www.tiktok.com/@softwaredefinedtalk) and YouTube (https://www.youtube.com/channel/UCi3OJPV6h9tp-hbsGBLGsDQ/featured). Use the code SDT to get $20 off Coté's book, Digital WTF (https://leanpub.com/digitalwtf/c/sdt), so $5 total. Become a sponsor of Software Defined Talk (https://www.softwaredefinedtalk.com/ads)! Recommendations Brandon: POLAR H9 Heart Rate Sensor (https://www.amazon.com/dp/B08411DQ96?psc=1&ref=ppx_yo2ov_dt_b_product_details) Matt: Critical Mass (https://amzn.to/42MCaMg) Sydney Half Marathon (https://runawaysydneyhalf.com.au/) Coté: Pipers Great Berwick Longhorn Beef “crisps.” (https://www.piperscrisps.com/en/great-berwick-longhorn-beef) Photo Credits Header (https://unsplash.com/photos/so9S78y64zk) Artwork (https://labs.openai.com/e/oLKoWqA6Gqx4kvjQ2wKHkgK4/rfcnCgVl74kVlDhNE4OEwHRy)

Screaming in the Cloud
Making Open-Source Multi-Cloud Truly Free with AB Periasamy

Mar 28, 2023 (40:04)


AB Periasamy, Co-Founder and CEO of MinIO, joins Corey on Screaming in the Cloud to discuss what it means to be truly open source and the current and future state of multi-cloud. AB explains how MinIO was born from the idea that the world was going to produce a massive amount of data, and what it's been like to see that come true and continue to be the future outlook. AB and Corey explore why some companies are hesitant to move to cloud, and AB describes why he feels the move is inevitable regardless of cost. AB also reveals how he has helped create a truly free open-source software, and how his partnership with Amazon has been beneficial.

About AB

AB Periasamy is the co-founder and CEO of MinIO, an open source provider of high performance, object storage software. In addition to this role, AB is an active investor and advisor to a wide range of technology companies, from H2O.ai and Manetu where he serves on the board to advisor or investor roles with Humio, Isovalent, Starburst, Yugabyte, Tetrate, Postman, Storj, Procurify, and Helpshift. Successful exits include Gitter.im (Gitlab), Treasure Data (ARM) and Fastor (SMART).

AB co-founded Gluster in 2005 to commoditize scalable storage systems. As CTO, he was the primary architect and strategist for the development of the Gluster file system, a pioneer in software defined storage. After the company was acquired by Red Hat in 2011, AB joined Red Hat's Office of the CTO. Prior to Gluster, AB was CTO of California Digital Corporation, where his work led to scaling of the commodity cluster computing to supercomputing class performance. His work there resulted in the development of Lawrence Livermore Laboratory's “Thunder” code, which, at the time was the second fastest in the world.
AB holds a Computer Science Engineering degree from Annamalai University, Tamil Nadu, India.

AB is one of the leading proponents and thinkers on the subject of open source software, articulating the difference between the philosophy and business model. An active contributor to a number of open source projects, he is a board member of India's Free Software Foundation.

Links Referenced:
MinIO: https://min.io/
Twitter: https://twitter.com/abperiasamy
LinkedIn: https://www.linkedin.com/in/abperiasamy/
Email: mailto:ab@min.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at Chronosphere. When it costs more money and time to observe your environment than it does to build it, there's a problem. With Chronosphere, you can shape and transform observability data based on need, context and utility. Learn how to only store the useful data you need to see in order to reduce costs and improve performance at chronosphere.io/corey-quinn. That's chronosphere.io/corey-quinn. And my thanks to them for sponsoring my ridiculous nonsense.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn, and I have taken a somewhat strong stance over the years on the relative merits of multi-cloud, and when it makes sense and when it doesn't. And it's time for me to start modifying some of those. To have that conversation and several others as well, with me today on this promoted guest episode is AB Periasamy, CEO and co-founder of MinIO. AB, it's great to have you back.

AB: Yes, it's wonderful to be here again, Corey.

Corey: So, one thing that I want to start with is defining terms. 
Because when we talk about multi-cloud, there are—to my mind at least—smart ways to do it and ways that are frankly ignorant. The thing that I've never quite seen is, it's greenfield, day one. Time to build something. Let's make sure we can build and deploy it to every cloud provider we might ever want to use.And that is usually not the right path. Whereas different workloads in different providers, that starts to make a lot more sense. When you do mergers and acquisitions, as big companies tend to do in lieu of doing anything interesting, it seems like they find it oh, we're suddenly in multiple cloud providers, should we move this acquisition to a new cloud? No. No, you should not.One of the challenges, of course, is that there's a lot of differentiation between the baseline offerings that cloud providers have. MinIO is interesting in that it starts and stops with an object store that is mostly S3 API compatible. Have I nailed the basic premise of what it is you folks do?AB: Yeah, it's basically an object store. Amazon S3 versus us, it's actually—that's the comparable, right? Amazon S3 is a hosted cloud storage as a service, but underneath the underlying technology is called object-store. MinIO is a software and it's also open-source and it's the software that you can deploy on the cloud, deploy on the edge, deploy anywhere, and both Amazon S3 and MinIO are exactly S3 API compatible. It's a drop-in replacement. You can write applications on MinIO and take it to AWS S3, and do the reverse. Amazon made S3 API a standard inside AWS, we made S3 API standard across the whole cloud, all the cloud edge, everywhere, rest of the world.Corey: I want to clarify two points because otherwise I know I'm going to get nibbled to death by ducks on the internet. 
When you say open-source, it is actually open-source; you're AGPL, not source available, or, “We've decided now we're going to change our model for licensing because oh, some people are using this without paying us money,” as so many companies seem to fall into that trap. You are actually open-source and no one reasonable is going to be able to disagree with that definition.The other pedantic part of it is when something says that it's S3 compatible on an API basis, like, the question is always does that include the weird bugs that we wish it wouldn't have, or some of the more esoteric stuff that seems to be a constant source of innovation? To be clear, I don't think that you need to be particularly compatible with those very corner and vertex cases. For me, it's always been the basic CRUD operations: can you store an object? Can you give it back to me? Can you delete the thing? And maybe an update, although generally object stores tend to be atomic. How far do you go down that path of being, I guess, a faithful implementation of what the S3 API does, and at which point you decide that something is just, honestly, lunacy and you feel no need to wind up supporting that?AB: Yeah, the unfortunate part of it is we have to be very, very deep. It only takes one API to break. And it's not even, like, one API we did not implement; one API under a particular circumstance, right? Like even if you see, like, AWS SDK is, right, Java SDK, different versions of Java SDK will interpret the same API differently. And AWS S3 is an API, it's not a standard.And Amazon has published the REST specifications, API specs, but they are more like religious text. You can interpret it in many ways. Amazon's own SDK has interpreted, like, this in several ways, right? The only way to get it right is, like, you have to have a massive ecosystem around your application. 
And if one thing breaks—today, if I commit a code and it introduced a regression, I will immediately hear from a whole bunch of community what I broke.There's no certification process here. There is no industry consortium to control the standard, but then there is an accepted standard. Like, if the application works, they need works. And one way to get it right is, like, Amazon SDKs, all of those language SDKs, to be cleaner, simpler, but applications can even use MinIO SDK to talk to Amazon and Amazon SDK to talk to MinIO. Now, there is a clear, cooperative model.And I actually have tremendous respect for Amazon engineers. They have only been kind and meaningful, like, reasonable partnership. Like, if our community reports a bug that Amazon rolled out a new update in one of the region and the S3 API broke, they will actually go fix it. They will never argue, “Why are you using MinIO SDK?” Their engineers, they do everything by reason. That's the reason why they gained credibility.Corey: I think, on some level, that we can trust that the API is not going to meaningfully shift, just because so much has been built on top of it over the last 15, almost 16 years now that even slight changes require massive coordination. I remember there was a little bit of a kerfuffle when they announced that they were going to be disabling the BitTorrent endpoint in S3 and it was no longer going to be supported in new regions, and eventually they were turning it off. There were still people pushing back on that. I'm still annoyed by some of the documentation around the API that says that it may not return a legitimate error code when it errors with certain XML interpretations. It's… it's kind of become very much its own thing.AB: [unintelligible 00:06:22] a problem, like, we have seen, like, even stupid errors similar to that, right? 
Like, HTTP headers are supposed to be case insensitive, but then there are some language SDKs will send us in certain type of casing and they expect the case to be—the response to be same way. And that's not HTTP standard. If we have to accept that bug and respond in the same way, then we are asking a whole bunch of community to go fix that application. And Amazon's problem are our problems too. We have to carry that baggage.But some places where we actually take a hard stance is, like, Amazon introduced that initially, the bucket policies, like access control list, then finally came IAM, then we actually, for us, like, the best way to teach the community is make best practices the standard. The only way to do it. We have been, like, educating them that we actually implemented ACLs, but we removed it. So, the customers will no longer use it. The scale at which we are growing, if I keep it, then I can never force them to remove.So, we have been pedantic about, like, how, like, certain things that if it's a good advice, force them to do it. That approach has paid off, but the problem is still quite real. Amazon also admits that S3 API is no longer simple, but at least it's not like POSIX, right? POSIX is a rich set of API, but doesn't do useful things that we need to do. So, Amazon's APIs are built on top of simple primitive foundations that got the storage architecture correct, and then doing sophisticated functionalities on top of the simple primitives, these atomic RESTful APIs, you can finally do it right and you can take it to great lengths and still not break the storage system.So, I'm not so concerned. I think it's time for both of us to slow down and then make sure that the ease of operation and adoption is the goal, then trying to create an API Bible.Corey: Well, one differentiation that you have that frankly I wish S3 would wind up implementing is this idea of bucket quotas. 
I would give a lot in certain circumstances to be able to say that this S3 bucket should be able to hold five gigabytes of storage and no more. Like, you could fix a lot of free tier problems, for example, by doing something like that. But there's also the problem that you'll see in data centers where, okay, we've now filled up whatever storage system we're using. We need to either expand it at significant cost and it's going to take a while or it's time to go and maybe delete some of the stuff we don't necessarily need to keep in perpetuity.There is no moment of reckoning in traditional S3 in that sense because, oh, you can just always add one more gigabyte at 2.3 or however many cents it happens to be, and you wind up with an unbounded growth problem that you're never really forced to wrestle with. Because it's infinite storage. They can add drives faster than you can fill them in most cases. So, it's it just feels like there's an economic story, if nothing else, just from a governance control and make sure this doesn't run away from me, and alert me before we get into the multi-petabyte style of storage for my Hello World WordPress website.AB: Mm-hm. Yeah, so I always thought that Amazon did not do this—it's not just Amazon, the cloud players, right—they did not do this because they want—is good for their business; they want all the customers' data, like unrestricted growth of data. Certainly it is beneficial for their business, but there is an operational challenge. When you set quota—this is why we grudgingly introduced this feature. We did not have quotas and we didn't want to because Amazon S3 API doesn't talk about quota, but the enterprise community wanted this so badly.And eventually we [unintelligible 00:09:54] it and we gave. But there is one issue to be aware of, right? 
The problem with quota is that you as an object storage administrator, you set a quota, let's say this bucket, this application, I don't see more than 20TB; I'm going to set 100TB quota. And then you forget it. And then you think in six months, they will reach 20TB. The reality is, in six months they reach 100TB.

And then when nobody expected—everybody has forgotten that there was a quota in a certain place—suddenly applications start failing. And when it fails, it doesn't—even though the S3 API responds back saying that insufficient space, but then the application doesn't really pass that error all the way up. When applications fail, they fail in unpredictable ways. By the time the application developer realizes that it's actually object storage ran out of space, the lost time and it's a downtime. So, as long as they have proper observability—because I mean, I've will also asked observability, that it can alert you that you are only going to run out of space soon. If you have those systems in place, then go for quota. If not, I would agree with the S3 API standard that is not about cost. It's about operational, unexpected accidents.

Corey: Yeah, on some level, we wound up having to deal with the exact same problem with disk volumes, where my default for most things was, at 70%, I want to start getting pings on it and at 90%, I want to be woken up for it. So, for small volumes, you wind up with a runaway log or whatnot, you have a chance to catch it and whatnot, and for the giant multi-petabyte things, okay, well, why would you alert at 70% on that? Well, because procurement takes a while when we're talking about buying that much disk for that much money. It was a roughly good baseline for these things. The problem, of course, is when you have none of that, and well it got full so oops-a-doozy.

On some level, I wonder if there's a story around soft quotas that just scream at you, but let you keep adding to it. 
But that turns into implementation details, and you can build something like that on top of any existing object store if you don't need the hard limit aspect.

AB: Actually, that is the right way to do. That's what I would recommend customers to do. Even though there is hard quota, I will tell, don't use it, but use soft quota. And the soft quota, instead of even soft quota, you monitor them. On the cloud, at least you have some kind of restriction that the more you use, the more you pay; eventually the month end bills, it shows up.

On MinIO, when it's deployed on these large data centers, that it's unrestricted access, quickly you can use a lot of space, no one knows what data to delete, and no one will tell you what data to delete. The way to do this is there has to be some kind of accountability. The way to do it is—actually [unintelligible 00:12:27] have some chargeback mechanism based on the bucket growth. And the business units have to pay for it, right? That IT doesn't run for free, right? IT has to have a budget and it has to be sponsored by the applications team.

And you measure, instead of setting a hard limit, you actually charge them that based on the usage of your bucket, you're going to pay for it. And this is a observability problem. And you can call it soft quotas, but it hasn't been to trigger an alert in observability. It's observability problem. But it actually is interesting to hear that as soft quotas, which makes a lot of sense.

Corey: It's one of those problems that I think people only figure out after they've experienced it once. And then they look like wizards from the future who, “Oh, yeah, you're going to run into a quota storage problem.” Yeah, we all find that out because the first time we smack into something and live to regret it. Now, we can talk a lot about the nuances and implementation and low level detail of this stuff, but let's zoom out of it. What are you folks up to these days? 
What is the bigger picture that you're seeing of object storage and the ecosystem?

AB: Yeah. So, when we started, right, our idea was that world is going to produce incredible amount of data. In ten years from now, we are going to drown in data. We've been saying that today and it will be true. Every year, you say ten years from now and it will still be valid, right?

That was the reason for us to play this game. And we saw that every one of these cloud players were incompatible with each other. It's like early Unix days, right? Like a bunch of operating systems, everything was incompatible and applications were beginning to adopt this new standard, but they were stuck. And then the cloud storage players, whatever they had, like, GCS can only run inside Google Cloud, S3 can only run inside AWS, and the cloud player's game was bring all the world's data into the cloud.

And that actually requires enormous amount of bandwidth. And moving data into the cloud at that scale, if you look at the amount of data the world is producing, if the data is produced inside the cloud, it's a different game, but the data is produced everywhere else. MinIO's idea was that instead of introducing yet another API standard, Amazon got the architecture right and that's the right way to build large-scale infrastructure. If we stick to Amazon S3 API instead of introducing yet another standard, [unintelligible 00:14:40] API, and then go after the world's data. When we started in 2014 November—it's really 2015, we started, it was laughable. People thought that there won't be a need for MinIO because the whole world will basically go to AWS S3 and they will be the world's data store. Amazon is capable of doing that; the race is not over, right?

Corey: And it still couldn't be done now. The thing is that they would need to fundamentally rethink their, frankly, usurious data egress charges. 
The problem is not that it's expensive to store data in AWS; it's that it's expensive to store data and then move it anywhere else for analysis or use on something else. So, there are entire classes of workload that people should not consider the big three cloud providers as the place where that data should live because you're never getting it back.

AB: Spot on, right? Even if network is free, right, Amazon makes, like, okay, zero egress-ingress charge, the data we're talking about, like, most of MinIO deployments, they start at petabytes. Like, one to ten petabyte, feels like 100 terabyte. For even if network is free, try moving a ten-petabyte infrastructure into the cloud. How are you going to move it?

Even with FedEx and UPS giving you a lot of bandwidth in their trucks, it is not possible, right? I think the data will continue to be produced everywhere else. So, our bet was there we will be [unintelligible 00:15:56]—instead of you moving the data, you can run MinIO where there is data, and then the whole world will look like AWS's S3 compatible object store. We took a very different path. But now, when I say the same story that when what we started with day one, it is no longer laughable, right?

People believe that yes, MinIO is there because our market footprint is now larger than Amazon S3. And as it goes to production, customers are now realizing it's basically growing inside a shadow IT and eventually businesses realize the bulk of their business-critical data is sitting on MinIO and that's how it's surfacing up. So now, what we are seeing, this year particularly, all of these customers are hugely concerned about cost optimization. And as part of the journey, there is also multi-cloud and hybrid-cloud initiatives. They want to make sure that their application can run on any cloud or on the same software can run on their colos like Equinix, or like bunch of, like, Digital Realty, anywhere.

And MinIO's software, this is what we set out to do. 
MinIO can run anywhere inside the cloud, all the way to the edge, even on Raspberry Pi. It's now—whatever we started with is now has become reality; the timing is perfect for us.Corey: One of the challenges I've always had with the idea of building an application with the idea to run it anywhere is you can make explicit technology choices around that, and for example, object store is a great example because most places you go now will or can have an object store available for your use. But there seem to be implementation details that get lost. And for example, even load balancers wind up being implemented in different ways with different scaling times and whatnot in various environments. And past a certain point, it's okay, we're just going to have to run it ourselves on top of HAproxy or Nginx, or something like it, running in containers themselves; you're reinventing the wheel. Where is that boundary between, we're going to build this in a way that we can run anywhere and the reality that I keep running into, which is we tried to do that but we implicitly without realizing it built in a lot of assumptions that everything would look just like this environment that we started off in.AB: The good part is that if you look at the S3 API, every request has the site name, the endpoint, bucket name, the path, and the object name. Every request is completely self-contained. It's literally a HTTP call away. And this means that whether your application is running on Android, iOS, inside a browser, JavaScript engine, anywhere across the world, they don't really care whether the bucket is served from EU or us-east or us-west. It doesn't matter at all, so it actually allows you by API, you can build a globally unified data infrastructure, some buckets here, some buckets there.That's actually not the problem. The problem comes when you have multiple clouds. 
Different teams, like, part M&A, the part—like they—even if you don't do M&A, different teams, no two data engineer will would agree on the same software stack. Then where they will all end up with different cloud players and some is still running on old legacy environment.When you combine them, the problem is, like, let's take just the cloud, right? How do I even apply a policy, that access control policy, how do I establish unified identity? Because I want to know this application is the only one who is allowed to access this bucket. Can I have that same policy on Google Cloud or Azure, even though they are different teams? Like if that employer, that project, or that admin, if he or she leaves the job, how do I make sure that that's all protected?You want unified identity, you want unified access control policies. Where are the encryption key store? And then the load balancer itself, the load, its—load balancer is not the problem. But then unless you adopt S3 API as your standard, the definition of what a bucket is different from Microsoft to Google to Amazon.Corey: Yeah, the idea of an of the PUTS and retrieving of actual data is one thing, but then you have how do you manage it the control plane layer of the object store and how do you rationalize that? What are the naming conventions? How do you address it? I even ran into something similar somewhat recently when I was doing an experiment with one of the Amazon Snowball edge devices to move some data into S3 on a lark. And the thing shows up and presents itself on the local network as an S3 endpoint, but none of their tooling can accept a different endpoint built into the configuration files; you have to explicitly use it as an environment variable or as a parameter on every invocation of something that talks to it, which is incredibly annoying.I would give a lot for just to be able to say, oh, when you're talking in this profile, that's always going to be your S3 endpoint. Go. But no, of course not. 
Because that would make it easier to use something that wasn't them, so why would they ever be incentivized to bake that in?

AB: Yeah. Snowball is an important element to move data, right? That's the UPS and FedEx way of moving data, but what I find customers doing is they actually use the tools that we built for MinIO because the Snowball appliance also looks like S3 API-compatible object store. And in fact, like, I've been told that, like, when you want to ship multiple Snowball appliances, they actually put MinIO to make it look like one unit because MinIO can erasure code objects across multiple Snowball appliances. And the MC tool, unlike AWS CLI, which is really meant for developers, like low-level calls, MC gives you unique [scoring 00:21:08] tools, like ls, cp, rsync-like tools, and it's easy to move and copy and migrate data. Actually, that's how people deal with it.

Corey: Oh, God. I hadn't even considered the problem of having a fleet of Snowball edges here that you're trying to do a mass data migration on, which is basically how you move petabyte-scale data, is a whole bunch of parallelism. But having to figure that out on a case-by-case basis would be nightmarish. That's right, there is no good way to wind up doing that natively.

AB: Yeah. In fact, Western Digital and a few other players, too, now the Western Digital created a Snowball-like appliance and they put MinIO on it. And they are actually working with some system integrators to help customers move lots of data. But Snowball-like functionality is important and more and more customers who need it.

Corey: This episode is sponsored in part by Honeycomb. I'm not going to dance around the problem. Your. Engineers. Are. Burned. Out. They're tired from pagers waking them up at 2 am for something that could have waited until after their morning coffee. Ring Ring, Who's There? It's Nagios, the original call of duty! 
They're fed up with relying on two or three different “monitoring tools” that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there's a better way. Observability tools like Honeycomb (and very little else because they do admittedly set the bar) show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It's great for your business, great for your engineers, and, most importantly, great for your customers. Try FREE today at honeycomb.io/screaminginthecloud. That's honeycomb.io/screaminginthecloud.Corey: Increasingly, it felt like, back in the on-prem days, that you'd have a file server somewhere that was either a SAN or it was going to be a NAS. The question was only whether it presented it to various things as a volume or as a file share. And then in cloud, the default storage mechanism, unquestionably, was object store. And now we're starting to see it come back again. So, it started to increasingly feel, in a lot of ways, like Cloud is no longer so much a place that is somewhere else, but instead much more of an operating model for how you wind up addressing things.I'm wondering when the generation of prosumer networking equipment, for example, is going to say, “Oh, and send these logs over to what object store?” Because right now, it's still write a file and SFTP it somewhere else, at least the good ones; some of the crap ones still want old unencrypted FTP, which is neither here nor there. But I feel like it's coming back around again. Like, when do even home users wind up instead of where do you save this file to having the cloud abstraction, which hopefully, you'll never have to deal with an S3-style endpoint, but that can underpin an awful lot of things. It feels like it's coming back and that's cloud is the de facto way of thinking about things. Is that what you're seeing? 
Does that align with your belief on this?AB: I actually, fundamentally believe in the long run, right, applications will go SaaS, right? Like, if you remember the days that you used to install QuickBooks and ACT and stuff, like, on your data center, you used to run your own Exchange servers, like, those days are gone. I think these applications will become SaaS. But then the infrastructure building blocks for these SaaS, whether they are cloud or their own colo, I think that in the long run, it will be multi-cloud and colo all combined and all of them will look alike.But what I find from the customer's journey, the Old World and the New World is incompatible. When they shifted from bare metal to virtualization, they didn't have to rewrite their application. But this time, you have—it as a tectonic shift. Every single application, you have to rewrite. If you retrofit your application into the cloud, bad idea, right? It's going to cost you more and I would rather not do it.Even though cloud players are trying to make, like, the file and block, like, file system services [unintelligible 00:24:01] and stuff, they make it available ten times more expensive than object, but it's just to [integrate 00:24:07] some legacy applications, but it's still a bad idea to just move legacy applications there. But what I'm finding is that the cost, if you still run your infrastructure with enterprise IT mindset, you're out of luck. It's going to be super expensive and you're going to be left out modern infrastructure, because of the scale, it has to be treated as code. You have to run infrastructure with software engineers. And this cultural shift has to happen.And that's why cloud, in the long run, everyone will look like AWS and we always said that and it's now being becoming true. Like, Kubernetes and MinIO basically is leveling the ground everywhere. It's giving ECS and S3-like infrastructure inside AWS or outside AWS, everywhere. 
But what I find the challenging part is the cultural mindset. If they still have the old cultural mindset and if they want to adopt cloud, it's not going to work.You have to change the DNA, the culture, the mindset, everything. The best way to do it is go to the cloud-first. Adopt it, modernize your application, learn how to run and manage infrastructure, then ask economics question, the unit economics. Then you will find the answers yourself.Corey: On some level, that is the path forward. I feel like there's just a very long tail of systems that have been working and have been meeting the business objective. And well, we should go and refactor this because, I don't know, a couple of folks on a podcast said we should isn't the most compelling business case for doing a lot of it. It feels like these things sort of sit there until there is more upside than just cost-cutting to changing the way these things are built and run. That's the reason that people have been talking about getting off of mainframe since the '90s in some companies, and the mainframe is very much still there. It is so ingrained in the way that they do business, they have to rethink a lot of the architectural things that have sprung up around it.I'm not trying to shame anyone for the [laugh] state that their environment is in. I've never yet met a company that was super proud of its internal infrastructure. Everyone's always apologizing because it's a fire. But they think someone else has figured this out somewhere and it all runs perfectly. I don't think it exists.AB: What I am finding is that if you are running it the enterprise IT style, you are the one telling the application developers, here you go, you have this many VMs and then you have, like, a VMware license and, like, Jboss, like WebLogic, and like a SQL Server license, now you go build your application, you won't be able to do it. 
Because application developers talk about Kafka and Redis and like Kubernetes, they don't speak the same language. And that's when these developers go to the cloud and then finish their application, take it live from zero lines of code before it can procure infrastructure and provision it to these guys. The change that has to happen is how can you give what the developers want now that reverse journey is also starting. In the long run, everything will look alike, but what I'm finding is if you're running enterprise IT infrastructure, traditional infrastructure, they are ashamed of talking about it.

But then you go to the cloud and then at scale, some parts of it, you want to move for—now you really know why you want to move. For economic reasons, like, particularly the data-intensive workloads becomes very expensive. And at that part, they go to a colo, but leave the applications on the cloud. So, it's the multi-cloud model, I think, is inevitable. The expensive pieces that where you can—if you are looking at yourself as hyperscaler and if your data is growing, if your business focus is data-centric business, parts of the data and data analytics, ML workloads will actually go out, if you're looking at unit economics. If all you are focused on productivity, stick to the cloud and you're still better off.

Corey: I think that's a divide that gets lost sometimes. When people say, “Oh, we're going to move to the cloud to save money.” It's, “No you're not.” At a five-year time horizon, I would be astonished if that juice were worth the squeeze in almost any scenario. The reason you go there is for a capability story when it's right for you.

That also means that steady-state workloads that are well understood can often be run more economically in a place that is not the cloud. Everyone thinks for some reason that I tend to be it's cloud or it's trash. 
No, I'm a big fan of doing things that are sensible and cloud is not the right answer for every workload under the sun. Conversely, when someone says, “Oh, I'm building a new e-commerce store,” or whatnot, “And I've decided cloud is not for me.” It's, “Ehh, you sure about that?”

That sounds like you are smack-dab in the middle of the cloud use case. But all these things wind up acting as constraints and strategic objectives. And technology and single-vendor answers are rarely going to be a panacea the way that their sales teams say that they will.

AB: Yeah. And I find, like, organizations that have SREs, DevOps, and software engineers running the infrastructure, they actually are ready to go multi-cloud or go to colo because they have the—exactly know. They have the containers and Kubernetes microservices expertise. If you are still on a traditional SAN, NAS, and VM architecture, go to cloud, rewrite your application.

Corey: I think there's a misunderstanding in the ecosystem around what cloud repatriation actually looks like. Everyone claims it doesn't exist because there's basically no companies out there worth mentioning that are, “Yep, we've decided the cloud is terrible, we're taking everything out and we are going to data centers. The end.” In practice, it's individual workloads that do not make sense in the cloud. Sometimes just the back-of-the-envelope analysis means it's not going to work out, other times during proof of concepts, and other times, as things have hit a certain point of scale, wherein an individual workload being pulled back makes an awful lot of sense. But everything else is probably going to stay in the cloud and these companies don't want to wind up antagonizing the cloud providers by talking about it in public. But that model is very real.

AB: Absolutely. 
Actually, what we are finding with the application side, like, parts of their overall ecosystem, right, within the company, they run on the cloud, but the data side, some of the examples, like, these are in the range of 100 to 500 petabytes. The 500-petabyte customer actually started at 500 petabytes and their plan is to go at exascale. And they are actually doing repatriation because for them, their customers, it's consumer-facing and it's extremely price sensitive, but when you're a consumer-facing, every dollar you spend counts. And if you don't do it at scale, it matters a lot, right? It will kill the business.Particularly last two years, the cost part became an important element in their infrastructure, they knew exactly what they want. They are thinking of themselves as hyperscalers. They get commodity—the same hardware, right, just a server with a bunch of [unintelligible 00:30:35] and network and put it on colo or even lease these boxes, they know what their demand is. Even at ten petabytes, the economics starts impacting. If you're processing it, the data side, we have several customers now moving to colo from cloud and this is the range we are talking about.They don't talk about it publicly because sometimes, like, you don't want to be anti-cloud, but I think for them, they're also not anti-cloud. They don't want to leave the cloud. The completely leaving the cloud, it's a different story. That's not the case. Applications stay there. Data lakes, data infrastructure, object store, particularly if it goes to a colo.Now, your applications from all the clouds can access this centralized—centralized, meaning that one object store you run on colo and the colos themselves have worldwide data centers. So, you can keep the data infrastructure in a colo, but applications can run on any cloud, some of them, surprisingly, that they have global customer base. And not all of them are cloud. 
Sometimes like some applications itself, if you ask what type of edge devices they are running, edge data centers, they said, it's a mix of everything. What really matters is not the infrastructure. Infrastructure in the end is CPU, network, and drive. It's a commodity. It's really the software stack, you want to make sure that it's containerized and easy to deploy, roll out updates, you have to learn the Facebook-Google style running SaaS business. That change is coming.

Corey: It's a matter of time and it's a matter of inevitability. Now, nothing ever stays the same. Everything always inherently changes in the full sweep of things, but I'm pretty happy with where I see the industry going these days. I want to start seeing a little bit less centralization around one or two big companies, but I am confident that we're starting to see an awareness of doing these things for the right reason more broadly permeating.

AB: Right. Like, the competition is always great for customers. They get to benefit from it. So, the decentralization is a path to bringing—like, commoditizing the infrastructure. I think the bigger picture for me, what I'm particularly happy is, for a long time we carried industry baggage in the infrastructure space.

If no one wants to change, no one wants to rewrite application. As part of the equation, we carried the, like, POSIX baggage, like SAN and NAS. You can't even do [unintelligible 00:32:48] as a Service, NFS as a Service. It's too much of a baggage. All of that is getting thrown out. Like, the cloud players helped the customers start with a clean slate. I think to me, that's the biggest advantage. And that now we have a clean slate, we can now go on a whole new evolution of the stack, keeping it simpler and everyone can benefit from this change.

Corey: Before we wind up calling this an episode, I do have one last question for you. 
As I mentioned at the start, you're very much open-source, as in legitimate open-source, which means that anyone who wants to can grab an implementation and start running it. How do you, I guess make peace with the fact that the majority of your user base is not paying you? And I guess how do you get people to decide, “You know what? We like the cut of his jib. Let's give him some money.”

AB: Mm-hm. Yeah, if I looked at it that way, right, I have both the [unintelligible 00:33:38], right, on the open-source side as well as the business. But I don't see them to be conflicting. If I run as a charity, right, like, I take donation. If you love the product, here is the donation box, then that doesn't work at all, right?

I shouldn't take investor money and I shouldn't have a team because I have a job to pay their bills, too. But I actually find open-source to be incredibly beneficial. For me, it's about delivering value to the customer. If you pay me $5, I ought to make you feel $50 worth of value. The same software you would buy from a proprietary vendor, why would—if I'm a customer, same software equal in functionality, if its proprietary, I would actually prefer open-source and pay even more.

But why are, really, customers paying me now and what's our view on open-source? I'm actually the free software guy. Free software and open-source are actually not exactly equal, right? We are the purists of the open-source community and we have strong views on what open-source means, right. That's why we call it free software. And free here means freedom, right? Free does not mean gratis, that free of cost. It's actually about freedom and I deeply care about it.

For me it's a philosophy and it's a way of life. That's why I don't believe in open core and other models that holding—giving crippleware is not open-source, right? I give you some freedom but not all, right, like, it breaks the spirit. 
So, MinIO is a hundred percent open-source, but it's open-source for the open-source community. We did not take some community-developed code and then added commercial support on top.We built the product, we believed in open-source, we still believe and we will always believe. Because of that, we open-sourced our work. And it's open-source for the open-source community. And as you build applications that—like the AGPL license on the derivative works, they have to be compatible with AGPL because we are the creator. If you cannot open-source, you open-source your application derivative works, you can buy a commercial license from us. We are the creator, we can give you a dual license. That's how the business model works.That way, the open-source community completely benefits. And it's about the software freedom. There are customers, for them, open-source is good thing and they want to pay because it's open-source. There are some customers that they want to pay because they can't open-source their application and derivative works, so they pay. It's a happy medium; that way I actually find open-source to be incredibly beneficial.Open-source gave us that trust, like, more than adoption rate. It's not like free to download and use. More than that, the customers that matter, the community that matters because they can see the code and they can see everything we did, it's not because I said so, marketing and sales, you believe them, whatever they say. You download the product, experience it and fall in love with it, and then when it becomes an important part of your business, that's when they engage with us because they talk about license compatibility and data loss or a data breach, all that becomes important. Open-source isn't—I don't see that to be conflicting for business. It actually is incredibly helpful. And customers see that value in the end.Corey: I really want to thank you for being so generous with your time. 
If people want to learn more, where should they go?AB: I was on Twitter and now I think I'm spending more time on, maybe, LinkedIn. I think if they—they can send me a request and then we can chat. And I'm always, like, spending time with other entrepreneurs, architects, and engineers, sharing what I learned, what I know, and learning from them. There is also a [community open channel 00:37:04]. And just send me a mail at ab@min.io and I'm always interested in talking to our user base.Corey: And we will, of course, put links to that in the [show notes 00:37:12]. Thank you so much for your time. I appreciate it.AB: It's wonderful to be here.Corey: AB Periasamy, CEO and co-founder of MinIO. I'm Cloud Economist Corey Quinn and this has been a promoted guest episode of Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice that presumably will also include an angry, loud comment that we can access from anywhere because of shared APIs.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
Combining Community and Company Employees with Matty Stratton

Mar 16, 2023 · 40:08


Matty Stratton, Director of Developer Relations at Aiven, joins Corey on Screaming in the Cloud for a friendly debate on whether or not company employees can still be considered community members. Corey says no, but opens up his position to the slings and arrows of Matty in an entertaining change of pace. Matty explains why he feels company employees can still be considered community members, and also explores how that should be done in a way that is transparent and helpful to everyone in the community. Matty and Corey also explore the benefits and drawbacks of talented community members becoming employees.About MattyMatty Stratton is the Director of Developer Relations at Aiven, a well-known member of the DevOps community, founder and co-host of the popular Arrested DevOps podcast, and a global organizer of the DevOpsDays set of conferences.Matty has over 20 years of experience in IT operations and is a sought-after speaker internationally, presenting at Agile, DevOps, and cloud engineering focused events worldwide. Demonstrating his keen insight into the changing landscape of technology, he recently changed his license plate from DEVOPS to KUBECTL.He lives in Chicago and has three awesome kids, whom he loves just a little bit more than he loves Diet Coke. Links Referenced: Aiven: https://aiven.io/ Twitter: https://twitter.com/mattstratton Mastodon: hackyderm.io/@mattstratton LinkedIn: https://www.linkedin.com/in/mattstratton/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. 
This is Screaming in the Cloud.
Corey: This episode is brought to us in part by our friends at Min.io. With more than 1.1 billion Docker pulls (most of which were not due to an unfortunate loop mistake, like the kind I like to make) and more than 37 thousand GitHub stars (which are admittedly harder to get wrong), MinIO has become the industry standard alternative to S3. It runs everywhere: public clouds, private clouds, Kubernetes distributions, bare metal, Raspberry Pis, colocations, even in AWS Local Zones. The reason people like it comes down to its simplicity, scalability, enterprise features, and best-in-class throughput. Software-defined and capable of running on almost any hardware you can imagine and some you probably can't, MinIO can handle everything you can throw at it (and AWS has imagined a lot of things), from data lakes to databases. Don't take their word for it though; check it out at www.min.io and see for yourself. That's www.min.io.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I am joined today by returning guest, my friend and yours, Matty Stratton, Director of Developer Relations at Aiven. Matty, it's been a hot second. How are you?
Matty: It has been a while, but been pretty good. We have to come back to something that just occurred to me when we think about the different things we've talked about. There was a point of contention about prior art of the Corey Quinn face and photos. I don't know if you saw that discourse; we may have to have a conversation. There may be some absent—
Corey: I did not see—
Matty: Okay.
Corey: —discourse, but I also would accept freely that I am not the first person to ever come up with the idea of opening my mouth and looking ridiculous for a photograph either.
Matty: That's fair, but the thing that I think was funny—and if you don't mind, I'll just go ahead and throw this out here—is that I didn't put two and two together. 
So, I posted a picture on Twitter a week or so ago that was primarily to show off the fact—it was a picture of me in 1993, and the point was that my jeans were French-rolled and were pegged. But in the photo, I am doing kind of the Corey Quinn face and so people said, “Oh, is this prior art?” And I said—you know what? I actually just remembered and I've never thought about this before, but one of my friends in high school, for his senior year ID he took a picture—his picture looks like, you know, that kind of, you know, three-quarters turn with the mouth opening going, “Ah,” you know?And he loved that picture—number one, he loved that picture so much that this guy carried his senior year high school ID in his wallet until we were like 25 because it was his favorite picture of himself. But every photo—and I saw this from looking through my yearbook of my friend Jay when we are seniors, he's doing the Corey Quinn face. And he is anecdotally part of the DevOps community, now a little bit too, and I haven't pointed this out to him. But people were saying that, you know, mine was prior art on yours, I said, “Actually, I was emulating yet someone else.”Corey: I will tell you the actual story of how it started. It was at re:Invent, I want to say 2018 or so, and what happened was is someone, they were a big fan of the newsletter—sort of the start of re:Invent—they said, “Hey, can I get a selfie with you?” And I figured, sure, why not. And the problem I had is I've always looked bad in photographs. And okay, great, so if I'm going to have a photo taken of me, that's going to be ridiculous, why not as a lark, go ahead and do this for fun during the course of re:Invent this year?So, whenever I did that I just slapped—if someone asked for a selfie—I'd slap the big happy open mouth smile on my face. And people thought, “Oh, my God, this is amazing.” And I don't know that it was necessarily worth that level of enthusiasm, but okay. I'll take it. 
I'm not here to tell people they're wrong when they enjoy a joke that I'm putting out there.And it just sort of stuck. And I think the peak of it that I don't think I'm ever going to be able to beat is I actually managed to pull that expression on my driver's license.Matty: Wow.Corey: Yeah.Matty: That's—Corey: They don't have a sense of humor that they are aware of at the DMV.Matty: No, they really don't. And having been to the San Francisco DMV and knowing how long it takes to get in there, like, that was a bit of a risk on your part because if they decided to change their mind, you wouldn't be able to come back for another four months [laugh].Corey: It amused me to do it, so why not? What else was I going to do? I brought my iPad with me, it has cellular on it, so I just can work remotely from there. It was either that or working in my home office again, and frankly, at the height of the pandemic, I could use the break.Matty: Yes [laugh]. That's saying something when the break you can use is going to the DMV.Corey: Right.Matty: That's a little bit where we were, where we at. I think just real quick thinking about that because there's a lot to be said with that kind of idea of making a—whether it's silly or not, but having a common, especially if you do a lot of photos, do a lot of things, you don't have to think about, like, how do I look? I mean, you have to think about—you know, you can just say I just know what I do. Because if you think about it, it's about cultivating your smile, cultivating your look for your photos, and just sort of having a way so you don't—you just know what to do every time. I guess that's a, you know, maybe a model tip or something. I don't know. But you might be onto something.Corey: I joke that my entire family motto is never be the most uncomfortable person in the room. And there's something to be said for it where if you're going to present a certain way, make it your own. Find a way to at least stand out. 
If nothing else, it's a bit different. Most people don't do that.Remember, we've all got made fun of, generally women—for some reason—back about 15 years ago or so for duck face, where in all the pictures you're making duck face. And well, there are reasons why that is a flattering way to present your face. But if there's one thing we love as a society, it's telling women they're doing something wrong.Matty: Yeah.Corey: So yeah, there's a whole bunch of ways you're supposed to take selfies or whatnot. Honestly, I'm in no way shape or form pretty enough or young enough to care about any of them. At this point, it's what I do when someone busts out a camera and that's the end of it. Now, am I the only person to do this? Absolutely not. Do I take ownership of it? No. Someone else wants to do it, they need give no credit. The idea probably didn't come from me.Matty: And to be fair, if I'm little bit taking the mickey there or whatever about prior art, it was more than I thought it was funny because I had not even—it was this thing where it was like, this is a good friend of mine, probably some of that I've been friends with longer than anyone in my whole life, and it was a core part [laugh] of his personality when we were 18 and 19, and it just d—I just never direct—like, made that connection. And then it happened to me and went “Oh, my God. Jason and Corey did the same thing.” [laugh]. It was—Corey: No, it feels like parallel evolution.Matty: Yeah, yeah. It was more of me never having connected those dots. And again, you're making that face for your DMV photo amused you, me talking about this for the last three minutes on a podcast amused me. So.Corey: And let's also be realistic here. How many ways are there to hold your face during a selfie that is distinguishable and worthy of comment? Usually, it's like okay, well, he has this weird sardonic half-smile with an eyebrow ar—no. His mouth was wide open. 
We're gonna go with that.
Matty: You know, there's a little—I want to kind of—because I think there's actually quite a bit to the lesson from any of this because I think about—follow me here; maybe I'll get to the right place—like me and karaoke. No one would ever accuse me of being a talented singer, right? I'm not going to sing well in a way where people are going to be moved by my talent. So instead, I have to go a different direction. I have to go funny.
But what it boils down to is I can only do—I do karaoke well when it's a song where I can feel like I'm doing an impression of the singer. So, for example, the B-52s. I do a very good impression of Fred Schneider. So, I can sing a B-52s song all day long. I actually could do better with Pearl Jam than I should be able to with my terrible voice because I'm doing an Eddie Vedder impression.
So, what I'm getting at is you're sort of taking this thing where you're saying, okay, to your point, you said, “Hey,”—and your words, not mine—[where 00:07:09] somebody say, “The picture is not going to be of me looking like blue steel runway model, so I might as well look goofy.” You know? And take it that way and be funny with it. And also, every time, it's the same way, so I think it's a matter of kind of owning the conversation, you know, and saying, how do you accentuate the thing that you can do. I don't know. There's something about DevOps, somehow in there.
Corey: So, I am in that uncomfortable place right now between having finalized a blog post slash podcast that's going out in two days from this recording. So, it will go out before you and I have this discussion publicly, but it's also too late for me to change any of it, so I figured I will open myself up to the slings and arrows of you, more or less. And you haven't read this thing yet, which is even better, so you're now going to be angry about an imperfect representation of what I said in writing. 
But the short version is this: if you work for a company as their employee, then you are no longer a part of that company's community, as it were. And yes, that's nuanced and it's an overbroad statement and there are a bunch of ways that you could poke holes in it, but I'm curious to get your take on the overall positioning of it.Matty: So, at face value, I would vehemently disagree with that statement. And by that is, that I have spent years of my life tilting at the opposite windmill, which is just because you work at this company, doesn't mean you do not participate in the community and should not consider yourself a part of the community, first and foremost. That will, again, like everything else, it depends. It depends on a lot of things and I hope we can kind of explore that a little bit because just as much as I would take umbrage if you will, or whatnot, with the statement that if you work at the company, you stop being part of the community, I would also have an issue with, you're just automatically part of the community, right? Because these things take effort.And I feel like I've been as a devreloper, or whatever, Corey—how do you say it?Corey: Yep. No, you're right on. Devreloper.Matty: As a—or I would say, as a DevRel, although people on Twitter are angry about using the word DevRel to discuss—like saying, “I'm a DevRel.” “DevRel is a department.” It's a DevOps engineer thing again, except actually—it's, like, actually wrong. But anyway, you kind of run into this, like for example—I'm going to not name names here—but, like, to say, you know, Twitter for Pets, the—what do you—by the way, Corey, what are you going to do now for your made-up company when what Twitter is not fun for this anymore? You can't have Twitter for Pets anymore.Corey: I know I'm going to have to come up with a new joke. I don't quite know what to do with myself.Matty: This is really hard. 
While we will pretend Twitter for Pets is still around a little bit, even though its API is getting shut down.Corey: Exactly.Matty: So okay, so we're over here at Twitter for Pets, Inc. And we've got our—Corey: Twitter for Bees, because you know it'll at least have an APIary.Matty: Yeah. Ha. We have our team of devrelopers and community managers and stuff and community engineers that work at Twitter for Pets, and we have all of our software engineers and different people. And a lot of times the assumption—and now we're going to have Twitter for Pets community something, right? We have our community, we have our area, our place that we interact, whether it's in person, it's virtual, whether it's an event, whether it's our Discord or Discourse or Slack or whatever [doodlee 00:10:33] thing we're doing these days, and a lot of times, all those engineers and people whose title does not have the word ‘community' on it are like, “Oh, good. Well, we have people that do that.”So, number one, no because now we have people whose priority is it; like, we have more intentionality. So, if I work on the community team, if I'm a dev advocate or something like that, my priority is communicating and advocating to and for that community. But it's like a little bit of the, you know, the office space, I take the requirements from the [unintelligible 00:11:07] to people, you I give them to the engineers. I've got people—so like, you shouldn't have to have a go-between, right? 
And there's actually quite a bit of place.So, I think, this sort of assumption that you're not part of it and you have no responsibility towards that community, first of all, you're missing a lot as a person because that's just how you end up with people building a thing they don't understand.Corey: Oh, I think you have tremendous responsibility to the community, but whether you're a part of it and having responsibility to it or not aligned in my mind.Matty: So… maybe let's take a second and what do you mean by being a part of it?Corey: Right. Where very often I'll see a certain, I don't know, very large cloud provider will have an open-source project. Great, so you go and look at the open-source project and the only people with commit access are people who work at that company. That is an easy-to-make-fun-of example of this. Another is when the people who are in a community and talking about how they perceive things and putting out content about how they've interacted with various aspects of it start to work there, you see areas where it starts to call its authenticity into question.AWS is another great example of this. As someone in the community, I can talk about how I would build something on top of AWS, but then move this thing on to Fastly instead of CloudFront because CloudFront is terrible. If you work there, you're not going to be able to say the same thing. So, even if you're not being effusive with praise, there are certain guardrails and constraints that keep you from saying what you might otherwise, just based upon the sheer self-interest that comes from the company whose product or service you're talking about is also signing your paycheck and choosing to continue to do so.Matty: And I think even less about it because that's where your paycheck is coming. It's also just a—there's a gravitational pull towards those solutions because that's just what you're spending your day with, right? You know—Corey: Yeah. 
And you also don't want to start and admit even to yourself, in some cases, that okay, this aspect of what our company does is terrible, so companies—people shouldn't use it. You want to sort of ignore that, on some level, psychologically because that dissonance becomes harmful.Matty: Yeah. And I think there's—so again, this is where things get nuanced and get to levels. Because if you have the right amount of psychological safety in your organization, the organization understands what it's about to that. Because even people whose job is to be a community person should be able to say, “Hey, this is my actual opinion on this. And it might be contrary to the go-to-market where that comes in.”But it's hard, especially when it gets filtered through multiple layers and now you've got a CEO who doesn't understand that nuance who goes, “Wait, why was Corey on some podcast saying that the Twitter for Pets API is not everything it could possibly be?” So, I do think—I will say this—I do think that organizations and leadership are understanding this more than they might have in the past, so we are maybe putting on ourselves this belief that we can't be as fully honest, but even if it's not about hiding the warts, even if it's just a matter of also, you're just like, hey, chances are—plus also to be quite frank, if I work at the company, I probably have access to way more shit than I would have to pay for or do whatever and I know the right way. But here's the trick, and I won't even say it's a dogfooding thing, but if you are not learning and thinking about things the way that your users do—and I will even say that that's where—it is the users, which are the community, that community or the people that use your product or are connected to it, they don't use it; they may be anecdotal—or not anecdotally, maybe tangentially connected. I will give an example. 
And there was a place I was working where it was very clear, like, we had a way to you know, do open-source contributions back of a type of a provider plug-in, whatever you want to call it and I worked at the company and I could barely figure out how to follow the instructions.Because it made a lot of sense to someone who built that software all day long and knew the build patterns, knew all that stuff. So, if you were an engineer at this company, “Well, yeah, of course. You just do this.” And anybody who puts the—connects the dots, this has gotten better—and this was understood relatively quickly as, “Oh, this is the problem. Let's fix it.” So, the thing is, the reason why I bring this up is because it's not something anybody does intentionally because you don't know what you don't know. And—Corey: Oh, I'm not accusing anyone of being a nefarious actor in any of this. I also wonder if part of this is comes from your background as being heavily involved in the Chef community as a Chef employee and as part of the community around that, which is inherently focused on an open-source product that a company has been built around, whereas my primary interaction with community these days is the AWS community, where it doesn't matter whether you're large or small, you are not getting much, if anything, for free from AWS; you're all their customers and you don't really have input into how something gets built, beyond begging nicely.Matty: That's definitely true. And I think we saw that and there was things, when we look at, like, how community, kind of, evolved or just sort of happened at Chef and why we can't recreate it the same way is there was a certain inflection point of the industry and the burgeoning DevOps movement, and there wasn't—you know, so a lot of that was there. 
But one of the big problems, too, is, as Corey said, everybody—I shouldn't say every, but I've from the A—all the way up to AWS to your smaller startups will have this problem of where you end up hiring in—whether you want to or not—all of your champions and advocates and your really strong community members, and then that ends up happening. So, number one, that's going to happen. So frankly, if you don't push towards this idea, you're actually going to have people not want to come work because you should be able to be still the member that you were before.
And the other thing is that at certain size, like, at the size of a hyperscaler, or, you know, a Microsoft—well, anybody—well, Microsoft's not a hyperscaler, but you know what I'm saying. Like, very, very large organizations, your community folks are not necessarily the ones doing that hiring away. And as much as they might—you know, and again, I may be running the community champion program at Microsoft and see that you want—you know, but that Joe Schmo is getting hired over into engineering. Like, I'm not going to hire Joe because it hurts me, but I can't say you can't, you know? So this is a problem at the large size.
And at the smaller size, when you're growing that community, it happens, too, because it's really exciting. When there's a place that you're part of that community, especially when there's a strong feel, like going to work for the mothership, so to speak is, like, awesome. So again, to give an example, I was a member of the Chef community, I was a user, a community person well before, you know, I went and, you know, had a paycheck coming out of that Seattle office. And it was, like, the coolest thing in the world to get a job offer from Ch—like, I was like, “Oh, my God. 
I get to actually go work there now.” Right?
And when I was at Pulumi, there were quite a few people I could think of who I knew through the community who then got jobs at Pulumi and were so excited, and I imagine still excited, you know? I mean, that was awesome to do. So, it's hard because when you get really excited about a technology, then being able to say, “Wait, I can work on this all the time?” That sounds awesome, right? So like, you're going to have that happen.
So, I think what you have to do is rather than prevent it from happening because number one, like, you don't want to actually prevent that from happening because those people will actually be really great additions to your organization in lots of ways. Also, you're not going to stop it from happening, right? I mean, it's also just a silly way to do it. All you're going to do is piss people off, and say, like, “Hey, you're not allowed to work here because we need you in the community.” Then they're going to be like, “Great. Well, guess what I'm not a part of anymore now, jerk?” Right? You know [laugh] I mean so—
Corey: Exactly.
Matty: Your [unintelligible 00:18:50] stops me. So, that doesn't work. But I think to your point, you talked about, like, okay, if you have what is ostensibly a community project, but all the maintainers are from one—are from your company, you know? Or so I'm going to point to an example of, we had—you know, this was at Pulumi, we had a Champions program called Puluminaries, and then there's something similar to like Vox Populi, but it was kind of the community that was not run by Pulumi Inc. in that case.
Now, we helped fund it and helped get it started, but there were rules about the, you know, the membership of the leadership, steering committee or board or whatever it was called, there was a hard limit on the number of people that could be Pulumi employees who were on that board. 
And it actually, as I recall when I was leaving—I imagine this is not—[unintelligible 00:19:41] does sometimes have to adjust a couple of things because maybe those board members become employees and now you have to say, you can't do that anymore or we have to take someone down. But the goal was to actually, you know, basically have—you know, Pulumi Corp wanted to have a voice on that board because if for no other reason, they were funding it, but it was just one voice. It wasn't even a majority voice. And that's a hard sell in a lot of places too because you lose control over that.There's things I know with, uh—when I think about, like, running meetup communities, like, we might be—well I mean, this is not a big secret, I mean because it's been announced, but we're—you know, Aiven is helping bootstrap a bunch of data infrastructure meetups around the world. But they're not Aiven meetups. Now, we're starting them because they have to start, but pretty much our approach is, as soon as this is running and there's people, whether they work here, work with us or not, they can take it, right? Like, if that's go—you know? And being able to do that can be really hard because you have to relinquish the control of your community.And I think you don't have to relinquish a hundred percent of that control because you're helping facilitate it because if it doesn't already have its own thing—to make sure that things like code of conduct and funding of it, and there's things that come along with the okay, we as an organization, as a company that has dollars and euros is going to do stuff for this, but it's not ours. And that's the thing to remember is that your community does not belong to you, the company. You are there to facilitate it, you are there to empower it, you're there to force-multiply it, to help protect it. 
And yeah, you will probably slurp a whole bunch of value out of it, so this is not magnanimous, but if you want it to actually be a place it's going to work, it kind of has to be what it wants to be. But by the same token, you can't just sort of sit there and be like, “I'm going to wait for this community to grow up around me without anything”—you know.
So, that's why you do have to start one if there is quote-unquote—maybe if there's no shape to one. But yeah, I think that's… it is different when it's something that feels a little—I don't even want to say that it's about being open-source. It's a little bit about it less of it being a SaaS or a service, or if it's something that you—I don't know.
Corey: This episode is sponsored in part by Honeycomb. I'm not going to dance around the problem. Your. Engineers. Are. Burned. Out. They're tired from pagers waking them up at 2 am for something that could have waited until after their morning coffee. Ring Ring, Who's There? It's Nagios, the original call of duty! They're fed up with relying on two or three different “monitoring tools” that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there's a better way. Observability tools like Honeycomb (and very little else because they do admittedly set the bar) show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It's great for your business, great for your engineers, and, most importantly, great for your customers. Try FREE today at honeycomb.io/screaminginthecloud. That's honeycomb.io/screaminginthecloud.
Corey: Yeah, I think you're onto something here. I think another aspect where I've found it to be annoying is when companies view their community as, let's hire them all. And I don't think it ever starts that way. 
I think that it starts as, well these are people who are super-passionate about this, and they have great ideas and they were great to work with. Could we hire them?And the answer is, “Oh, wait. You can give me money for this thing I've been doing basically for free? Yeah, sure, why not?” And that's great in the individual cases. The problem is, at some point, you start to see scenarios where it feels like, if not everyone, then a significant vocal majority of the community starts to work there.Matty: I think less often than you might think is it done strategically or on purpose. There have been exceptions to that. There's one really clear one where it feels like a certain company a few years ago, hired up all the usual suspects of the DevOps community. All of a sudden, you're like, oh, a dozen people all went to go work at this place all at once. And the fun thing is, I remember feeling a little bit—got my nose a little out of joint because I was not the hiring mana—like, I knew the people.I was like, “Well, why didn't you ask me?” And they said, “Actually, you are more important to us not working here.” Now, that might have just been a way to sell my dude-in-tech ego or not, but whether or not that was actually true for me or not, that is a thing where you say you know, your folks—but I do think that particular example of, like, okay, I'm this, that company, and I'm going to go hire up all the usual suspects, I think that's less. I think a lot of times when you see communities hire up those people, it's not done on purpose and in fact, it's probably not something they actually wanted to do in mass that way. 
But it happens because people who are passionate about your product, it's like I said before, it actually seems pretty cool to go work on it as your main thing.But I can think of places I've been where we had, you know—again, same thing, we had a Pulumi—we had someone who was probably our strongest, loudest, most vocal community member, and you know, I really wanted to get this person to come join us and that was sort of one of the conversations. Nobody ever said, “We won't offer this person a job if they're great.” Like, that's the thing. I think that's actually kind of would be shitty to be like, “You're a very qualified individual, but you're more important to me out in the community so I'm not going to make your job offer.” But it was like, Ooh, that's the, you know—it'd be super cool to have this person but also, not that that should be part of our calculus of decision, but then you just say, what do you do to mitigate that?Because what I'm concerned about is people hearing this the wrong way and saying, “There's this very qualified individual who wants to come work on my team at my company, but they're also really important to our community and it will hurt our community if they come work here, so sorry, person, we're not going to give you an opportunity to have an awesome job.” Like, that's also thinking about the people involved, too. But I know having talked to folks that lots of these different large organizations that have this problem, generally, those community folks, especially at those places, they don't want this [laugh] happening. They get frustrated by it. So, I mean, I'll tell you, it's you know, the—AWS is one of them, right?They're very excited about a lot of the programs and cool people coming from community builders and stuff and Heroes, you know. 
On one hand, it's incredibly awesome to have a Hero come work at AWS, but it hurts, right, because now they're not external anymore.
Corey: And you stop being a Hero in that case, as well.
Matty: Yeah. You do, yeah.
Corey: Of course, they also lose the status if they go to one of their major competitors. So like, let me get this straight. You can't be a Hero if you work for AWS or one of its competitors. And okay, how are there any Heroes left at all at some point? And the answer is, they bound it via size and a relatively small list of companies. But okay.
Matty: So, thinking back to your point about saying, okay, so if you work at the company, you lose some authenticity, some impartiality, some, you know… I think, rather than just saying, “Well, you're not part”—because that also, honestly, my concern is that your blog post is now going to be ammunition for all the people who don't want to act as members of the community for the company they work for now. They're going to say, well, Corey told me I don't have to. So, like I said, I've been spending the last few years tilting at the opposite windmill, which is getting people that are not on the community team to take part in community summits and discourse and things like that, like, you know, for that's—so I think the thing is, rather than saying, “Well, you can't,” or, “You aren't,” it's like, “Well, what do you do to mitigate those things?”
Corey: Yeah, it's a weird thing because taking AWS as the example that I've been beating up on a lot, the vast majority of their employees don't know the community exists in any meaningful sense. Which, no fault to them. The company has so many different things, no one keeps up with it all. But it's kind of nuts to realize that there are huge communities of people out there using a thing you have built and you do not know that those users exist and talk to each other in a particular watering hole. And you of course, as a result, have no presence there.
I think that's the wrong direction, too. But—
Matty: Mm-hm.
Corey: Observing the community and being part of the community, I think there's a difference. Are you a biologist or are you a gorilla?
Matty: Okay, but [sigh] I guess that's sort of the difference, too which—and it's hard, it's very hard to not just observe. Because I think that actually even taking the mentality of, “I am here to be Jane Goodall, Dr. Jane Goodall, and observe you while I live amongst you, but I'm not going to actually”—although maybe I'm probably doing disservice—I'm remembering my Goodall is… she was actually more involved. May be a bad example.
Corey: Yeah. So, that analogy does fall apart a little bit.
Matty: It does fall apart a little bit—
Corey: Yeah.
Matty: But it's you kind of am I sitting there taking field notes or am I actually engaging with you? Because there is a difference. Even if your main reason for being there is just purely to—I mean, this is not the Prime Directive. It's not Star Trek, right? You're not going to like, hold—you don't need to hold—I mean, do you have to hold yourself aloof and say, “I don't participate in this conversation; I'm just here to take notes?”
I think that's very non-genuine at that point. That's over-rotating the other way. But I think it's a matter of in those spaces—I think there's two things. I think you have to have a way to be identified as you are an employee because that's just disclosure.
Corey: Oh, I'm not suggesting by any stretch of the imagination, people work somewhere but not admit that they work somewhere when talking about the company. That's called fraud.
Matty: Right. No, no, and I don't think it's even—but I'm saying beyond just, if it's not, if you're a cop, you have to tell me, right?
Corey: [laugh].
Matty: It's like, it's not—if asked, I will tell you I work at AWS. It's like in that place, it should say, “I am an AWS em—” like, I should be badged that way, just so it's clear. I think that's actually helpful in two ways.
It's also helpful because it says like, okay, maybe you have a connection you can get for me somehow. Like, you might actually have some different insight or a way to chase something that, you know, it's not necessarily just about disclosure; it's also helpful to know.
But I think within those spaces, that disclosure—or not disclosure, but being an employee does not offer you any more authority. And part of that is just having to be very clear about how you're constructing that community, right? And that's sort of the way that I think about it is, like, when we did the Pulumi Community Summit about a year ago, right? It was an online, you know, thing we did, and the timing was such that we didn't have a whole lot of Pulumi engineers were able to join, but when we—and it's hard to say we're going to sit in an open space together and everybody is the same here because people also—here's the difference. You say you want this authority? People will want that authority from the people that work at the company and they will always go to them and say, like, “Well, you should have this answer. Can you tell me about this? Can you do this?”
So, it's actually hard on both cases to have that two-way conversation unless you set the rules of that space such as, “Okay, I work at Aiven, but when I'm in this space, short of code of conduct or whatever, if I have to be doing that thing, I have no more authority on this than anyone else.” I'm in this space in the same way everyone else is. You can't let that be assumed.
Corey: Oh, and big companies do. It's always someone else's… there's someone else's department. Like, at some level, it feels like when you work in one of those enormous orgs, it's your remit is six inches wide.
Matty: Well, right. Right. So, I think it's like your authority exists only so far as it's helpful to somebody. If I'm in a space as an Aivener, I'm there just as Matty the person.
But I will say I work at Aiven, so if you're like, “God, I wish that I knew who was the person to ask about this replication issue,” and then I can be like, “Aha, I actually have backchannel. Let me help you with that.” But if I can say, “You know what? This is what I think about Kafka and I think why this is whatever,” like, you can—my opinion carries just as much weight as anybody else's, so to speak. Or—
Corey: Yeah. You know, it's also weird. Again, community is such a broad and diverse term, I find myself in scenarios where I will observe and talk to people inside AWS about things, but I never want to come across as gloating somehow, that oh, I know internal people that talk to you about this and you don't. Like, that's never how I want to come across. And I also, I never see the full picture; it's impossible for me to, so I never make commitments on behalf of other people. That's a good way to get in trouble.
Matty: It is. And I think in the case of, like, someone like you who's, you know, got the connections you have or whatever, it's less likely for that to be something that you would advertise for a couple of reasons. Like, nobody should be advertising to gloat, but also, part of my remit as a member of a community team is to actually help people. Like, you're doing it because you want to or because it serves you in a different way. Like, that is literally my job.
So like, it shouldn't be, like—like, because same thing, if you offer up your connections, now you are taking on some work to do that. Someone who works at the company, like, yes, you should be taking on that work because this is what we do. We're already getting paid for it, you know, so to speak, so I think that's the—
Corey: Yeah.
Matty: —maybe a nuance, but—
Corey: Every once in a while, I'll check my Twitter spam graveyard, [unintelligible 00:32:01] people asking me technical questions months ago about various things regarding AWS and whatnot.
And that's all well and good; the problem I have with it is that I'm not a support vector. I don't represent for the company or work for them. Now, if I worked there, I'd feel obligated to make sure this gets handed to the right person. And that's important.
The other part of it, though, is okay, now that that's been done and handed off, like do I shepherd it through the process? Eh. I don't want people to get used to asking people in DMs because again, I consider myself to be a nice guy, but if I'm some nefarious jerk, then I could lead them down a very dark path where I suddenly have access to their accounts. And oh, yeah, go ahead and sign up for this thing and I'll take over their computer or convince them to pay me in iTunes gift cards or something like that. No, no, no. Have those conversations in public or through official channels, just because I don't, I don't think you want to wind up in that scenario.
Matty: So, my concern as well, with sort of taking the tack of you are just an observer of the community, not a part of it is, that actually can reinforce some pretty bad behavior from an organization towards how they treat the community. One of the things that bothers me—if we're going to go on a different rant about devrelopers like myself—is I like to say that, you know, we pride ourselves as DevRels as being very empathetic and all this stuff, but very happy to shit all over people that work in sales or marketing, based on their job title, right? And I'm like, “Wow, that's great,” right? We're painting with this broad brush. Whereas in reality, we're not separate from.
And so, the thing is, when you treat your community as something separate from you, you are treating it as something separate from you. And then it becomes a lot easier also, to not treat them like people and treat them as just a bunch of numbers and treat them as something to have value extracted from rather than it—this is actually a bunch of humans, right?
And if I'm part of that, then I'm in the same Dunbar number a little bit, right? I'm in the same monkey sphere as those people because me, I'm—whoever; I'm the CTO or whatever, but I'm part of this community, just like Joe Smith over there in Paducah, you know, who's just building things for the first time. We're all humans together, and it helps to not treat it as the sort of amorphous blob of value to be extracted.
So, I think that's… I think all of the examples you've been giving and those are all valid concerns and things to watch out for, the broad brush if you're not part of the community if you work there, my concern is that that leads towards exacerbating already existing bad behavior. You don't have to convince most of the people that the community is separate from them. That's what I'm sort of getting at. I feel like in this work, we've been spending so much time to try to get people to realize they should be acting like part of their larger community—and also, Corey, I know you well enough to know that, you know, sensationalism to make a point [laugh] works to get somebody to join—
Corey: I have my moments.
Matty: Yeah, yeah, yeah. I mean, there's I think… I'll put it this way. I'm very interested to see the reaction, the response that comes out in, well now, for us a couple of days, for you the listener, a while ago [laugh] when that hits because I think it is a, I don't want to say it's controversial, but I think it's something that has a lot of, um… put it this way, anything that's simple and black and white is not good for discussion.
Corey: It's nuanced. And I know that whatever I wrote in 1200 words is not going to be as nuanced as the conversation we just had, either, so I'm sure people will have opinions on it. That'd be fun. It'd be a good excuse for me to listen.
Matty: Exactly [laugh].
And then we'll have to remember to go back and find—I'll have to do a little Twitter search for the dates.
Corey: We'll have to do another discussion on this, if anything interesting comes out of it.
Matty: Actually, that would be funny. That would be—we could do a little recap.
Corey: It would. I want to thank you so much for being so generous with your time. Where can people find you if they want to learn more?
Matty: Well, [sigh] for the moment, [sigh] who knows what will be the case when this comes out, but you can still find me on Twitter at @mattstratton. I'm also at hackie-derm dot io—sorry, hackyderm.io. I keep wanting to say hackie-derm, but hackyderm actually works better anyway and it's funnier. But [hackyderm.io/@mattstratton](https://hackyderm.io/@mattstratton) is my Mastodon. LinkedIn; I'm around there. I need to play more at that. Also again, I don't know when this is coming out, so you won't find me out traveling as much as you might have before, but DevOpsDays Chicago is coming up August 9th and 10th in Chicago, so at the time of listening to this, I'm sure our program will have been posted. But please come and join us. It will be our ninth time of hosting a DevOpsDays Chicago. And I have decided I'm sticking around for ten, so next year will be my last DevOpsDays that I'm running. So, this is the penultimate. And we always know that the penultimate is the best.
Corey: Absolutely. Thanks again for your time. It's appreciated. Matty Stratton, Director of Developer Relations at Aiven. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment talking about how I completely missed the whole point of this community and failing to disclose that you are in fact one of the producers of the show.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Screaming in the Cloud
The Realities of Working in Data with Emily Gorcenski

Mar 7, 2023 36:22


Emily Gorcenski, Data & AI Service Line Lead at Thoughtworks, joins Corey on Screaming in the Cloud to discuss how big data is changing our lives - both for the better, and the challenges that come with it. Emily explains how data is only important if you know what to do with it and have a plan to work with it, and why it's crucial to understand the use-by date on your data. Corey and Emily also discuss how big data problems aren't universal problems for the rest of the data community, how to address the ethics around AI, and the barriers to entry when pursuing a career in data.
About Emily
Emily Gorcenski is a principal data scientist and the Data & AI Service Line Lead of ThoughtWorks Germany. Her background in computational mathematics and control systems engineering has given her the opportunity to work on data analysis and signal processing problems from a variety of complex and data intensive industries. In addition, she is a renowned data activist and has contributed to award-winning journalism through her use of data to combat extremist violence and terrorism. The opinions expressed are solely her own.
Links Referenced:
ThoughtWorks: https://www.thoughtworks.com/
Personal website: https://emilygorcenski.com
Twitter: https://twitter.com/EmilyGorcenski
Mastodon: https://mastodon.green/@emilygorcenski@indieweb.social
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. My guest today is Emily Gorcenski, who is the Data and AI Service Line Lead over at ThoughtWorks. Emily, thank you so much for joining me today. I appreciate it.
Emily: Thank you for having me.
I'm happy to be here.
Corey: What is it you do, exactly? Take it away.
Emily: Yeah, so I run the data side of our business at ThoughtWorks, Germany. That means data engineering work, data platform work, data science work. I'm a data scientist by training. And you know, we're a consulting company, so I'm working with clients and trying to help them through the, sort of, morphing landscape that data is these days. You know, should we be migrating to the cloud with our data? What can we migrate to the cloud with our data? What should we be doing with our data scientists and how do we make our data analysts' lives easier? So, it's a lot of questions like that and trying to figure out the strategy and all of those things.
Corey: You might be one of the most perfectly positioned people to ask this question to because one of the challenges that I've run into consistently and persistently—because I watch a lot of AWS keynotes—is that they always come up with the same talking point, that data is effectively the modern gold. And data is what unlocks value to your busin—“Every business agrees,” because someone who's dressed in what they think is a nice suit on stage is saying that it's, “Okay, you're trying to sell me something. What's the deal here?” Then I check my email and I discover that Amazon has sent me the same email about the same problem for every region I've deployed things to in AWS. And, “Oh, you deploy this to one of the Japanese regions. We're going to send that to you in Japanese as a result.”
And it's like, okay, for a company that says data is important, they have no idea who any of their customers are at this point, is the takeaway here. How real is, “Data is important,” versus, “We charge by the gigabyte so you should save all of your data and then run expensive things on top of it.”
Emily: I think data is very important, if you know what you're going to do with it and if you have a plan for how to work with it.
I think if you look at the history of computing, of technology, if you go back 20 years to maybe the early days of the big data era, right? Everyone's like, “Oh, we've got big data. Data is going to be big.” And for some reason, we never questioned why, like, we were thinking that the ‘big' in ‘big data' meant ‘big' as in volume and not ‘big' as in ‘big pharma.'
This sort of revolution never really happened for most companies. Sure, some companies got a lot of value from the, sort of, data mining and just gather everything and collect everything and if you hit it with a big computational hammer, insights will come out and somehow those insights will make you money through magic. The reality is much more prosaic. If you want to make money with data, you have to have a plan for what you're going to do with data. You have to know what you're looking for and you have to know exactly what you're going to get when you look at your data and when you try to answer questions with it.
And so, when we see somebody like Amazon not being able to correlate the fact that you're the account owner for all of these different accounts and that the language should be English and all of these things, that's part of the operational problem because it's annoying, to try to do joins across multiple tables in multiple regions and all of those things, but it's also part—you know, nobody has figured out how this adds value for them to do that, right? There's a part of it where it's like, this is just professionalism, but there's a part of it, where it's also like… whatever. You've got Google Translate. Figure it out yourself. We're just going to get through it.
I think that… as time has evolved from the initial waves of the big data era into the data science era, and now we're in, you know, all sorts of different architectures and principles and all of these things, most companies still haven't figured out what to do with data, right?
They're still investing a ton of money to answer the same analytics questions that they were answering 20 years ago. And for me, I think that's a disappointment in some regards because we do have better tools now. We can do so many more interesting things if you give people the opportunity.
Corey: One of the things that always seemed a little odd was, back when I wielded root credentials in anger—‘anger,' of course, being my name for the production environment, as opposed to, “Theory,” which is what I call staging because it works in theory, but not in production. I digress—it always felt like I was getting constant pushback from folks of, “You can't delete that data. It's incredibly important because one day, we're going to find a way to unlock the magic of it.” And it's, “These are web server logs that are 15 years old, and 98% of them by volume are load balancer health checks because it turns out that back in those days, baby seals got more hits than our website did, so that's not really a thing that we wind up—that's going to add much value to it.” And then from my perspective, at least, given that I tend to live, eat, sleep, breathe cloud these days, AWS did something that was refreshingly customer-obsessed when they came out with Glacier Deep Archive.
Because the economics of that are if you want to store a petabyte of data, with a 12-hour latency on request for things like archival logs and whatnot, it's $1,000 a month per petabyte, which is okay, you have now hit a price point where it is no longer worth my time to argue with you. We're just not going to delete anything ever again. Problem solved. Then came GDPR, which is neither here nor there and we actually want to get rid of those things for a variety of excellent legal reasons. And the dance continues.
But my argument against getting rid of data because it's super expensive no longer holds water in the way that it once did for anything remotely resembling a reasonable amount of data.
Then again, that's getting reinvented all the time. I used to be very, I guess we'll call it, I guess, a data minimalist. I don't want to store a bunch of data, mostly because I'm not a data person. I am very bad at thinking in that way.
I consider SQL to be the chess of the programming world and I'm not particularly great at it. And I'm also unlucky and have an aura, so if I destroy a bunch of stateless web servers, okay, we can all laugh about that, but let's keep me the hell away from the data warehouse if we still want a company tomorrow morning. And that was sort of my experience. And I understand my bias in that direction. But I'm starting to see magic get unlocked.
Emily: Yeah, I think, you know, you said earlier, there's, like, this mindset, like, data is the new gold or data is new oil or whatever. And I think it's actually more true that data is the new milk, right? It goes bad if you don't use it, you know, before a certain point in time. And at a certain point in time, it's not going to be very offensive if you just leave it locked in the jug, but as soon as you try to open it, you're going to have a lot of problems. Data is very, very cheap to store these days. It's very easy to hold data; it's very expensive to process data.
And I think that's where the shift has gone, right? There's sort of this, like, Oracle DBA legacy of, like, “Don't let the software developers touch the prod database.” And they've kind of kept their, like, arcane witchcraft to themselves, and that mindset has persisted. But now it's sort of shifted into all of these other architectural patterns that are just abstractions on top of this, don't let the software engineers touch the data store, right? So, we have these, like, streaming-first architectures, which are great. They're great for software devs.
And they're great for data engineers who like to play with big powerful technology.
They're terrible if you want to answer a question, like, “How many customers did I have yesterday?” And these are the things that I think are some of the central challenges, right? A Kappa architecture—you know, streaming-first architecture—is amazing if you want to improve your application developer throughput. And it's amazing if you want to build real-time analytics or streaming analytics into your platform. But it's terrible if you want your data lake to be navigable. It's terrible if you want to find the right data that makes sense to do the more complex things. And it becomes very expensive to try to process it.
Corey: One of the problems I think I have is that if I take a look at the data volumes that I work with in my day-to-day job, I'm dealing with AWS billing data as spit out by the AWS billing system. And there isn't really a big data problem here. If you take a look at some of the larger clients, okay, maybe I'm trying to consume a CSV that's ten gigabytes. Yes, Excel is going to violently scream itself to death if I try to wind up loading it there, and then my computer smells like burning metal all afternoon. But if it fits in RAM, it doesn't really feel like it's a big data problem, on some level.
And it just feels that when I look at the landscape of all the different tools you can use for things like this, they just feel like it's more or less, hmm, “I have a loose thread on my shirt. Could you pass me that chainsaw for a second?” It just seems like stupendous overkill for anything that I'm working with. Counterpoint; that the clients I'm working with have massive data farms and my default response when I meet someone who's very good at an area that I don't do a lot of work in is—counterintuitively to what a lot of people apparently do on Twitter—is not the default assumption of oh, “I don't know anything about that space.
It must be worthless and they must be dumb.”
No. That is not the default approach to take anything, from my perspective. So, it's clear there's something very much there that I just don't see slash understand. That is a very roundabout way of saying what could be uncharitably distilled down to, “So, is your entire career bullshit?” But no, it is clearly not.
There is value being extracted from this and it's powerful. I just think that there's been an industry-wide, relatively poor job done of explaining that value in ways that don't come across as contrived or profoundly disturbing.
Emily: Yeah, I think there's a ton of value in doing things right. It gets very complicated to try to explain the nuances of when and how data can actually be useful, right? Oftentimes, your historical data, you know, it really only tells you about what happened in the past. And you can throw some great mathematics at it and try to use it to predict the future in some sense, but it's not necessarily great at what happens when you hit really hard changes, right?
For example, when the Coronavirus pandemic hit and purchaser and consumer behavior changed overnight. There was no data in the data set that explained that consumer behavior. And so, what you saw is a lot of these things like supply chain issues, which are very heavily data-driven on a normal circumstance, there was nothing in that data that allowed those algorithms to optimize for the reality that we were seeing at that scale, right? Even if you look at advanced logistics companies, they know what to do when there's a hurricane coming or when there's been an earthquake or things like that. They have disaster scenarios.
But nobody has ever done anything like this at the global scale, right? And so, what we saw was this hard reset that we're still feeling the repercussions of today.
Yes, there were people who couldn't work and we had lockdowns and all that stuff, but we also have an effect from the impact of the way that we built the systems to work with the data that we need to shuffle around. And so, I think that there is value in being able to process these really, really large datasets, but I think that actually, there's also a lot of value in being able to solve smaller, simpler problems, right? Not everything is a big data problem, not everything requires a ton of data to solve.
It's more about the mindset that you use to look at the data, to explore the data, and what you're doing with it. And I think the challenge here is that, you know, everyone wants to believe that they have a big data problem because it feels like you have to have a big data problem if you—
Corey: All the cool kids are having this kind of problem.
Emily: You have to have big data to sit at the grownup's table. And so, what's happened is we've optimized a lot of tools around solving big data problems and oftentimes, these tools are really poor at solving normal data problems. And there's a lot of money being spent in a lot of overkill engineering in the data space.
Corey: On some level, it feels like there has been a dramatic misrepresentation of this. I had an article that went out last year where I called machine-learning selling pickaxes into a digital gold rush. And someone I know at AWS responded to that in probably the best way possible—she works over on their machine-learning group—she sent me a foam Minecraft pickaxe that now is hanging on my office wall. And that gets more commentary than anything, including the customized oil painting I have of Billy the Platypus fighting an AWS Billing Dragon. No, people want to talk about the Minecraft pickaxe.
It's amazing. It's first, where is this creativity in any of the marketing that this department is putting out? But two, it's clearly not accurate.
And what it took for me to see that was a couple of things that I built myself. I built a Twitter thread client that would create Twitter threads, back when Twitter was a place that wasn't overrun by some of the worst people in the world and turned into BirdChan.
But that was great. It would automatically do OCR on images that I uploaded, it would describe the image to you using Azure's Cognitive Vision API. And that was magic. And now I see things like ChatGPT, and that's magic. But you take a look at the way that the cloud companies have been describing the power of machine learning in AI, they wind up getting someone with a doctorate whose first language is math getting on stage for 45 minutes and just yelling at you in Star Trek technobabble to the point where you have no idea what the hell they're saying.
And occasionally other data scientists say, “Yeah, I think he's just shining everyone on at this point. But yeah, okay.” It still becomes unclear. It takes seeing the value of it for it to finally click. People make fun of it, but the Hot Dog, Not A Hot Dog app is the kind of valuable breakthrough that suddenly makes this intangible thing very real for people.
Emily: I think there's a lot of impressive stuff and ChatGPT is fantastically impressive. I actually used ChatGPT to write a letter to some German government agency to deal with some bureaucracy. It was amazing. It did it, was grammatically correct, it got me what I needed, and it saved me a ton of time. I think that these tools are really, really powerful.
Now, the thing is, not every company needs to build its own ChatGPT. Maybe they need to integrate it, maybe there's an application for it somewhere in their landscape of product, in their landscape of services, in the landscape of their internal tooling. And I would be thrilled actually to see some of that be brought into reality in the next couple of years.
But you also have to remember that ChatGPT is not something that came because we had, like, a really great breakthrough in AI last year or something like that. It's stacked upon 40 years of research.
We've gone through three new waves of neural networking in that time to get to this point, and it solves one class of problem, which is honestly a fairly narrow class of problem. And so, what I see is a lot of companies that have much more mundane problems, but where data can actually still really help them. Like how do you process Cambodian driver's licenses with OCR, right? These are the types of things that if you had a training data set that was every Cambodian person's driver's license for the last ten years, you're still not going to get the data volumes that even a day worth of Amazon's marketplace generates, right? And so, you need to be able to solve these problems still with data without resorting to the cudgel that is a big data solution, right?
So, there's still a niche, a valuable niche, for solving problems with data without having to necessarily resort to, we have to load the entire internet into our stream and throw GPUs at it all day long and spend hundreds of—tens of millions of dollars in training. I don't know, maybe hundreds of millions; however much ChatGPT just raised. There's an in-between that I think is vastly underserved by what people are talking about these days.
Corey: There is so much attention being given to this and it feels almost like there has been a concerted and defined effort to almost talk in circles and remove people from the humanity and the human consequences of what it is that they're doing. When I was younger, in my more reckless years, I was never much of a fan of the idea of government regulation. But now it has become abundantly clear that our industry, regardless of how you want to define industry, how—describe a society—cannot self-regulate when it comes to data that has the potential to ruin people's lives.
I mean, I spent a fair bit of my time in my career working in financial services in a bunch of different ways. And at least in those jobs, it was only money.The scariest thing I ever dealt with, from a data perspective is when I did a brief stint at Grindr because that was the sort of problem where if that data gets out, people will die. And I have not had to think about things like that have that level of import before or since, for which I'm eternally grateful. “It's only money,” which is a weird thing for a guy who fixes cloud bills for a living to say. And if I say that in a client call, it's not going to go very well. But it's the truth. Money is one of those things that can be fixed. It can be addressed in due course. There are always opportunities there. Someone just been outed to their friends, family, and they feel their life is now in shambles around them, you can't unring that particular bell.Emily: Yeah. And in some countries, it can lead to imprisonment, or—Corey: It can lead to death sentences, yes. It's absolutely not acceptable.Emily: There's a lot to say about the ethics of where we are. And I think that as a lot of these high profile, you know, AI tools have come out over the last year or so, so you know, Stable Diffusion and ChatGPT and all of this stuff, there's been a lot of conversation that is sort of trying to put some counterbalance on what we're seeing. And I don't know that it's going to be successful. I think that, you know, I've been speaking about ethics and technology for a long time and I think that we need to mature and get to the next level of actually addressing the ethical problems in technology. Because it's so far beyond things like, “Oh, you know, if there's a biased training data set and therefore the algorithm is biased,” right?Everyone knows that by now, right? And the people who don't know that, don't care. 
We need to get much beyond where, you know, these conversations about ethics and technology are going because it's a manifold problem. We have issues with the people labeling this data are paid, you know, pennies per hour to deal with some of the most horrific content you've ever seen. I mean, I'm somebody who has immersed myself in a lot of horrific content for some of the work that I have done, and this is, you know, so far beyond what I've had to deal with in my life that I can't even imagine it. You couldn't pay me enough money to do it and we're paying people in developing nations, you know, a buck-thirty-five an hour to do this. I think—Corey: But you must understand, Emily, that given the standard of living where they are, that that is perfectly normal and we wouldn't want to distort local market dynamics. So, if they make a buck-fifty a day, we are going to be generous gods and pay them a whopping dollar-seventy a day, and now we feel good about ourselves. And no, it's not about exploitation. It's about raising up an emerging market. And other happy horseshit that lies people tell themselves.Emily: Yes, it is. Yes, it is. And we've built—you know, the industry has built its back on that. It's raised itself up on this type of labor. It's raised itself up on taking texts and images without permission of the creators. And, you know, there's—I'm not a lawyer and I'm not going to play one, but I do know that derivative use is something that at least under American law, is something that can be safely done. It would be a bad world if derivative use was not something that we had freely available, I think, and on the balance.But our laws, the thing is, our laws don't account for the scale. Our laws about things like fair use, derivative use, are for if you see a picture and you want to take your own interpretation, or if you see an image and you want to make a parody, right? It's a one-to-one thing. 
You can't make 5 million parody images based on somebody's art, yourself. These laws were never built for this scale.And so, I think that where AI is exploiting society is it's exploiting a set of ethics, a set of laws, and a set of morals that are built around a set of behavior that is designed around normal human interaction scales, you know, one person standing in front of a lecture hall or friends talking with each other or things like that. The world was not meant for a single person to be able to speak to hundreds of thousands of people or to manipulate hundreds of thousands of images per day. It's actually—I find it terrifying. Like, the fact that me, a normal person, has a Twitter following that, you know, if I wanted to, I can have 50 million impressions in a month. This is not a normal thing for a normal human being to have.And so, I think that as we build this technology, we have to also say, we're changing the landscape of human ethics by our ability to act at scale. And yes, you're right. Regulation is possibly one way that can help this, but I think that we also need to embed cultural values in how we're using the technology and how we're shaping our businesses to use the technology. It can be used responsibly. I mean, like I said, ChatGPT helped me with a visa issue, sending an email to the immigration office in Berlin. That's a fantastic thing. That's a net positive for me; hopefully, for humanity. I wasn't about to pay a lawyer to do it. But where's the balance, right? And it's a complex topic.Corey: It is. It absolutely is. There is one last topic that I would like to talk to you about that's a little less heavy. And I've got to be direct with you that I'm not trying to be unkind, but you've disappointed me. 
Because you mentioned to me at one point, when I asked how things were going in your AWS universe, you said, “Well, aside from the bank heist, reasonably well.”And I thought that you were blessed as with something I always look for, which is the gift of glorious metaphor. Unfortunately, as I said, you've disappointed me. It was not a metaphor; it was the literal truth. What the hell kind of bank heist could possibly affect an AWS account? This sounds like something out of a movie. Hit me with it.Emily: Yeah, you know, I think in the SRE world, we tell people to focus on the high probability, low impact things because that's where it's going to really hurt your business, and let the experts deal with the black swan events because they're pretty unlikely. You know, a normal business doesn't have to worry about terrorists breaking into the Google data center or a gang of thieves breaking into a bank vault. Apparently, that is something that I have to worry about because I have some data in my personal life that I needed to protect, like all other people. And I decided, like a reasonable and secure and smart human being who has a little bit of extra spending cash that I would do the safer thing and take my backup hard drive and my old phones and put them in a safety deposit box at an old private bank that has, you know, a vault that's behind the meter-and-a-half thick steel door and has two guards all the time, cameras everywhere. And I said, “What is the safest possible thing that you can do to store your backups?” Obviously, you put it in a secure storage location, right? And then, you know, I don't use my AWS account, my personal AWS account so much anymore. I have work accounts. I have test accounts—Corey: Oh, yeah. It's honestly the best way to have an AWS account is just having someone else having a payment instrument attached to it because otherwise oh God, you're on the hook for that yourself and nobody wants that.Emily: Absolutely. 
And you know, creating new email addresses for new trial accounts is really just a pain in the ass. So, you know, I have my phone, you know, from five years ago, sitting in this bank vault and I figured that was pretty secure. Until I got an email [laugh] from the Berlin Polizei saying, “There has been a break-in.” And I went and I looked at the news and apparently, a gang of thieves has pulled off the most epic heist in recent European history.
This is barely in the news. Like, unless you speak German, you're probably not going to find any news about this. But a gang of thieves broke into this bank vault and broke open the safety deposit boxes. And it turns out that this vault was also the location where a luxury watch consigner had been storing his watches. So, they made off with some, like, tens of millions of dollars of luxury watches. And then also the phone that had my 2FA for my Amazon account. So, the total value, you know, potential theft of this was probably somewhere in the $500 million range if they set up a SageMaker instance on my account, perhaps.
Corey: This episode is sponsored in part by Honeycomb. I'm not going to dance around the problem. Your. Engineers. Are. Burned. Out. They're tired from pagers waking them up at 2 am for something that could have waited until after their morning coffee. Ring Ring, Who's There? It's Nagios, the original call of duty! They're fed up with relying on two or three different “monitoring tools” that still require them to manually trudge through logs to decipher what might be wrong. Simply put, there's a better way. Observability tools like Honeycomb (and very little else because they do admittedly set the bar) show you the patterns and outliers of how users experience your code in complex and unpredictable environments so you can spend less time firefighting and more time innovating. It's great for your business, great for your engineers, and, most importantly, great for your customers. 
Try FREE today at honeycomb.io/screaminginthecloud. That's honeycomb.io/screaminginthecloud.Corey: The really annoying part that you are going to kick yourself on about this—and I'm not kidding—is, I've looked up the news articles on this event and it happened, something like two or three days after AWS put out the best release of last years, or any other re:Invent—past, present, future—which is finally allowing multiple MFA devices on root accounts. So finally, we can stop having safes with these things or you can have two devices or you can have multiple people in Covid times out of remote sides of different parts of the world and still get into the thing. But until then, nope. It's either no MFA or you have to store it somewhere ridiculous like that and access becomes a freaking problem in the event that the device is lost, or in this case stolen.Emily: [laugh]. I will just beg the thieves, if you're out there, if you're secretly actually a bunch of cloud engineers who needed to break into a luxury watch consignment storage vault so that you can pay your cloud bills, please have mercy on my poor AWS account. But also I'll tell you that the credit card attached to it is expired so you won't have any luck.Corey: Yeah. Really sad part. Despite having the unexpired credit card, it just means that the charge won't go through. They're still going to hold you responsible for it. It's the worst advice I see people—Emily: [laugh].Corey: Well, intentioned—giving each other on places like Reddit where the other children hang out. And it's, “Oh, just use a prepaid gift card so it can only charge you so much.” It's yeah, and then you get exploited like someone recently was and start accruing $60,000 a day in Lambda charges on an otherwise idle account and Amazon will come after you with a straight face after a week. And, like, “Yes, we'd like our $360,000, please.”Emily: Yes.Corey: “We tried to charge the credit card and wouldn't you know, it expired. 
Could you get on that please? We'd like our money faster if you wouldn't mind.” And then you wind up in absolute hell. Now, credit where due, they in every case I am aware of that is not looking like fraud's close cousin, they have made it right, on some level. But it takes three weeks of back and forth and interminable waiting.And you're sitting there freaking out, especially if you're someone who does not have a spare half-million dollars sitting around. Imagine who—“You sound poor. Have you tried not being that?” And I'm firmly convinced that it a matter of time until someone does something truly tragic because they don't understand that it takes forever, but it will go away. And from my perspective, there's no bigger problem that AWS needs to fix than surprise lifelong earnings bills to some poor freaking student who is just trying to stand up a website as part of a class.Emily: All of the clouds have these missing stairs in them. And it's really easy because they make it—one of the things that a lot of the cloud providers do is they make it really easy for you to spin up things to test them. And they make it really, really hard to find where it is to shut it all down. The data science is awful at this. As a data scientist, I work with a lot of data science tools, and every cloud has, like, the spin up your magical data science computing environment so that your data scientist can, like, bang on the data with you know, high-performance compute for a while.And you know, it's one click of a button and you type in a couple of na—you know, a couple of things name, your service or whatever, name your resource. You click a couple buttons and you spin it up, but behind the scenes, it's setting up a Kubernetes cluster and it's setting up some storage bucket and it's setting up some data pipelines and it's setting up some monitoring stuff and it's setting up a VM in order to run all of this stuff. 
And the next thing that you know, you're burning 100, 200 euro a day, just to, like, to figure out if you can load a CSV into pandas using a Jupyter Notebook. And you're like—when you try to shut it all down, you can't. It's you have to figure, oh, there is a networking thing set up. Well, nobody told me there's a networking thing set up. You know? How do I delete that?Corey: You didn't say please, so here you go. Without for me, it's not even the giant bill going from $4 a month in S3 charges to half a million bucks because that is pretty obvious from the outside just what the hell's been happening. It's the little stuff. I am still—since last summer—waiting for a refund on $260 of ‘because we said so' SageMaker credits because of a change of their billing system, for a 45-minute experiment I had done eight months before that.Emily: Yep.Corey: Wild stuff. Wild stuff. And I have no tolerance for people saying, “Oh, you should just read the pricing page and understand it better.” Yeah, listen, jackhole. I do this for a living. If I can fall victim to it, anyone can. I promise. It is not that I don't know how the billing system works and what to do to avoid unexpected charges.And I'm just luck—because if I hadn't caught it with my systems three days into the month, it would have been a $2,000 surprise. And yeah, I run a company. I can live with that. I wouldn't be happy, but whatever. It is immaterial compared to, you know, payroll.Emily: I think it's kind of a rite of passage, you know, to have the $150 surprise Redshift bill at the end of the month from your personal test account. And it's sad, you know? I think that there's so much better that they can do and that they should do. Sort of as a tangent, one of the challenges that I see in the data space is that it's so hard to break into data because the tooling is so complex and it requires so much extra knowledge, right? 
If you want to become a software developer, you can develop a microservice on your machine, you can build a web app on your machine, you can set up Ruby on Rails, or Flask, or you know, .NET, or whatever you want. And you can do all of that locally.And you can learn everything you need to know about React, or Terraform, or whatever, running locally. You can't do that with data stuff. You can't do that with BigQuery. You can't do that with Redshift. The only way that you can learn this stuff is if you have an account with that setup and you're paying the money to execute on it. And that makes it a really high barrier for entry for anyone to get into this space. It makes it really hard to learn. Because if you want to learn anything by doing, like many of us in the industry have done, it's going to cost you a ton of money just to [BLEEP] around and find out.Corey: Yes. And no one likes the find out part of those stories.Emily: Nobody likes to find out when it comes to your bill.Corey: And to tie it back to the data story of it, it is clearly some form of batch processing because it tries to be an eight-hour consistency model. Yeah, I assume for everything, it's 72. But what that means is that you are significantly far removed from doing a thing and finding out what that thing costs. And that's the direct charges. There's always the oh, I'm going to set things up and it isn't going to screw you over on the bill. You're just planting a beautiful landmine you're going to stumble blindly into in three months when you do something else and didn't realize what that means.And the worst part is it feels victim-blamey. I mean, this is my pro—I guess this is one of the reasons I guess I'm so down on data, even now. It's because I contextualize it in a sense of the AWS bill. No one's happy dealing with that. You ever met a happy accountant? You have not.Emily: Nope. Nope [laugh]. 
Especially when it comes to clouds stuff.Corey: Oh yeah.Emily: Especially these days, when we're all looking to save energy, save money in the cloud.Corey: Ideally, save the planet. Sustainability and saving money align on the axis of ‘turn that shit off.' It's great. We can hope for a brighter tomorrow.Emily: Yep.Corey: I really want to thank you for being so generous with your time. If people want to learn more, where can they find you? Apparently filing police reports after bank heists, which you know, it's a great place to meet people.Emily: Yeah. You know, the largest criminal act in Berlin is certainly a place you want to go to get your cloud advice. You can find me, I have a website. It's my name, emilygorcenski.com.You can find me on Twitter, but I don't really post there anymore. And I'm on Mastodon at some place because Mastodon is weird and kind of a mess. But if you search me, I'm really not that hard to find. My name is harder to spell, but you'll see it in the podcast description.Corey: And we will, of course, put links to all of this in the show notes. Thank you so much for your time. I really appreciate it.Emily: Thank you for having me.Corey: Emily Gorcenski, Data and AI Service Line Lead at ThoughtWorks. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insipid, insulting comment, talking about why data doesn't actually matter at all. And then the comment will disappear into the ether because your podcast platform of choice feels the same way about your crappy comment.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. 
We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
The Evolution of Cloud Services with Richard Hartmann

Screaming in the Cloud

Play Episode Listen Later Oct 18, 2022 45:26


About Richard
Richard "RichiH" Hartmann is the Director of Community at Grafana Labs, Prometheus team member, OpenMetrics founder, OpenTelemetry member, CNCF Technical Advisory Group Observability chair, CNCF Technical Oversight Committee member, CNCF Governing Board member, and more. He also leads, organizes, or helps run various conferences from hundreds to 18,000 attendees, including KubeCon, PromCon, FOSDEM, DENOG, DebConf, and Chaos Communication Congress. In the past, he made mainframe databases work, ISP backbones run, kept the largest IRC network on Earth running, and designed and built a datacenter from scratch. Go through his talks, podcasts, interviews, and articles at https://github.com/RichiH/talks or follow him on Twitter at https://twitter.com/TwitchiH for musings on the intersection of technology and society.
Links Referenced:
Grafana Labs: https://grafana.com/
Twitter: https://twitter.com/TwitchiH
Richard Hartmann list of talks: https://github.com/richih/talks
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. 
This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear. To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.
Corey: This episode is brought to us in part by our friends at Datadog. Datadog's SaaS monitoring and security platform enables full stack observability for developers, IT operations, security, and business teams in the cloud age. Datadog's platform, along with 500 plus vendor integrations, allows you to correlate metrics, traces, logs, and security signals across your applications, infrastructure, and third party services in a single pane of glass.
Combine these with drag and drop dashboards and machine learning based alerts to help teams troubleshoot and collaborate more effectively, prevent downtime, and enhance performance and reliability. Try Datadog in your environment today with a free 14 day trial and get a complimentary T-shirt when you install the agent.
To learn more, visit datadoghq/screaminginthecloud to get. That's www.datadoghq/screaminginthecloud
Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. There are an awful lot of people who are incredibly good at understanding the ins and outs and the intricacies of the observability world. But they didn't have time to come on the show today. Instead, I am talking to my dear friend of two decades now, Richard Hartmann, better known on the internet as RichiH, who is the Director of Community at Grafana Labs, here to suffer—in a somewhat atypical departure for the theme of this show—personal attacks for once. Richie, thank you for joining me.
Richard: And thank you for agreeing on personal attacks.
Corey: Exactly. It was one of your riders. Like, there have to be the personal attacks back and forth or you refuse to appear on the show. You've been on before. 
In fact, the last time we did a recording, I believe you were here in person, which was a long time ago. What have you been up to?
You're still at Grafana Labs. And in many cases, I would point out that, wow, you've been there for many years; that seems to be an atypical thing, which is an American tech industry perspective because every time you and I talk about this, you look at folks who—wow, you were only at that company for five years. What's wrong with you—you tend to take the longer view and I tend to have the fast twitch, time to go ahead and leave jobs because it's been more than 20 minutes approach. I see that you're continuing to live what you preach, though. How's it been?
Richard: Yeah, so there's a little bit of Covid brains, I think. When we talked in 2018, I was still working at SpaceNet, building a data center. But the last two-and-a-half years didn't really happen for many people, myself included. So, I guess [laugh] that includes you.
Corey: No, no you're right. You've only been at Grafana Labs a couple of years. One would think I would check the notes before shooting my mouth off. But then, one wouldn't know me.
Richard: What notes? Anyway, I've been around Prometheus and Grafana since 2015. But it's like, real, full-time everything is 2020. There was something in between. Since 2018, I contracted to do vulnerability handling and everything for Grafana Labs because they had something and they didn't know how to deal with it.
But no, full time is 2020. But as to the space in the [unintelligible 00:02:45] of itself, it's maybe a little bit German of me, but trying to understand the real world and trying to get an overview of systems and how they actually work, and if they are working correctly and as intended, and if not, how they're not working as intended, and how to fix this is something which has always been super important to me, in part because I just want to understand the world. 
And this is a really, really good way to automate understanding of the world. So, it's basically a work-saving mechanism. And that's why I've been sticking to it for so long, I guess.
Corey: Back in the early days of monitoring systems—so we called it monitoring back then because, you know, using simple words that lack nuance was sort of de rigueur back then—we wound up effectively having tools. Nagios is the one that springs to mind, and it was terrible in all the ways you would expect a tool written in janky Perl in the early-2000s to be. But it told you what was going on. It tried to do a thing, generally reach a server or query it about things, and when things fell out of certain specs, it screamed its head off, which meant that when you had things like the core switch melting down—thinking of one very particular incident—you didn't get a Nagios alert; you got 4000 Nagios alerts. But start to finish, you could wrap your head rather fully around what Nagios did and why it did the sometimes strange things that it did.
These days, when you take a look at Prometheus, which we hear a lot about, particularly in the Kubernetes space and Grafana, which is often mentioned in the same breath, it's never been quite clear to me exactly where those start and stop. It always feels like it's a component in a larger system to tell you what's going on rather than a one-stop shop that's going to, you know, shriek its head off when something breaks in the middle of the night. Is that the right way to think about it? The wrong way to think about it?
Richard: It's a way to think about it. So personally, I use the terms monitoring and observability pretty much interchangeably. Observability is a relatively well-defined term, even though most people won't agree. But if you look back into the '70s into control theory where the term is coming from, it is the measure of how much you're able to determine the internal state of a system by looking at its inputs and its outputs. 
Depending on the definition, some people don't include the inputs, but that is the OG definition as far as I'm aware.
And from this, there flow a lot of things. This question of—or this interpretation of the difference between telling that, yes, something's broken versus why something's broken. Or if you can't ask new questions on the fly, it's not observability. Like all of those things are fundamentally mapped to this definition of, I need enough data to determine the internal state of whatever system I have just by looking at what is coming in, what is going out. And that is at the core the thing. Now, obviously, it's become a buzzword, which is oftentimes the fate of successful things. So, it's become a buzzword, and you end up with cargo culting.
Corey: I would argue periodically, that observability is hipster monitoring. If you call it monitoring, you get yelled at by Charity Majors. Which is tongue in cheek, but she has opinions, made, nonetheless shall I say, frustrating by the fact that she is invariably correct in those opinions, which just somehow makes it so much worse. It would be easy to dismiss things she says if she weren't always right. And the world is changing, especially as we get into the world of distributed systems.
Is the server that runs the app working or not working loses meaning when we're talking about distributed systems, when we're talking about containers running on top of Kubernetes, which turns every outage into a murder mystery. We start having distributed applications composed of microservices, so you have no idea necessarily where an issue is. Okay, is this one microservice having an issue related to the request coming into a completely separate microservice? 
And it seems that for those types of applications, the answer has been tracing for a long time now, where originally that was something that felt like it was sprung, fully-formed from the forehead of some God known as one of the hyperscalers, but now is available to basically everyone, in theory.
In practice, it seems that instrumenting applications is still one of the hardest parts of all of this. I tried hooking up one of my own applications to be observed via OTEL, the OpenTelemetry project, and it turns out that right now, OTEL and AWS Lambda have an intersection point that makes everything extremely difficult to work with. It's not there yet; it's not baked yet. And someday, I hope that changes because I would love to interchangeably just throw metrics and traces and logs to all the different observability tools and see which ones work, which ones don't, but that still feels very far away from current state of the art.
Richard: Before we go there, maybe one thing which I don't fully agree with. You said that previously, you were told if a service is up or down, that's the thing which you cared about, and I don't think that's what people actually cared about. At that time, also, what they fundamentally cared about: is the user-facing service up, or down, or impacted? Is it slow? Does it return errors every X percent for requests, something like this?
Corey: Is the site up? And—you're right, I was hand-waving over a whole bunch of things. It was, “Okay. First, the web server is returning a page, yes or no? Great. Can I ping the server?” Okay, well, there are ways a server can crash and still leave enough of the TCP/IP stack up that it can respond to pings and do little else.
And then you start adding things to it. But the Nagios thing that I always wanted to add—and had to—was, is the disk full? And that was annoying. And, on some level, like, why should I care in the modern era how much stuff is on the disk because storage is cheap and free and plentiful? 
The problem is, after the third outage in a month because the disk filled up, you start to not have a good answer for well, why aren't you monitoring whether the disk is full?
And that was the contributors to taking down the server. When the website broke, there were what felt like a relatively small number of reasonably well-understood contributors to that at small to midsize applications, which is what I'm talking about, the only things that people would let me touch. I wasn't running hyperscale stuff where you have a fleet of 10,000 web servers and, “Is the server up?” Yeah, in that scenario, no one cares. But when we're talking about the database server and the two application servers and the four web servers talking to them, you think about it more in terms of pets than you do cattle.
Richard: Yes, absolutely. Yet, I think that was a mistake back then, and I tried to do it differently, as a specific example with the disk. And I'm absolutely agreeing that previous generation tools limit you in how you can actually work with your data. In particular, once you're with metrics where you can do actual math on the data, it doesn't matter if the disk is almost full. It matters if that disk is going to be full within X amount of time.
If that disk is 98% full and it sits there at 98% for ten years and provides the service, no one cares. The thing is, will it actually run out in the next two hours, in the next five hours, what have you. Depending on this, if it is currently or imminently customer-impacting or user-impacting, then yes, alert on it, raise hell, wake people, make them fix it, as opposed to this thing can be dealt with during business hours on the next workday. And you don't have to wake anyone up.
Corey: Yeah. The big filer with massive amounts of storage has crossed the 70% line. Okay, now it's time to start thinking about that, what do you want to do? Maybe it's time to order another shelf of disks for it, which is going to take some time. 
That's a radically different scenario than the 20 gigabyte root volume on your server just started filling up dramatically; the rate of change is such that it'll be full in 20 minutes.

Yeah, one of those is something you want to wake people up for. Generally speaking, you don't want to wake people up for what is fundamentally a longer-term strategic business problem. That can be sorted out in the light of day versus, “[laugh] we're not going to be making money in two hours, so if I don't wake up and fix this now.” That's the kind of thing you generally want to be woken up for. Well, let's be honest, you don't want that to happen at all, but if it does happen, you kind of want to know in advance rather than after the fact.

Richard: You're literally describing linear predict from Prometheus, which is precisely for this, where I can look back over X amount of time and make a linear prediction because everything else breaks down at scale, blah, blah, blah, to detail. But the thing is, I can draw a line with my pencil by hand on my data and I can predict when this thing is going to hit. Which is obviously precisely correct if I have a TLS certificate. It's a little bit more hand-wavy when it's a disk. But still, you can look into the future and you say, “What will be happening if current trends for the last X amount of time continue in Y amount of time.” And that's precisely a thing where you get this more powerful ability of doing math with your data.

Corey: See, when you say it like that, it sounds like it actually is a whole term of art, where you're focusing on an in-depth field, where salaries are astronomical. Whereas the tools that I had to talk about this stuff back in the day made me sound like, effectively, the sysadmin that I was, grunting and pointing: “This is gonna fill up.” And that is how I thought about it.
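[Editor's sketch: the "linear predict" Richard mentions corresponds to the PromQL function `predict_linear`. The Python below is not Prometheus code; it is a minimal illustration of the same idea, a least-squares line through recent samples extrapolated into the future. The function names, the sample data, and the "page when free space reaches zero" rule are all assumptions made for the example.]

```python
from typing import Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, free space in GB)


def predict_linear(samples: Sequence[Sample], horizon_s: float) -> float:
    """Fit a least-squares line through the samples and return the value
    it predicts horizon_s seconds after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be identical")
    slope = cov / var
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + horizon_s) + intercept


def should_page(samples: Sequence[Sample], horizon_s: float) -> bool:
    """Hypothetical alert rule: page only if the trend says the disk
    runs out of free space within the horizon."""
    return predict_linear(samples, horizon_s) <= 0


# A volume that has sat at 2 GB free for an hour: nearly full, but stable.
stuck = [(0.0, 2.0), (1200.0, 2.0), (2400.0, 2.0), (3600.0, 2.0)]
# A volume losing 1 GB of free space per minute.
filling = [(0.0, 10.0), (60.0, 9.0), (120.0, 8.0), (180.0, 7.0)]
```

With these samples, `should_page(stuck, 4 * 3600)` is false even though the disk is almost full, while `should_page(filling, 20 * 60)` is true: the alert follows the trend rather than the current fill level, which is the distinction drawn in the conversation.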
And this is the challenge where it's easy to think about these things in narrow, defined contexts like that, but at scale, things break.

Like the idea of anomaly detection. Well, okay, great if normally, the CPU and these things are super bored and suddenly it gets really busy, that's atypical. Maybe we should look into it, assuming that it has a challenge. The problem is, that is a lot harder than it sounds because there are so many factors that factor into it. And as soon as you have something, quote-unquote, “Intelligent,” making decisions on this, it doesn't take too many false positives before you start ignoring everything it has to say, and missing legitimate things. It's this weird and obnoxious conflation of both hard technical problems and human psychology.

Richard: And the breaking up of old service boundaries. Of course, when you say microservices, and such, fundamentally, functionally a microservice or nanoservice, picoservice—but the pendulum is already swinging back to larger units of complexity—but it fundamentally does not make any difference if I have a monolith on some mainframe or if I have a bunch of microservices. Yes, I can scale differently, I can scale horizontally a lot more easily, vertically, it's a little bit harder, blah, blah, blah, but fundamentally, the logic and the complexity, which is being packaged is fundamentally the same. More users, everything, but it is fundamentally the same. What's happening again, and again, is I'm breaking up those old boundaries, which means the old tools which have assumptions built in about certain aspects of how I can actually get an overview of a system just start breaking down. When my complexity unit or my service or what have you is usually congruent with a physical piece of hardware, or several services are congruent with that piece of hardware, it absolutely makes sense to think about things in terms of this one physical server.
The fact that you have different considerations in cloud, and microservices, and blah, blah, blah, is not inherently that it is more complex.On the contrary, it is fundamentally the same thing. It scales with users' everything, but it is fundamentally the same thing, but I have different boundaries of where I put interfaces onto my complexity, which basically allow me to hide all of this complexity from the downstream users.Corey: That's part of the challenge that I think we're grappling with across this entire industry from start to finish. Where we originally looked at these things and could reason about it because it's the computer and I know how those things work. Well, kind of, but okay, sure. But then we start layering levels of complexity on top of layers of complexity on top of layers of complexity, and suddenly, when things stop working the way that we expect, it can be very challenging to unpack and understand why. One of the ways I got into this whole space was understanding, to some degree, of how system calls work, of how the kernel wound up interacting with userspace, about how Linux systems worked from start to finish. And these days, that isn't particularly necessary most of the time for the care and feeding of applications.The challenge is when things start breaking, suddenly having that in my back pocket to pull out could be extremely handy. But I don't think it's nearly as central as it once was and I don't know that I would necessarily advise someone new to this space to spend a few years as a systems person, digging into a lot of those aspects. And this is why you need to know what inodes are and how they work. Not really, not anymore. It's not front and center the way that it once was, in most environments, at least in the world that I live in. Agree? Disagree?Richard: Agreed. But it's very much unsurprising. 
You probably can't tell me how to precisely grow sugar cane or corn, you can't tell me how to refine the sugar out of it, but you can absolutely bake a cake. But you will not be able to tell me even a third of—and I'm—for the record, I'm also not able to tell you even a third about the supply chain which just goes from I have a field and some seeds and I need to have a package of refined sugar—you're absolutely enabled to do any of this. The thing is, you've been part of the previous generation of infrastructure where you know how this underlying infrastructure works, so you have more ability to reason about this, but it's not needed for cloud services nearly as much.

You need different types of skill sets, but that doesn't mean the old skill set is completely useless, at least not as of right now. It's much more a case of you need fewer of those people and you need them in different places because those things have become infrastructure. Which is basically the cloud play, where a lot of this is just becoming infrastructure more and more.

Corey: Oh, yeah. Back then I distinctly remember my elders looking down their noses at me because I didn't know assembly, and how could I possibly consider myself a competent systems admin if I didn't at least have a working knowledge of assembly? Or at least C, which I, over time, learned enough about to know that I didn't want to be a C programmer. And you're right, this is the value of cloud. Going back to those days, getting a web server up and running just to compile Apache's httpd took a week and an in-depth knowledge of GCC flags.

And then in time, oh, great. We're going to have RPMs or debs. Great, okay, then in time, you have apt, if you're in the deb land because I know you are a Debian developer, but over in Red Hat land, we had yum and other tools. And then in time, it became oh, we can just use something like Puppet or Chef to wind up ensuring that thing is installed. And then oh, just docker run.
And now it's a checkbox in a web console for S3.These things get easier with time and step by step by step we're standing on the shoulders of giants. Even in the last ten years of my career, I used to have a great challenge question that I would interview people with of, “Do you know what TinyURL is? It takes a short URL and then expands it to a longer one. Great, on the whiteboard, tell me how you would implement that.” And you could go up one side and down the other, and then you could add constraints, multiple data centers, now one goes offline, how do you not lose data? Et cetera, et cetera.But these days, there are so many ways to do that using cloud services that it almost becomes trivial. It's okay, multiple data centers, API Gateway, a Lambda, and a global DynamoDB table. Now, what? “Well, now it gets slow. Why is it getting slow?”“Well, in that scenario, probably because of something underlying the cloud provider.” “And so now, you lose an entire AWS region. How do you handle that?” “Seems to me when that happens, the entire internet's kind of broken. Do people really need longer URLs?”And that is a valid answer, in many cases. The question doesn't really work without a whole bunch of additional constraints that make it sound fake. And that's not a weakness. That is the fact that computers and cloud services have never been as accessible as they are now. And that's a win for everyone.Richard: There's one aspect of accessibility which is actually decreasing—or two. A, you need to pay for them on an ongoing basis. And B, you need an internet connection which is suitably fast, low latency, what have you. And those are things which actually do make things harder for a variety of reasons. If I look at our back-end systems—as in Grafana—all of them have single binary modes where you literally compile everything into a single binary and you can run it on your laptop because if you're stuck on a plane, you can't do any work on it. 
That kind of is not the best of situations.

And if you have a huge CI/CD pipeline, everything in this cloud is fine and dandy, but your internet breaks. Yeah, so I do agree that it is becoming generally more accessible. I disagree that it is becoming more accessible along all possible axes.

Corey: I would agree. There is a silver lining to that as well, where yes, they are fraught and dangerous and I would preface this with a whole bunch of warnings, but from a cost perspective, all of the cloud providers do have a free tier offering where you can kick the tires on a lot of these things in return for no money. Surprisingly, the best one of those is Oracle Cloud where they have an unlimited free tier, use whatever you want in this subset of services, and you will never be charged a dime. As opposed to the AWS model of free tier where well, okay, it suddenly got very popular or you misconfigured something, and surprise, you now owe us enough money to buy Belize. That doesn't usually lead to a great customer experience.

But you're right, you can't get away from needing an internet connection of at least some level of stability and throughput in order for a lot of these things to work. The stuff you would do locally on a Raspberry Pi, for example, if you're budget-constrained and want to get something out there, or your laptop. Great, that's not going to work in the same way as a full-on cloud service will.

Richard: It's not free unless you have hard guarantees that you're not going to ever pay anything. It's fine to send warnings, it's fine to switch the thing off, it's fine to have you hit random hard and soft quotas. It is not a free service if you can't guarantee that it is free.

Corey: I agree with you. I think that there needs to be a free offering where, “Well, okay, you want us to suddenly stop serving traffic to the world?” “Yes.
When the alternative is you have to start charging me through the nose, yes I want you to stop serving traffic.” That is definitionally what it says on the tin.

And as an independent learner, that is what I want. Conversely, if I'm an enterprise, yeah, I don't care about money; we're running our Superbowl ad right now, so whatever you do, don't stop serving traffic. Charge us all the money. And there's been a lot of hand wringing about, well, how do we figure out which direction to go in? And it's, have you considered asking the customer?

So, on a scale of one to bank, how serious is this account going to be [laugh]? Like, what are your big concerns: never charge me or never go down? Because we can build for either of those. Just let's make sure that all of those expectations are aligned. Because if you guess, you're going to get it wrong and then no one's going to like you.

Richard: I would argue this. All those services from all cloud providers actually build to address both of those. It's a deliberate choice not to offer certain aspects.

Corey: Absolutely. When I talk to AWS, like, “Yeah, but there is an eventual consistency challenge in the billing system where it takes”—as anyone who's looked at the billing system can see—“Multiple days, sometimes for usage data to show up. So, how would we be able to stop things if the usage starts climbing?” To which my relatively direct response is, that sounds like a huge problem. I don't know how you'd fix that, but I do know that if suddenly you decide, as a matter of policy, to okay, if you're in the free tier, we will not charge you, or even we will not charge you more than $20 a month.

So, you build yourself some headroom, great. And anything that people are able to spin up, well, you're just going to have to eat the cost as a provider. I somehow suspect that would get fixed super quickly if that were the constraint.
The fact that it isn't is a conscious choice.Richard: Absolutely.Corey: And the reason I'm so passionate about this, about the free space, is not because I want to get a bunch of things for free. I assure you I do not. I mean, I spend my life fixing AWS bills and looking at AWS pricing, and my argument is very rarely, “It's too expensive.” It's that the billing dimension is hard to predict or doesn't align with a customer's experience or prices a service out of a bunch of use cases where it'll be great. But very rarely do I just sit here shaking my fist and saying, “It costs too much.”The problem is when you scare the living crap out of a student with a surprise bill that's more than their entire college tuition, even if you waive it a week or so later, do you think they're ever going to be as excited as they once were to go and use cloud services and build things for themselves and see what's possible? I mean, you and I met on IRC 20 years ago because back in those days, the failure mode and the risk financially was extremely low. It's yeah, the biggest concern that I had back then when I was doing some of my Linux experimentation is if I typed the wrong thing, I'm going to break my laptop. And yeah, that happened once or twice, and I've learned not to make those same kinds of mistakes, or put guardrails in so the blast radius was smaller, or use a remote system instead. Yeah, someone else's computer that I can destroy. Wonderful. But that was on we live and we learn as we were coming up. There was never an opportunity for us, to my understanding, to wind up accidentally running up an $8 million charge.Richard: Absolutely. And psychological safety is one of the most important things in what most people do. We are social animals. Without this psychological safety, you're not going to have long-term, self-sustaining groups. You will not make someone really excited about it. There's two basic ways to sell: trust or force. Those are the two ones. 
There's none else.

Corey: Managing shards. Maintenance windows. Overprovisioning. ElastiCache bills. I know, I know. It's a spooky season and you're already shaking. It's time for caching to be simpler. Momento Serverless Cache lets you forget the backend to focus on good code and great user experiences. With true autoscaling and a pay-per-use pricing model, it makes caching easy. No matter your cloud provider, get going for free at gomomento.co/screaming. That's GO M-O-M-E-N-T-O dot co slash screaming.

Corey: Yeah. And it also looks ridiculous. I was talking to someone somewhat recently who's used to spending four bucks a month on their AWS bill for some S3 stuff. Great. Good for them. That's awesome. Their credentials got compromised. Yes, that is on them to some extent. Okay, great.

But now after six days, they were told that they owed $360,000 to AWS. And I don't know how, as a cloud company, you can sit there and ask a student to do that. That is not a realistic thing. They are what is known, in the United States at least, in the world of civil litigation as quote-unquote, “Judgment proof,” which means, great, you could wind up finding that someone owes you $20 billion. Most of the time, they don't have that, so you're not able to recoup it. Yeah, the judgment feels good, but you're never going to see it.

That's the problem with something like that. It's yeah, I would declare bankruptcy long before, as a student, I wound up paying that kind of money. And I don't hear any stories about them releasing the collection agency hounds against people in that scenario. But I couldn't guarantee that. I would never urge someone to ignore that bill and see what happens.

And it's such an off-putting thing that, from my perspective, is beneath the company. And let's be clear, I see this behavior at times on Google Cloud, and I see it on Azure as well. This is not something that is unique to AWS, but they are the 800-pound gorilla in the space, and that's important.
Or as I just want to mention right now, like, as I—because I was about to give you crap for this, too, but if I go to grafana.com, it says, and I quote, “Play around with the Grafana Stack. Experience Grafana for yourself, no registration or installation needed.”

Good. I was about to yell at you if it's, “Oh, just give us your credit card and go ahead and start spinning things up and we won't charge you. Honest.” Even your free account does not require a credit card; you're doing it right. That tells me that I'm not going to get a giant surprise bill.

Richard: You have no idea how much thought and work went into our free offering. There was a lot of math involved.

Corey: None of this is easy, I want to be very clear on that. Pricing is one of the hardest things to get right, especially in cloud. And also, when you get it right, it doesn't look like it was that hard for you to do. But I fix [sigh] people's AWS bills for a living and still, five or six years in, one of the hardest things I still wrestle with is pricing engagements. It's incredibly nuanced, incredibly challenging, and at least for services in the cloud space where you're doing usage-based billing, that becomes a problem.

But glancing at your pricing page, you do hit the two things that are incredibly important to me. The first one is use something for free. As an added bonus, you can use it forever. And I can get started with it right now. Great, when I go and look at your pricing page or I want to use your product and it tells me to ‘click here to contact us.' That tells me it's an enterprise sales cycle, it's got to be really expensive, and I'm not solving my problem tonight.

Whereas the other side of it, the enterprise offering needs to be ‘contact us' and you do that, that speaks to the enterprise procurement people who don't know how to sign a check that doesn't have two commas in it, and they want to have custom terms and all the rest, and they're prepared to pay for that.
If you don't have that, you look too small-time. When it doesn't matter what price you put on it, you wind up offering your enterprise tier at some large number, it's yeah, for some companies, that's a small number. You don't necessarily want to back yourself in, depending upon what the specific needs are. You've gotten that right.

Every common criticism that I have about pricing, you folks have gotten right. And I definitely can pick up on your fingerprints on a lot of this. Because it sounds like a weird thing to say of, “Well, he's the Director of Community, why would he weigh in on pricing?” It's, “I don't think you understand what community is when you ask that question.”

Richard: Yes, I fully agree. It's super important to get pricing right, or to get many things right. And usually the things which just feel naturally correct are the ones which took the most effort and the most time and everything. And yes, at least from the—like, I was in those conversations or part of them, and the one thing which was always clear is when we say it's free, it must be free. When we say it is forever free, it must be forever free. No games, no lies, do what you say and say what you do. Basically.

We have things where initially you get certain pro features and you can keep paying and you can keep using them, or after X amount of time they go away. Things like these are built in because that's what people want. They want to play around with the whole thing and see, hey, is this actually providing me value? Do I want to pay for this feature which is nice or this and that plugin or what have you? And yeah, you're also absolutely right that once you leave these constraints of basically self-serve cloud, you are talking about bespoke deals, but you're also talking about okay, let's sit down, let's actually understand what your business is: what are your business problems? What are you going to solve today?
What are you trying to solve tomorrow?

Let us find a way of actually supporting you and invest into a mutual partnership and not just grab the money and run. We have extremely low churn for, I would say, pretty good reasons. Because this thing about our users, our customers being successful, we do take it extremely seriously.

Corey: It's one of those areas that I just can't shake the feeling is underappreciated industry-wide. And the reason I say that this is your fingerprints on it is because if this had been wrong, you have a lot of… we'll call them idiosyncrasies, where there are certain things you absolutely will not stand for, and misleading people and tricking them into paying money is high on that list. One of the reasons we're friends. So yeah, when I say I see your fingerprints on this, it's yeah, if this hadn't been worked out the way that it is, you would not still be there. One other thing that I wanted to call out about, well, I guess it's a confluence of pricing and logging and the rest, I look at your free tier, and it offers up to 50 gigabytes of ingest a month.

And it's easy for me to sit here and compare that to other services, other tools, and other logging stories, and then I have to stop and think for a minute that yeah, disks have gotten way bigger, and internet connections have gotten way faster, and even the logs have gotten way wordier. I still am not sure that most people can really contextualize just how much logging fits into 50 gigs of data. Do you have any, I guess, ballpark examples of what that looks like? Because it's been long enough since I've been playing in these waters that I can't really contextualize it anymore.

Richard: Lord of the Rings is roughly five megabytes. It's actually less. So, we're talking literally 10,000 Lord of the Rings, which you can just shove into us and we're just storing this for you. Which also tells you that you're not going to be reading any of this. Or some of it, yes, but not all of it.
You need better tooling and you need proper tooling.And some of this is more modern. Some of this is where we actually pushed the state of the art. But I'm also biased. But I, for myself, do claim that we did push the state of the art here. But at the same time you come back to those absolute fundamentals of how humans deal with data.If you look back basically as far as we have writing—literally 6000 years ago, is the oldest writing—humans have always dealt with information with the state of the world in very specific ways. A, is it important enough to even write it down, to even persist it in whatever persistence mechanisms I have at my disposal? If yes, write a detailed account or record a detailed account of whatever the thing is. But it turns out, this is expensive and it's not what you need. So, over time, you optimize towards only taking down key events and only noting key events. Maybe with their interconnections, but fundamentally, the key events.As your data grows, as you have more stuff, as this still is important to your business and keeps being more important to—or doesn't even need to be a business; can be social, can be whatever—whatever thing it is, it becomes expensive, again, to retain all of those key events. So, you turn them into numbers and you can do actual math on them. And that's this path which you've seen again, and again, and again, and again, throughout humanity's history. Literally, as long as we have written records, this has played out again, and again, and again, and again, for every single field which humans actually cared about. At different times, like, power networks are way ahead of this, but fundamentally power networks work on metrics, but for transient load spike, and everything, they have logs built into their power measurement devices, but those are only far in between. Of course, the main thing is just metrics, time-series. 
And you see this again, and again.

You also were a sysadmin: in internet-related fields, all switches have been metrics-based or metrics-first for basically forever, for 20, 30 years. But that stands to reason. Of course, the internet is running roughly 20 years ahead of the cloud scale-wise, because obviously you need the internet, as otherwise you wouldn't be having a cloud. So, all of those growing pains, why metrics are all of a sudden the thing, or have been for a few years now, come down to this: people who were writing software, providing their own software services, hit the scaling limitations which Internet service providers hit two decades, three decades ago. But fundamentally, you have this complete system. Basically profiles or distributed tracing depending on how you view distributed tracing.

You can also argue that distributed tracing is key events which are linked to each other. Logs sit firmly in the key event thing and then you turn this into numbers and that is metrics. And that's basically it. You have extremes at the ends where you can have valid, depending on your circumstances, engineering trade-offs of where you invest the most, but fundamentally, that is why those always appear again in humanity's dealing with data, and observability is no different.

Corey: I take a look at last month's AWS bill. Mine is pretty well optimized. It's a bit over 500 bucks. And right around 150 of that is various forms of logging and detecting change in the environment. And on the one hand, I sit here, and I think, “Oh, I should optimize that,” because the value of those logs to me is zero.

Except that whenever I have to go in and diagnose something or respond to an incident or have some forensic exploration, they then are worth an awful lot. And I am prepared to pay 150 bucks a month for that because the potential value of having that when the time comes is going to be extraordinarily useful.
And it basically just feels like a tax on top of what it is that I'm doing. The same thing happens with application observability where, yeah, when you just want the big substantial stuff, yeah, until you're trying to diagnose something. But in some cases, yeah, okay, then crank up the verbosity and then look for it.But if you're trying to figure it out after an event that isn't likely or hopefully won't recur, you're going to wish that you spent a little bit more on collecting data out of it. You're always going to be wrong, you're always going to be unhappy, on some level.Richard: Ish. You could absolutely be optimizing this. I mean, for $500, it's probably not worth your time unless you take it as an exercise, but outside of due diligence where you need specific logs tied to—or specific events tied to specific times, I would argue that a lot of the problems with logs is just dealing with it wrong. You have this one extreme of full-text indexing everything, and you have this other extreme of a data lake—which is just a euphemism of never looking at the data again—to keep storage vendors happy. There is an in between.Again, I'm biased, but like for example, with Loki, you have those same label sets as you have on your metrics with Prometheus, and you have literally the same, which means you only index that part and you only extract on ingestion time. If you don't have structured logs yet, only put the metadata about whatever you care about extracted and put it into your label set and store this, and that's the only thing you index. But it goes further than just this. You can also turn those logs into metrics.And to me this is a path of optimization. Where previously I logged this and that error. Okay, fine, but it's just a log line telling me it's HTTP 500. No one cares that this is at this precise time. 
Log levels are also basically an anti-pattern because they're just trying to deal with the amount of data which I have, and try and get a handle on this on that level whereas it would be much easier if I just counted every time I have an HTTP 500, I just up my counter by one. And again, and again, and again.And all of a sudden, I have literally—and I did the math on this—over 99.8% of the data which I have to store just goes away. It's just magic the way—and we're only talking about the first time I'm hitting this logline. The second time I'm hitting this logline is functionally free if I turn this into metrics. It becomes cheap enough that one of the mantras which I have, if you need to onboard your developers on modern observability, blah, blah, blah, blah, blah, the whole bells and whistles, usually people have logs, like that's what they have, unless they were from ISPs or power companies, or so; there they usually start with metrics.But most users, which I see both with my Grafana and with my Prometheus [unintelligible 00:38:46] tend to start with logs. They have issues with those logs because they're basically unstructured and useless and you need to first make them useful to some extent. But then you can leverage on this and instead of having a debug statement, just put a counter. Every single time you think, “Hey, maybe I should put a debug statement,” just put a counter instead. In two months time, see if it was worth it or if you delete that line and just remove that counter.It's so much cheaper, you can just throw this on and just have it run for a week or a month or whatever timeframe and done. But it goes beyond this because all of a sudden, if I can turn my logs into metrics properly, I can start rewriting my alerts on those metrics. I can actually persist those metrics and can more aggressively throw my logs away. 
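[Editor's sketch: a minimal illustration of "just put a counter instead", assuming a tiny in-memory stand-in for a metrics client. `Metrics`, `handle_response`, `count_from_logs`, and the metric name `http_responses_total` are hypothetical names chosen for this example, not the API of Prometheus or any real client library. The handler emits the old log line and the new counter side by side so the two can be compared before the log line is removed.]

```python
from collections import Counter
from typing import List


class Metrics:
    """In-memory stand-in for a metrics client (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def inc(self, name: str, **labels: str) -> None:
        # One stored number per (metric name, label set), however often it fires.
        self._counts[(name, tuple(sorted(labels.items())))] += 1

    def value(self, name: str, **labels: str) -> int:
        return self._counts[(name, tuple(sorted(labels.items())))]


def handle_response(status: int, metrics: Metrics, log: List[str]) -> None:
    """Record a server error both ways: as a log line and as a counter."""
    if status >= 500:
        # Old approach: one stored line per event.
        log.append(f"ERROR upstream returned HTTP {status}")
        # New approach: bump a counter for this status code.
        metrics.inc("http_responses_total", code=str(status))


def count_from_logs(log: List[str], status: int) -> int:
    """Derive the same number from the log lines, to check that both agree."""
    return sum(1 for line in log if line.endswith(f"HTTP {status}"))


metrics = Metrics()
log: List[str] = []
for status in [200, 500, 500, 503, 404, 500]:
    handle_response(status, metrics, log)
```

After this loop the log holds four lines and keeps growing with every error, while the counter side holds one number per status code. Once `count_from_logs(log, 500)` and `metrics.value("http_responses_total", code="500")` have matched for long enough, the log line can be dropped with some confidence.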
But also, I have this transition made a lot easier where I don't have this huge lift, where this day in three months is to be cut over and we're going to release the new version of this and that software and it's not going to have that, it's going to have 80% less logs and everything will be great and then you missed the first maintenance window or someone is ill or what have you, and then the next Big Friday is coming so you can't actually deploy there. I mean Black Friday. But we can also talk about deploying on Fridays.But the thing is, you have this huge thing, whereas if you have this as a continuous improvement process, I can just look at, this is the log which is coming out. I turn this into a number, I start emitting metrics directly, and I see that those numbers match. And so, I can just start—I build new stuff, I put it into a new data format, I actually emit the new data format directly from my code instrumentation, and only then do I start removing the instrumentation for the logs. And that allows me to, with full confidence, with psychological safety, just move a lot more quickly, deliver much more quickly, and also cut down on my costs more quickly because I'm just using more efficient data types.Corey: I really want to thank you for spending as much time as you have. If people want to learn more about how you view the world and figure out what other personal attacks they can throw your way, where's the best place for them to find you?Richard: Personal attacks, probably Twitter. It's, like, the go-to place for this kind of thing. For actually tracking, I stopped maintaining my own website. Maybe I'll do again, but if you go on github.com/ritchieh/talks, you'll find a reasonably up-to-date list of all the talks, interviews, presentations, panels, what have you, which I did over the last whatever amount of time. [laugh].Corey: And we will, of course, put links to that in the [show notes 00:41:23]. Thanks again for your time. 
It's always appreciated.

Richard: And thank you.

Corey: Richard Hartmann, Director of Community at Grafana Labs. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment. And then when someone else comes along with an insulting comment they want to add, we'll just increment the counter by one.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
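
Richard's two suggestions in this episode, counting an event instead of logging it and running the old log line and the new counter side by side until the numbers match, can be sketched roughly as follows. This is a minimal illustration in plain Python, not anything shown in the episode: the `metrics` store, the metric name `http_500_total`, the log message format, and all function names are hypothetical, and a real service would more likely use a metrics client library than a `Counter`.

```python
import re
from collections import Counter

# Hypothetical in-process metric store. A real service would more likely use
# a metrics client library; a plain Counter keeps this sketch self-contained.
metrics = Counter()

# Assumed format of the old log line that is being replaced by a counter.
HTTP_500_LINE = re.compile(r"HTTP 500\b")


def record_response(status_code, log_lines):
    """Handle one response during the migration period.

    Both the old log line and the new counter are emitted, so the two
    signals can be compared before the log statement is removed.
    """
    if status_code == 500:
        log_lines.append("ERROR request failed: HTTP 500")  # old instrumentation
        metrics["http_500_total"] += 1                      # new instrumentation


def count_from_logs(log_lines):
    """Derive the same number from the logs, as a cross-check."""
    return sum(1 for line in log_lines if HTTP_500_LINE.search(line))


def counters_match(log_lines):
    """True when the log-derived count equals the directly emitted counter."""
    return count_from_logs(log_lines) == metrics["http_500_total"]


if __name__ == "__main__":
    logs = []
    for status in [200, 500, 404, 500, 500]:
        record_response(status, logs)
    print(metrics["http_500_total"], counters_match(logs))
```

Once the two numbers have agreed for long enough, the `log_lines.append(...)` call is the part that would be deleted, leaving only the counter increment.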

The Lunduke Journal of Technology
Linux, Alternative OS, & Retro Computing News - Oct 9, 2022

The Lunduke Journal of Technology

Oct 10, 2022 29:56


What follows is the most important news for the week! Linux-y news! Retro computer news! Alternative OS news! You know… the stuff that matters!

The Free Software Foundation is 37 years old!

On October 4th, 1985, Richard Stallman founded the Free Software Foundation.

Weird thought: On October 3rd, 1985, the Free Software Foundation didn't exist.

After all these years, it's almost hard to imagine a world where the FSF wasn't around.

A physical, retro-Hard-Drive sound simulator: HDD Clicker

This mad genius got tired of the silence of his flash-based hard drives. He longed for the days when his big, magnetic hard drives made all of those awesome “hard drive noises”.

So he did something about it: He built a small device that makes that noise when his flash drives are accessed.

Check out the video demos he gives. Turn the sound up. Just lovely.

I want four.

Canonical launches Ubuntu Pro as free service for individuals

Canonical is now offering an “Ubuntu Pro” service for individuals… for free.

“Anyone can use Ubuntu Pro for free on up to 5 machines”

And then, naturally, companies and big organizations will need to purchase a subscription plan for the Ubuntu Pro service. Makes sense. And, really, it is a model I quite like: Businesses and Enterprise customers are helping fund the development and support… which directly benefits the individuals. Nice.

The primary purpose of Ubuntu Pro looks to be “ten years” of security updates for the core OS plus “23,000” other packages:

“Ubuntu Pro (currently in public beta) expands our famous ten-year security coverage to an additional 23,000 packages beyond the main operating system. Including Ansible, Apache Tomcat, Apache Zookeeper, Docker, Drupal, Nagios, Node.js, phpMyAdmin, Puppet, PowerDNS, Python 2, Redis, Rust, WordPress, and many more...”

Honestly, this seems like the way to go for folks using Ubuntu. Better support, longer lifespan of updates in the repository… if I were running Ubuntu, I'd probably jump on that.
Especially considering the fact that it's free.

FOCUS ON: Linux
Monitoring

FOCUS ON: Linux

Sep 16, 2022 147:18


Monitoring is an exciting, if rather extensive, topic. Together with Michael "dnsmichi" Friedrich, Philip "log1c" Baier, and Gerd Stolz, we approach the topic in great strides. Alongside an overview of common infrastructure and cloud monitoring tools, we also talk about observability and lose ourselves in numerous anecdotes from the Nagios and Icinga world.

Links for this episode:

Monitoring tools
Nagios: https://www.nagios.org/
Icinga2: https://icinga.com/
Icingaweb2 Unicorn Theme: https://github.com/Mikesch-mp/icingaweb2-theme-unicorn
Icingaweb2 Map Module: https://github.com/nbuchwitz/icingaweb2-module-map
Zabbix: https://www.zabbix.com/
check_mk: https://checkmk.com/
Robotmk: https://www.robotmk.org/

Logs and observability
Splunk Observability Report 2022 - Germany: https://www.linkedin.com/video/event/urn:li:ugcPost:6961374817849593856/
Splunk State of Observability: https://www.splunk.com/de_de/form/state-of-observability.html
Learn Observability mailing list: https://o11y.love/
iX log management article: https://www.heise.de/select/ix/2022/9/2218110285363626884

Runbooks/incident management at GitLab
https://gitlab.com/gitlab-com/runbooks
https://about.gitlab.com/handbook/engineering/infrastructure/incident-management/
opsindev.news newsletter: https://opsindev.news/
Drageekeksi: https://gitlab.com/dnsmichi/atasteofaustria
Regex101: https://regex101.com

Participants
Michael "dnsmichi" Friedrich: https://twitter.com/dnsmichi
Philip "log1c" Baier: https://community.icinga.com/u/log1c

Tool tips
Glances: https://github.com/nicolargo/glances
viddy: https://github.com/sachaos/viddy
moreutils: https://joeyh.name/code/moreutils/

Monitoring plugins
https://github.com/aleex42/netapp-cdot-nagios
https://github.com/bb-Ricardo/check_redfish

Screaming in the Cloud
Trivy and Open Source Communities with Anaïs Urlichs

Screaming in the Cloud

Sep 6, 2022 36:15


About Anaïs

Anaïs is a Developer Advocate at Aqua Security, where she contributes to Aqua's cloud native open source projects. When she is not advocating DevOps best practices, she runs her own YouTube Channel centered around cloud native technologies. Before joining Aqua, Anaïs worked as an SRE at Civo, a cloud native service provider, where she helped enhance the infrastructure for hundreds of tenant clusters. As CNCF ambassador of the year 2021, her passion lies in making tools and platforms more accessible to developers and community members.

Links Referenced:
Aqua Security: https://www.aquasec.com/
Aqua Open Source YouTube channel: https://www.youtube.com/c/AquaSecurityOpenSource
Personal blog: https://anaisurl.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by our friends at AWS AppConfig. Engineers love to solve, and occasionally create, problems. But not when it's an on-call fire-drill at 4 in the morning. Software problems should drive innovation and collaboration, NOT stress, and sleeplessness, and threats of violence. That's why so many developers are realizing the value of AWS AppConfig Feature Flags. Feature Flags let developers push code to production, but hide that feature from customers so that the developers can release their feature when it's ready. This practice allows for safe, fast, and convenient software development. You can seamlessly incorporate AppConfig Feature Flags into your AWS or cloud environment and ship your Features with excitement, not trepidation and fear.
To get started, go to snark.cloud/appconfig. That's snark.cloud/appconfig.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate. Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it's more than just hipster monitoring.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Every once in a while, when I start trying to find guests to chat with me and basically suffer my various slings and arrows on this show, I encounter something that I've never really had the opportunity to explore further. And today's guest leads me in just such a direction. Anaïs is an open-source developer advocate at Aqua Security, and when I was asking her whether or not she wanted to talk about various topics, one of the first things she said was, “Don't ask me much about AWS because I've never used it,” which, oh my God. Anaïs, thank you for joining me. You must be so very happy never to have dealt with the morass of AWS.

Anaïs: [laugh]. Yes, I'm trying my best to stay away from it. [laugh].

Corey: Back when I got into the cloud space, for lack of a better term, AWS was sort of really the only game in town unless you wanted to start really squinting hard at what you define cloud as. I mean yes, I could have gone into Salesforce or something, but I was already sad and angry all the time.
These days, you can very much go all-in on cloud. In fact, you were a CNCF ambassador, if I'm not mistaken. So, you absolutely are in the infrastructure cloud space, but you haven't dealt with AWS. That is just an interesting path. Have you found others who have gone down that same road, or are you sort of the first of a new breed?

Anaïs: I think to find others who are in a similar position or have a similar experience as you do, you first have to talk about your experience, and this is the first time, or maybe the second, that I'm openly [laugh] saying it on something that will be posted live, like, to the internet. Before, I, like, tried to stay away from mentioning it at all, as best I can, because I'm at this point where I'm so far into my cloud-native Kubernetes journey that I feel like I should have had to deal with AWS by now, and I just didn't. And I'm doing my best and I'm very successful in avoiding it. [laugh]. So, that's where I am. Yeah.

Corey: We're sort of on opposite sides of a particular fence because I spend entirely too much time being angry at AWS, but I've never really touched Kubernetes in anger. I mean, I see it in a lot of my customer accounts and I get annoyed at its data transfer bills and other things that it causes in an economic sense, but as far as the care and feeding of a production cluster, back in my SRE days, I had very old-school architectures. It's, “Oh, this is an ancient system, just like grandma used to make,” where we had the entire web tier, then a job applic... or application server tier, and then a database at the end, and everyone knew where everything was. And then containers came out of nowhere, and it seemed like okay, this solves a bunch of problems and introduces a whole bunch more. How do I orchestrate them? How do I ensure that they're healthy?

And then ah, Kubernetes was the answer.
And for a while, it seemed like no matter what the problem was, Kubernetes was going to be the answer because people were evangelizing it pretty hard. And now I see it almost everywhere that I turn. What's your journey been like? How did you get into the weeds of, “You know what I want to do when I grow up? That's right. I want to work on container orchestration systems.” I have a five-year-old. She has never once said that because I don't abuse my children by making them learn how clouds work. How did you wind up doing what you do?

Anaïs: It's funny that you mention that. So, I'm actually of the generation of engineers who don't know anything else but Kubernetes. So, when you mentioned that you used to use something before, I don't really know what that looks like. I know that you can still deploy systems without Kubernetes, but I have no idea how. My journey into the cloud-native space started out of frustration from the previous industry that I was working in.

So, I was working for several years as developer advocate in the open-source blockchain cryptocurrency space and it's highly similar to all of the cliches that you hear online and across the news. And out of this frustration, [laugh] I was looking at alternatives. One of them was either going into game development, into the gaming industry, or the cloud-native space and infrastructure development and deployment. And yeah, that's where I ended up. So, at the end of 2020, I joined a startup in the cloud-native space and started my social media journey.

Corey: One of the things that I found that Kubernetes solved for (and to be clear, Kubernetes really came into its own after I was doing a lot more advisory work and a lot more consulting style activity rather than running my own environments), but there's an entire universe of problems that the modern day engineer never has to think about due to, partially cloud and also Kubernetes as well, which is the idea of hardware or node failure.
I've had the middle-of-the-night drive across Los Angeles in a panic, getting to the data center because the disk array on the primary database had degraded because the drive failed. That doesn't happen anymore. And clouds have mostly solved that. It's okay, drives fail, but yeah, that's the problem for some people who live in Virginia or Oregon. I don't have to think about it myself.

But you do have to worry about instances failing; what if the primary database instance dies? Well, when everything lives in a container then that container gets moved around in a stateless way between things, well great, you really only have to care instead about okay, what if all of my instances die? Or, what if my code is really crappy? To which my question is generally, what do you mean, ‘if?' All of us write crappy code.

That's the nature of the universe. We open-source only the small subset that we are not actively humiliated by, which is, in a lot of ways, what you're focusing on now, over at Aqua Sec, you are an advocate for open-source. One of the most notable projects that comes out of that is Trivy, if I'm pronouncing that correctly.

Anaïs: Yeah, that's correct. Yeah. So, Trivy is our main open-source project. It's an all-in-one cloud-native security scanner. And it's actually... it's focused on misconfiguration issues, so it can help you to build more robust infrastructure definitions and configurations.

So ideally, a lot of the things that you just mentioned won't happen, but it obviously highly depends on so many different factors in the cloud-native space. But definitely misconfigurations are one of those areas that can easily go wrong. And also, not just that your data might cease to exist, but the worst thing or, like, as bad might be that it's completely exposed online.
And there are databases of different exposures where you can see all the kinds of data, of information, from just health data to dating apps, just being available online because the IP address is not protected, right? Things like that. [laugh].

Corey: We all get those emails that start with, “Your security is very important to us,” and I know just based on that opening to an email, that the rest of that email is going to explain how security was not very important to you folks. And it's the apology, “Oops, we have messed up,” email. Now, the whole world of automated security scanners is… well, it's crowded. There are a number of different services out there: the cloud providers themselves offer a bunch of these, and a whole bunch of scareware vendors at the security conferences do as well. Taking a quick glance at Trivy, one of the problems I see with it, from a cloud provider perspective, is that I see nothing that it does that winds up costing extra money on your cloud bill that you then have to pay to the cloud provider, so maybe they'll put a pull request in for that one of these days. But my sarcasm aside, what is it that differentiates Trivy from a bunch of other offerings in various spaces?

Anaïs: So, there are multiple factors. If we're looking from an enterprise perspective, you could be using one of the in-house scanners from any of the cloud providers available, depending which you're using. The thing is, they are not generally going to be the ones who have a dedicated research team that provides the updates based on the vulnerabilities they find across the space. So, with an open-source security scanner or from a dedicated company, you will likely have more up-to-date information in your scans. Also, lots of different companies, they're using Trivy under the hood ultimately, or for their own scans.

I can link a few where you can also find them in a Trivy repository.
But ultimately, a lot of companies rely on Trivy and other open-source security scanners under the hood because they are from dedicated companies. Now, the other part to Trivy and why you might want to consider using Trivy is that in larger teams, you will have different people dealing with different components of your infrastructure, of your deployments, and you could end up having to use multiple different security scanners for all your different components, from your container images that you're using, whether or not they are secure, whether or not they're following best practices that you defined, to your infrastructure-as-code configurations, to your running deployments inside of your cluster, for instance. So, each of those different stages across your lifecycle, from development to runtime, will maybe either need different security scanners, or you could use one security scanner that does it all. So, you could have in a team more knowledge sharing, you could have dedicated people who know how to use the tool and who can help out across a team across the lifecycle, and similar. So, that's one of the components that you might want to consider.

Another thing is how mature is a tool, right? A lot of cloud providers, what they end up doing is they provide you with a solution, but it's decoupled from anything else that you're using. And especially in the cloud-native space, you're heavily reliant on open-source tools, such as for your observability stack, right? Coming from Site Reliability Engineering also myself, I love using metrics and Grafana. And for me, it's anything open-source, from Loki for accessing my logs to Grafana for dashboards, and all their integrations.

I love that and I want to use the same tools that I'm using for everything else, also for my security tools. I don't want to have the metrics for my security tools visualized in a different solution to my reliability metrics for my application, right?
Because that ultimately makes it more difficult to correlate metrics. So, those are, like, some of the factors that you might want to consider when you're choosing a security scanner.

Corey: When you talk about thinking about this from the perspective of an SRE... I mean, this is definitely an artifact of where you come from and how you approach this space. Because in my world, when you have ten web servers, five application servers, and two database servers and you wind up with a problem in production, how do you fix this? Oh, it's easy. You log into one of those nodes and poke around and start doing diagnostics in production. In a containerized world, you generally can't do that, or there's a problem on a container, and by the time you're aware of that, that container hasn't existed for 20 minutes.

So, how do you wind up figuring out what happens? And instrumenting for telemetry and metrics and observability, particularly at scale, becomes way more important than it ever was, for me. I mean, my version of monitoring was always Nagios, which was the original Call of Duty that wakes you up at two in the morning when the hard drive fails. The world has thankfully moved beyond that in a bunch of ways. But it's not first nature for me. It's always, “Oh, yeah, that's right. We have a whole telemetry solution where I can go digging into.” My first attempt is always, oh, how do I get into this thing and poke it with a stick? Sometimes that's helpful, but for modern applications, it really feels like it's not.

Anaïs: Totally. When we're moving to an infrastructure, to an environment where we can deploy multiple times a day, right, and update our application multiple times a day, multiple times a day we can introduce new security issues, or other things can go wrong, right?
So, I want to see, as much as I want to see all of the other failures, I want to see any security-related issues that might be deployed alongside those updates at the same frequency, right?

Corey: The problem that I see across all this stuff, though, is there are a bunch of tools out there that people install, but then don't configure because, “Oh, well, I bought the tool. The end.” I mean, I think it was reported almost ten years ago or so on the big Target breach that they wound up installing some tool (I want to say FireEye, but please don't quote me on that) and it wound up firing off a whole bunch of alerts, and they figured it was just noise, so they turned it all off. And it turned out no, no, this was an actual breach in progress. But people are so used to all the alarms screaming at them, that they don't dig into this.

I mean, one of the original security scanners was Nessus. And I've seen a lot of Nessus reports because for a long time, what a lot of crappy consultancies would do is they would white-label the output of whatever it was that Nessus said and deliver that as the report. So, you'd wind up with 700 pages of quote-unquote, “Security issues.” And you'd have to flip through to figure out that, ah, this supports a somewhat old SSL negotiation protocol, and you're focusing on that instead of the oh, and by the way, the primary database doesn't have a password set. Like, it winds up just obscuring it because there is so much. How does Trivy approach avoiding the information overload problem?

Anaïs: That's a great question because everybody's complaining about vulnerability fatigue, of them scanning their container images and workloads for the first time and seeing maybe even hundreds of vulnerabilities. And one of the things that can be done to counteract that right from the beginning is investing your time into looking at the different flags and configurations that you can do before actually deploying Trivy to, for example, your cluster.
That's one part of it. The other part is, I mentioned earlier, you would use a security scan at different parts of your deployment. So, it's really about integrating scanning not just once you... like, in your production environment, once you've deployed everything, but using it already before and empowering engineers to actually use it on their machines.

Now, they can either decide to do it or not; it's not part of most people's job to do security scanning, but as you move along, the more you do, the more you can reduce the noise and then ultimately, when you deploy Trivy, for example, inside of your cluster, you can do a lot of configuration such as scanning just for critical vulnerabilities, only scanning for vulnerabilities that already have a fix available, and everything else should be ignored. Those are all factors and flags that you can place into Trivy, for instance, and make it easier. Now, with Trivy, you won't have automated PRs and everything out of the box; you would have to set up the actions or, like, the ways to mitigate those vulnerabilities manually by yourself with tools, as well as integrating Trivy with your existing stack, and similar.
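
The noise-reduction idea Anaïs describes here, keeping only critical findings that already have a fix, can be illustrated with a small filter. This is a sketch in plain Python over made-up findings; the field names (`severity`, `fixed_version`), the severity ordering, and the `filter_findings` helper are hypothetical and are not Trivy's output schema. Trivy's CLI exposes comparable filters through its `--severity` and `--ignore-unfixed` options; consult its documentation for exact behavior in your version.

```python
# Severity levels from least to most severe (an assumption for this sketch).
SEVERITY_RANK = {"LOW": 0, "MEDIUM": 1, "HIGH": 2, "CRITICAL": 3}


def filter_findings(findings, min_severity="CRITICAL", only_fixable=True):
    """Keep findings at or above min_severity, optionally only fixable ones."""
    threshold = SEVERITY_RANK[min_severity]
    kept = []
    for finding in findings:
        # Unknown severities rank below everything and are dropped.
        if SEVERITY_RANK.get(finding["severity"], -1) < threshold:
            continue
        # An empty or missing fixed_version means no fix is available yet.
        if only_fixable and not finding.get("fixed_version"):
            continue
        kept.append(finding)
    return kept


if __name__ == "__main__":
    # Made-up findings; the IDs are placeholders, not real advisories.
    sample = [
        {"id": "EXAMPLE-1", "severity": "CRITICAL", "fixed_version": "1.2.3"},
        {"id": "EXAMPLE-2", "severity": "CRITICAL", "fixed_version": ""},
        {"id": "EXAMPLE-3", "severity": "LOW", "fixed_version": "0.9.1"},
    ]
    for finding in filter_findings(sample):
        print(finding["id"])
```

Starting strict and loosening `min_severity` later mirrors the advice in the conversation: a short, actionable list first, with the long tail reviewed once the team has capacity.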
But then obviously, if you want to have something more automated, if you want to have something that does more for you in the background, that's when you want to use an enterprise solution and shift to something like Aqua Security Enterprise Platform that actually provides you with the automated way of mitigating vulnerabilities where you don't have to know much about it and it just gives you the solution and provides you with a PR with the updates that you need in your infrastructure-as-code configurations to mitigate the vulnerability [unintelligible 00:15:52].

Corey: I think that's probably a very fair answer because, let's be serious, when you're running a bank or someone for whom security matters (and yes, yes, I know, security should matter for everyone, but let's be serious), I care a little bit less about the security impact of, for example, I don't know, my Twitter for Pets nonsense, than I do a dating site where people are not out about their orientation or whatnot. Like, there is a world of difference between the security concerns there. “Oh, no, you might be able to shitpost as me if you compromise my lasttweetinaws.com Twitter client that I put out there for folks to use.” Okay, great. That is not the end of the world compared to other stuff.

By the time you're talking about things that are critically important, yeah, you want to spend money on this, and you want to have an actual full-on security team. But open-source tools like this are terrific for folks who are just getting started or they're building something for fun themselves and as it turns out, don't have a full security budget for their weird late-night project. I think that there's a beautiful, I guess, spectrum, as far as what level of investment you can make into security.
And it's nice to see the innovation continue happening in the space.

Anaïs: And you just mentioned that dedicated security companies, they likely have a research team that's deploying honeypots and seeing what happens to them, right? Like, how are attackers using different vulnerabilities and misconfigurations and what can be done to mitigate them. And that ultimately translates into the configurations of the open-source tool as well. So, if you're using, for instance, a security scanner that doesn't have an enterprise company with a research team behind it, then you might have different input into the data of that security scanner than if you do, right? So, these are, like, additional considerations that you might want to take when choosing a scanner. And also that obviously depends on what scanning you want to do, on the size of your company, and similar, right?

Corey: This episode is sponsored in part by our friend EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL: on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications, including Oracle, to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.

Corey: Something that I do find fairly interesting is that you started off, as you say, doing DevRel in the open-source blockchain world, then you went to work as an SRE, and then went back to doing DevRel-style work.
What got you into SRE and what got you out of SRE, other than the obvious having worked in SRE myself and being unhappy all the time? I kid, but what was it that got you into that space and then out of it?

Anaïs: Yeah. Yeah, but no, it's a great question. And it's, I guess, also what shaped my perspective on different tools and, like, the user experience of different tools. But ultimately, I first worked in the cloud-native space for an enterprise tool as developer advocate. And I did not like the experience of working for a paid solution. Doing developer advocacy for it, it felt wrong in a lot of ways. A lot of times you were required to do marketing work in those situations.

And that kind of got me out of developer advocacy into SRE work. And now I was working partially or mainly as SRE, and then on the side, I was doing some presentations in developer advocacy. However, that split didn't quite work, either. And I realized that the value that I add to a project is really the way I convey information, which I can't do if I'm busy fixing the infrastructure, right? I can't convey as much of the information of how the infrastructure has been fixed as I can if I'm working with an engineering team and then doing developer advocacy, solely developer advocacy within the engineering team.

So, how I ultimately got back into developer advocacy was just simply by being reached out to by my manager at Aqua Security, Itay, telling me that he has a role available and asking if I want to join his team. And it was open-source-focused. Given that I started my career for several years working in the open-source space and working with engineers, contributing to open-source tools, it was kind of what I wanted to go back to, what I really enjoy doing. And yeah, that's how that came about [laugh].

Corey: For me, I found that I enjoy aspects of the technology part, but I find I enjoy talking to people way more.
And for me, the gratifying moment that keeps me going, believe it or not, is not necessarily helping giant companies spend slightly less money on another giant company. It's watching people suddenly understand something they didn't before, it's watching the light go on in their eyes. And that's been addictive to me for a long time. I've also found that the best way for me to learn something is to teach someone else.

I mean, the way I learned Git was that I foolishly wound up proposing a talk, “Terrible Ideas in Git” (we'll teach it by counterexample), four months before the talk. And they accepted it, and crap, I'd better learn enough Git to give this talk effectively. I don't recommend this because if you miss the deadline, I checked, they will not move the conference for you. But there really is something to be said for watching someone learn something by way of teaching it to them.

Anaïs: It's actually a common strategy for a lot of developer advocates of making up a talk and then waiting to see whether or not it will get accepted. [laugh]. And once it gets accepted, that's when you start learning the tool and trying to figure it out. Now, it's not a good strategy, obviously, to do that because people can easily tell that you just did that for a conference. And...

Corey: Sounds to me, like, you need to get better at bluffing. I kid.

Anaïs: [laugh].

Corey: I kid. Don't bluff your way through conference talks as a general rule. It tends not to go well. [laugh].

Anaïs: No. It's a bad idea. It's a really bad idea. And so, I ultimately started learning the technologies or, like, the different tools and projects in the cloud-native space. And there are lots, if you look at the CNCF landscape, right? But just trying to talk myself through them on my YouTube channel. So, my early videos on my channel, it's just very much on the go of me looking for the first time at somebody's documentation and not making any sense out of them.

Corey: It's surprising to me how far that gets you.
I mean, I guess I'm always reminded of that Tom Hanks movie from my childhood, Big, where he wakes up—the kid wakes up as an adult one day, goes to work, and bluffs his way into working at a toy company. He's in a management meeting and just they're showing their new toy they're going to put out there and he's, “I don't get it.” Everyone looks at him like how dare you say it? And, “I don't get it. What's fun about this?” Because he's a kid.

And he winds up getting promoted to vice president because wow, someone pointed out the obvious thing. And so often, it feels like using a tool or a product, be it open-source or enterprise, it is clearly something different in my experience of it when I try to use this thing than the person who developed it. And very often it's that I don't see the same things or think of the problem space the same way that the developers did, but also very often—and I don't mean to call anyone in particular out here—it's a symptom of a terrible user interface or user experience.

Anaïs: What you've just said, a lot of times, it's just about saying the thing that nobody dares to say or nobody has thought of before, and that gets you obviously, easier, further [laugh] than repeating what other people have already mentioned, right? And a lot of what you see a lot of times in these—also in open-source projects, but I think more even in closed-source enterprise organizations is that people just repeat whatever everybody else is saying in the room, right? You don't have that as much in the open-source world because you have more input or easier input in public than you do otherwise, but it still happens that I mean, people are highly similar to each other. If you're contributing to the same project, you probably have a similar background, similar expertise, similar interests, and that will get you to think in a similar way.
So, if there's somebody like, like a high school student maybe, somebody just graduated, somebody from a completely different industry who's looking at those tools for the first time, it's like, “Okay, I know what I'm supposed to do, but I don't understand why I should use this tool for that.” And just pointing that out, gets you a response, most of the time. [laugh].

Corey: I use Twitter and use YouTube. And obviously, I bias more for short, pithy comments that are dripping in sarcasm, whereas in a long-form video, you can talk a lot more about what you're seeing. But the problem I have with bad user experience, particularly bad developer experience, is that when it happens to me—and I know at a baseline level, that I am reasonably competent in technical spaces, but when I encounter a bad interface, my immediate instinctive reaction is, “Oh, I'm dumb. And this thing is for smart people.” And that is never, ever true, except maybe with quantum computing. Great, awesome. The Hello World tutorial for that stuff is a PhD from Berkeley. Good luck if you can get into that. But here in the real world where the rest of us play, it's just a bad developer experience, but my instinctive reaction is that there's stuff I don't know, and I'm not good enough to use this thing. And I get very upset about that.

Anaïs: That's one of the things that you want to do with any technical documentation is that the first experience that anybody has, no matter the background, with your tool should be a success experience, right? Like people should look at it, use maybe one command, do one thing, one simple thing, and be like, “Yeah, this makes sense,” or, like, this was fun to do, right? Like, this first positive interaction. And it doesn't have to be complex. And that's what many people I think get wrong, that they try to show off how powerful a tool is, of like, oh, “My God, you can do all those things.
It's so exciting, right?” But [laugh] ultimately, if nobody can use it or if most of the people, 99% of the people who try it for the first time have a bad experience, it makes them feel uncomfortable or any negative emotion, then it's really you're approaching it from the wrong perspective, right?

Corey: That's very apt. I think it's so much of whether people stick with something long enough to learn it and find the sharp edges has to do with what their experience looks like. I mean, back when I was more or less useless when it comes to anything that looked like programming—because I was a sysadmin type—I started contributing to SaltStack. And what was amazing about that was Tom Hatch, the creator of the project had this pattern that he kept up for way too long, where whenever anyone submitted an issue, he said, “Great, well, how about you fix it?” And because we had a patch, like, “Well, I'm not good at programming.” He's like, “That's okay. No one is. Try it and we'll see.”

And he accepted every patch and then immediately, you'd see another patch come in ten minutes later that fixed the problems in your patch. But it was the most welcoming and encouraging experience, and I'm not saying that's a good workflow for an open-source maintainer, but he still remains one of the best humans I know, just from that perspective alone.

Anaïs: That's amazing. I think it's really about pointing out that there are different ways of doing open-source [laugh] and there is no one way to go about it. So, it's really about—I mean, it's about the community, ultimately. That's what it boils down to, of you are dependent, as an open-source project, on the community, so what is the best experience that you can give them? If that's something that you want to and can invest in, then yeah [laugh] that's probably the best outcome for everybody.

Corey: I do have one more question, specifically around things that are more timely.
Now, taking a quick look at Trivy and recent features, it seems like you've just now—now-ish—started supporting cloud scanning as well. Previously, it was effectively, “Oh, this scans configuration and containers. Okay, great.” Now, you're targeting actually scanning cloud providers themselves. What does this change and what brought you to this place, as someone who very happily does not deal with AWS?

Anaïs: Yeah, totally. So, I just started using AWS, specifically to showcase this feature. So, if you look at the Aqua Open Source YouTube channel, you will find several tutorials that show you how to use that feature, among others.

Now, what I mentioned earlier in the podcast already is that Trivy is really versatile, it allows you to scan different aspects of your stack at different stages of your development lifecycle. And that's made possible because Trivy is ultimately using different open-source projects under the hood. For example, if you want to scan your infrastructure-as-code misconfigurations, it's using a tool called tfsec, specifically for Terraform. And then other tools for other scanning, for other security scanning. Now, we have—or had; it's going to be probably deprecated—a tool called CloudSploit in the Aqua open-source project suite.

Now, that's going to, kind of like, the functionality that CloudSploit was providing is going to get converted to become part of Trivy, so everything scanning-related is going to become part of Trivy that really, like, once you understand how Trivy works and all of the CLI commands in Trivy have exactly the same structure, it's really easy to scan from container images to infrastructure-as-code, to generating SBOMs to scanning also now, your cloud infrastructure and Trivy can scan any of your AWS services for misconfigurations, and it's using basically the AWS client under the hood to connect with the services of everything you have set up there, and then give you the list of misconfigurations.
And once it has done the scan, you can then drill down further into the different aspects of your misconfigurations without performing the entire scan again, since you likely have lots and lots of resources, so you wouldn't want to scan them every time again, right, when you perform the scan. So, once something has been scanned, Trivy will know whether the resource changed or not, it won't scan it again. That's the same way that in-cluster scanning works right now. Once a container image has been scanned for vulnerabilities, it won't scan the same container image again because that would just waste time. [laugh]. So yeah, do check it out. It's our most recent feature, and it's going to come out also to the other cloud providers out there. But we're starting with AWS and this kind of forced me to finally [laugh] look at it for the sake of it. But I'm not going to be happy. [laugh].

Corey: No, I don't think anyone is. It's every time I see on a resume that someone says, “Oh, I'm an expert in AWS,” it's, “No you're not.” They have 400-some-odd services now. We have crossed the point long ago, where I can very convincingly talk about AWS services that do not exist to Amazonians and not get called out for it because who in the world knows what they run? And half of their services sound like something I made up to be funny, but they're very real. It's wild to me that it is as sprawling as it is and apparently continues to work as a viable business.

But no one knows all of it and everyone feels confused, lost, and overwhelmed every time they look at the AWS console. This has been my entire career in life for the last six years, and I still feel that way. So, I'm sure everyone else does, too.

Anaïs: And this is how misconfigurations happen, right? You're confused about what you're actually supposed to do and how you're supposed to do it.
And that's, for example, with all the access rights in Google Cloud, something that I'm very familiar with, that completely overwhelms you and you get super frustrated by, and you don't even know what you give access to. It's like, if you've ever had to configure Discord user roles, it's a similar disaster. You will not know which user has access to which. They kind of changed it and try to improve it over the past year, but it's a similar issue that you face in cloud providers, just on a much larger-scale, not just on one chat channel. [laugh]. So.

Corey: I think that is probably a fair place to leave it. I really want to thank you for spending as much time with me as you have talking about the trials and travails of, well, this industry, for lack of a better term. If people want to learn more, where's the best place to find you?

Anaïs: So, I have a weekly DevOps newsletter on my blog, which is anaisurl—like, how you spell U-R-L—and then dot com. anaisurl.com. That's where I have all the links to my different channels, to all of the resources that are published where you can find out more as well. So, that's probably the best place. Yeah.

Corey: And we will, of course, put a link to that in the show notes. I really want to thank you for being as generous with your time as you have been. Thank you.

Anaïs: Thank you for having me. It was great.

Corey: Anaïs, open-source developer advocate at Aqua Security. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that I will never see because it's buried under a whole bunch of minor or false-positive vulnerability reports.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group.
We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Serverless Should be Simple with Tomasz Łakomy

May 10, 2022 38:43


About Tomasz

Tomasz is a Frontend Engineer at Stedi, Co-Founder/Head of React at Cloudash, egghead.io instructor with over 200 lessons published, a tech speaker, an AWS Community Hero and a lifelong learner.

Links Referenced:
Cloudash: https://cloudash.dev/
Twitter: https://twitter.com/tlakomy

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate. Is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other; which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at honeycomb.io/screaminginthecloud. Observability: it's more than just hipster monitoring.

Corey: This episode is sponsored in part by our friends at ChaosSearch. You could run Elasticsearch or Elastic Cloud—or OpenSearch as they're calling it now—or a self-hosted ELK stack. But why? ChaosSearch gives you the same API you've come to know and tolerate, along with unlimited data retention and no data movement. Just throw your data into S3 and proceed from there as you would expect.
This is great for IT operations folks, for app performance monitoring, cybersecurity. If you're using Elasticsearch, consider not running Elasticsearch. They're also available now in the AWS marketplace if you'd prefer not to go direct and have half of whatever you pay them count towards your EDP commitment. Discover what companies like Equifax, Armor Security, and Blackboard already have. To learn more, visit chaossearch.io and tell them I sent you just so you can see them facepalm, yet again.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. It's always a pleasure to talk to people who ask the bold questions. One of those great bold questions is, what if CloudWatch's web page didn't suck? It's a good question. It's one I ask myself all the time.

And then I stumbled across a product that wound up solving this for me, and I'm a happy customer. To be clear, they're not sponsoring anything that I do, nor should they. It's one of those bootstrapped, exciting software projects called Cloudash. Today, I'm joined by the Head of React at Cloudash, Tomasz Łakomy. Tomasz, thank you for joining me.

Tomasz: It's a pleasure to be here.

Corey: So, where did this entire idea come from? Because I sit and I get upset every time I have to go into the CloudWatch dashboard because first, something's broken. In an ideal scenario, I don't have to care about monitoring or observability or anything like that. But then it's quickly overshadowed by the fact that this interface is terrible. And the reason I know it's terrible is that every time I'm in there, I feel dumb.

My belief is—for the longest time, I thought that was a problem with me. But no, invariably, when you wind up working with something and consistently finding it a bad—you don't know enough to solve for it, it's not you. It is, in fact, the signs of a poorly designed experience, start to finish. “You should be smarter to use this tool,” is very rarely correct.
And there are a bunch of observability tools and monitoring tools for serverless things that have made sense over the years and made this easier, but one of the most—and please don't take this the wrong way—stripped-down, bare-essentials, just-the-facts styles of presentation is Cloudash. It's why I continue to pay for it every month with a smile on my face. How did you get here from there?

Tomasz: Yeah, that's a good question. I would say that Cloudash was born out of a desire for simple things to be simple. So, as you mentioned, Cloudash is basically the monitoring and troubleshooting tool for serverless applications, made for serverless developers because I am very much into serverless space, as is Maciej Winnicki, who is the other half of the Cloudash team. And, you know, the whole premise of serverless was things are going to be simpler, right?

So, you know, you have a bunch of code, you're going to dump it into a Lambda function, and that's it. You don't have to care about servers, you don't have to care about, you know, provisioning stuff, you don't have to care about maintenance, and so on. And that is not exactly true because why PagerDuty still continues to be [unintelligible 00:02:56] business even in serverless spaces. So, you will get paged every now and then. The problem is—what we kind of found is once you have an incident—you know, PagerDuty always tends to call it in the middle of the night; it's never, like, 11 a.m. during the workday; it's always the middle of the night.

Corey: And no one's ever happy when it calls them either. It's, “Ah, hell.” Whenever it rings, it's yeah, the original Call of Duty. PagerDuty hooked up to Nagios. I am old enough to remember those days.

Tomasz: [unintelligible 00:03:24] then business, like, imagine paying for something that's going to wake you up in the middle of the night. It doesn't make sense. In any case—

Corey: “So, why do you pay for that product?
Because it's really going to piss me off.” “Okay, well… does that sound like a good business to you? Well, AWS seems to think so. No one's happy working with that stuff.” “Fair. Fair enough.”

Tomasz: So, in any case, like we've established an [unintelligible 00:03:43]. So you wake up, you go to AWS console because you saw a notification that this-and-this API has, you know, this threshold was above it, something was above the threshold. And then you go to the CloudWatch console. And then you see, okay, those are the logs, those are the metrics. I'm going to copy this request ID. I'm going to go over here. I'm going to go to X-Ray.

And again, it's 3 a.m. so you don't exactly remember what do you investigate; you have, like, ten minutes. And this is a problem. Like, we've kind of identified that it's not simple to do these kinds of things, too—it's not simple to open something and have an understanding, okay, what exactly is happening in my serverless app at this very moment? Like, what's going on?

So, we've built that. So, Cloudash is a desktop app; it lives on your machine, which is a single pane of glass. It's a single pane of glass view into your serverless system. So, if you are using CloudFormation in order to provision something, when you open Cloudash, you're going to see, you know, all of the metrics, all the Lambda functions, all of the API Gateways that you have provisioned. As of yesterday, API Gateway is no longer cool because they did launch the direct integration, so you have—you can call Lambda functions with [crosstalk 00:04:57]—

Corey: Yeah, it's the one they released, and then rolled back and somehow never said a word—because that's an AWS messaging story, and then some—right around re:Invent last year. And another quarter goes by and out it goes.

Tomasz: It's out yesterday.

Corey: Yeah, it's terrific. I love that thing. The only downside to it is, ah, you have to use one of their—you have to use their domain; no custom domain support. Really?
Well, you can hook up CloudFront to it, but the pricing model that way makes it more expensive than API Gateway.

Okay, so I could use Cloudflare in front of it, and then it becomes free, so I bought a domain just for that purpose. That's right, my serverl—my direct Lambda URLs now live behind the glorious domain of cheapass.cloud because of course they are. It's a day-one product from AWS, so of course, it's not feature-complete.

But one of the things I like about the serverless model, and it's also a challenge when it comes to troubleshooting stuff is that it's very much set-it-and-forget-it style because serverless in many cases, at least the way that I tend to use it, is back-office stuff, it's back-end things, it's processing on things that are not necessarily always direct front and center. So, these things can run on their own for years until finally, you find a strange bug in a new use case, or you want to go and change something. And then it's how the hell did this ever work? And it's still working, kind of, but what fool built this? Of course, it was me; it's always me.

But what happened here? You're basically excavating your own legacy code, trying to understand what's going on. And so, you're already upset then. Cloudash makes this easier to find the things, to navigate through a whole bunch of different accounts. And there are a bunch of decisions that you made while building the app that are so clearly correct, that I get actively annoyed when others don't because oh, it looks at your AWS configuration file in your user home directory. Great, awesome. It's a desktop app, but it still consults that file. Yay, integration between ClickOps and the terminal. Wonderful.

But ah, I use SSO for a lot of stuff, so that's going to fix your little red wagon. I click on that app, and suddenly, bam, a browser opens asking me to log in and authenticate, allow the request. It works, and then suddenly, it goes back to doing exactly what you'd expect it to. It's really nice.
The affordances behind this are glorious.

Tomasz: Like I said, one of our kind of design goals when building Cloudash was to make simple things simple again. The whole purpose is to make sure that you can get into the root cause of an issue within, like, five minutes, if not less. And this is kind of the app that you're going to tend to open whenever that—as I said, because some of the systems can be around for, like, ages, literally without any incident whatsoever, then the data is going to change because somebody [unintelligible 00:07:30] got that the year is 2020 and off you go, we have an incident.

But what's important about Cloudash is that we don't send logs anywhere. And that's kind of important because you don't pay for [PUT 00:07:42] metric API because we are not sending those logs anywhere. If you install Cloudash on your machine, we are not going to get your logs from the last ten years, put them into a system, charge you for that, just so you are able to, you know, find out what happened in this particular hour, like, two weeks ago. We genuinely don't care about your logs; we have enough of our own logs at work to, you know, to analyze, to investigate, and so on; we are not storing them anywhere.

In fact, you know, whatever happens on your machine stays on the machine. And that is partially why this is a desktop app. Because we don't want to handle your credentials. We don't—absolutely, we don't want you to give us any of your credentials or access keys, you know, whatever. We don't want that.

So, that is why you install Cloudash, it's going to run on your machine, it's going to use your local credentials. So, it's… effectively, you could say that this is a much more streamlined and much more laser-focused browser or like, an eye into AWS systems, which live on the serverless side of things.

Corey: I got to deal with it in a bit of an interesting way, recently. I have a detector in my company's production AWS org, to detect when ClickOps is afoot.
Now, I'm a big proponent of ClickOps, but I also want to know what's going on, so I have a whole thing that [runs detects 00:09:04] when people are doing things in the console versus via API. And it alerts on certain subsets of them. I had to build a special case for the user agent string coming out of Cloudash because no, no, this is an app, this is not technically ClickOps—it is also read-only, which is neither here nor there, to my understanding.

But it was, “Oh yeah, this is effectively an Electron app.” It just wraps, effectively, a browser and presents that as an application. And cool. From my perspective, that's an implementation detail. It feels like a native app—because it is—and I can suddenly see the things I care about in a way that is much more straightforward without having to have four different browser tabs open where, okay, here's the CloudTrail log for this thing, here's the metrics next to it. Oh, those are two separate windows already, and so on and so forth. It just makes hunting down the obnoxious problems so much nicer.

It's also, you're one of those rare products where if I don't use it for a month, I don't get the bill at the end of the month and think, “Ooh, that's going to—did I waste the money?” It's no, nice. I had a whole month where I didn't have to mess with this. It's great.

Tomasz: Exactly. I feel like, you know, it's one of those systems where, as you said, we send you an email at the end of every month that we're going to charge you X dollars for the month—by the way, we have fixed pricing and then you can cancel anytime—and it's like one of those things that, you know, I didn't have to open this up for a month. This is awesome because I didn't have any incidents. But I know whenever again, PagerDuty is going to decide, “Hey, dude, wake up. You've slept for three hours.
That is definitely long enough,” then you know that; you know, this app is there and you can use that.

We very much care about, you know, building this stuff, not only for our customers, but we also use that on a daily basis. In fact, I… every single time that I have to—I want to investigate something in, like, our serverless systems at Stedi because everything that we do at work, at Stedi, sits in this serverless paradigm. So, I tend to open Cloudash, like, 95% of the time whenever I want to investigate something. And whenever I am not able to do something in Cloudash, this goes, like, straight to the top of our, you know, issue lists or backlog or whatever you want to call it. Because we want to make this product, not only awesome, you know, for customers to buy a [unintelligible 00:11:22] or whatever, but we also want to be able to use that on a daily basis.

And so far, I think we've kind of succeeded. But then again, we have quite a long way to go because we have more ideas than we have the time, definitely, so we have to kind of prioritize what exactly we're going to build. So, [unintelligible 00:11:39] integrations with alarms. So, for instance, we want to be able to see the alarms directly in the Cloudash UI. Secondly, integration with logs insights, and many other ideas. I could probably talk for hours about what we want to build.

Corey: I also want to point out that this is still your side gig. You are by day a front-end engineer over at Stedi, which has a borderline disturbing number of engineers with side gigs, generally in the serverless space, doing interesting things like this. Dynobase is another example, a DynamoDB desktop client; very similar in some respects. I pay for that too.
Honestly, for a company in Stedi's space, which is designed as basically a giant API for deep, large enterprise business stuff, there's an awful lot of stuff for small-scale coming out of that.

Like, I wind up throwing a disturbing amount of money in the general direction of Stedi for not being their customer. But there's something about the culture that you folks have built over there that's just phenomenal.

Tomasz: Yeah. For the record, you know, having a side gig is not a part of the interview process at Stedi. You don't have to have [laugh] a side project, but yeah, you're absolutely right, you know, the amount of kind of side projects, and you know, some of those are monetized, as you mentioned, you know, Cloudash and Dynobase and others. Some of those—because for instance, you talked to Aidan, I think a couple of weeks ago about his shenanigans, whenever you know, AWS is going to announce something he gets in and tries to [unintelligible 00:13:06] this in the most amusing ways possible. Yeah, I mean, I could probably talk for ages about why Stedi is by far the best company I've ever worked at, but I'm going to say this: that this is the most talented group of people I've ever met, and myself, honestly.

And, you know, the fact that I think we are the second largest, kind of, group of AWS experts outside of AWS because the density of AWS Heroes, or ex-AWS employees, or people who have been doing cloud stuff for years, is frankly, massive. I tend to learn something new about cloud every single day. And not only because of the Last Week in AWS but also from our Slack.

Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of “Hello, World” demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it's actually free.
There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free? This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free. That's snark.cloud/oci-free.

Corey: There's something to be said for having colleagues that you learn from. I have never enjoyed environments where I did not actively feel like the dumbest person in the room. That's why I love what I do now. I inherently am. I have to talk about so many different things, that whenever I talk to a subject matter expert, it is a certainty that they know more about the thing than I do, with the admitted and depressing exception of course of the AWS bill because it turns out the reason I had to start becoming the expert in that was because there weren't any. And here we are now.

I want to talk as well about some of—your interaction outside of work with AWS. For example, you've been an Egghead instructor for a while with over 200 lessons that you published. You're an AWS Community Hero, which means you have the notable distinction of volunteering for a for-profit company—good work—no, the community is very important. It's helping each other make sense of the nonsense coming out of there. You've been involved within the ecosystem for a very long time.
What is it about, I guess—the thing I'm wondering about myself sometimes—what is it about the AWS universe that drew you in, and what keeps you here?

Tomasz: So, to give you some context, I've started, you know, learning about the cloud and AWS back in early-2019. So, fun fact: Maciej Winnicki—again, the co-founder of Cloudash—was my manager at the time. So, we were—I mean, the company I used to work for at the time, OLX Group, we were in the middle of cloud transformation, so to speak. So, going from, you know, on-premises to AWS. And I was, you know, hired as a senior front-end engineer doing, you know, all kinds of front-end stuff, but I wanted to grow, I wanted to learn more.

So, the idea was, okay, maybe you can get AWS Certified because, you know, it's one of those corporate goals that you have to have something to put that checkbox next to it. So, you know, getting certified, there you go, you have a checkbox. And off you go. So, I started, you know, diving in, and I saw this whole ocean of things that, you know, I was not entirely aware of. To be fair, at the time I knew about this S3, I knew that you can put a file in an S3 bucket and then you can access it from the internet. This is, like, the [unintelligible 00:16:02] idea of my AWS experiences.

Corey: Ideally, intentionally, but one wonders sometimes.

Tomasz: Yeah, exactly. That is why you always put stuff as public, right? Because you didn't have to worry about who [unintelligible 00:16:12] [laugh] public [unintelligible 00:16:15]. No, I'm kidding, of course. But still, I think what's [unintelligible 00:16:20] to AWS is what—because it is this endless ocean of things to learn and things to play with, and, you know, things to teach.

I do enjoy teaching. As you said, I have quite a lot of, you know, content, videos, blog posts, conference talks, and a bunch of other stuff, and I do that for two reasons.
You know, first of all, I tend to learn the best by teaching, so it helps me very much, kind of like, solidify my own knowledge. Whenever I record—like, I have two courses about CDK, you know, when I was recording those, I definitely—that kind of solidified my, you know, ideas about CDK, I get to play with all those technologies.

And secondly, you know, it's helpful for others. And, you know, people have opinions about certificates, and so on and so forth, but I think that for somebody who's trying to get into either the tech industry or, you know, cloud stuff in general, being certified helps massively. And I've heard stories about people who basically managed to double or triple their salaries by going into tech, you know, with some of those certificates. That is why I strongly believe, by the way, that those certificates should be free. Like, if you can pass the exam, you shouldn't have to worry about this $150 fee.

Corey: I wrote a blog post a while back, “The Dumbest Dollars a Cloud Provider Can Make,” and it's charging for training and certification because if someone's going to invest that kind of time in learning your platform, you're going to try and make $150 off them? Which in some cases, is going to put people off from even beginning that process. “What cloud provider am I going to build a project on?” Obviously, the one I know how to work with and have a familiarity with, in almost every case. And the things you learn in your spare time as an independent learner, when you get a job, you tend to think about your work the same way. It matters. It's an early on-ramp that pays off down the road over the term of years.

I used to be very anti-cert personally because it felt like I was jumping through hoops, and paying, in some cases, for the privilege. I had a CCNA for a while from Cisco. There were a couple of smaller companies, SaltStack, for example, that I got various certifications from at different times.
And that was sort of cheating because I helped write the software, but that's neither here nor there. It's the—and I do have a standing AWS cert that I get a different one every time—mine is about to expire—because it gets me access to lounges at physical events, which is the dumbest of all reasons to get certs, but here you go. I view it as the $150 lounge pass with a really weird entrance questionnaire.

But in my case, certs don't add anything to what I do. I am not the common case. I am not early in my career. Because as you progress through your career, things—there needs to be a piece of paper that says you know things, and early on, degrees or certifications are great at that. In time, it becomes your own list of experience on your resume or CV or LinkedIn or God knows what. Polywork if you're doing it the right way these days.

And it shows a history of projects that are similar in scope and scale and impact to the kinds of problems that your prospective employer is going to have to solve themselves. Because the best answer to hear—especially in the ops world—when there's a problem is, “Oh, I've seen this before. Here's how you fix it.” As opposed to, “Well, I don't know. Let me do some research.”

There's value to that. And I don't begrudge anyone getting certs… to a point. At least that's where I sit on it. At some point when you have 25 certs, it's, when do you actually do any work? Because it's taking the tests and learning all of these things, which in many ways does boil down to trivia, it stands in counterbalance to a lot of these things.

Tomasz: Yeah. I mean, I definitely, totally agree. I remember, you know, going from zero to—maybe not Hero; I'm not talking about AWS Hero—but going from zero to be certified, there was the Solutions Architect Associate. I think it took me, like, 200 hours.
I am not the, you know, the brightest, you know, the sharpest tool in the shed, so it probably took me, kind of, somewhat more.

I think it's doable in, like, 100 hours, but I tend to over-prepare for stuff, so I didn't actually take the actual exam until I was able to pass the sample exams with, like, 90% pass, just to be extra sure that I'm actually going to pass it. But still, I think that, you know, at some point, you probably should focus on, you know, getting into the actual stuff because I hold two certificates, you know, one of those is going to expire, and I'm not entirely sure if I want to go through the process again. But still, if AWS were to introduce, like, a serverless specialty exam, I would be more than happy to have that. I genuinely enjoy, kind of, serverless, and you know, the fact that I would be able to solidify my knowledge, I have this kind of established path of the things that I should learn about in order to get this particular certificate, I think this could be interesting. But I am not probably going to chase all the 12 certificates.

Maybe if AWS IQ was available in Poland, maybe that would change because I do know that with IQ, those certs do matter. But as of [unintelligible 00:21:26] now, I'm quite happy with my certs that I have right now.

Corey: Part of the problem, too, is the more you work with these things, the harder it becomes to pass the exams, which sounds weird and counterintuitive, but let me use myself as an example. When I got the cloud practitioner cert, which I believe has lapsed since then, and I got one of the new associate-level betas—I'll keep moving up the stack until I start failing exams. But I got a question wrong on the cloud practitioner because it was, “How long does it take to restore an RDS database from a snapshot backup?” And I gave the honest answer of what I've seen rather than what it says in the book, and that honest answer can be measured in days or hours. Yeah.

And no, that's not the correct answer.
Yeah, but it is the real one. Similarly, a lot of the questions get around trivia, syntax of which of these is the correct argument, and which ones did we make up? I can explain, in some level of detail, virtually every one of AWS's 300-some-odd services to you. Ask me about any of them, I could tell you what it is, how it works, how it's supposed to work and make a dumb joke about it. Fine, whatever.

You'll forgive me if I went down that path, instead of memorizing what is the actual syntax of this YAML construct inside of a CloudFormation template? Yeah, I can get the answer to that question in the real world, with about ten seconds of Googling and we move on. That's the way most of us learn. It's not cramming trivia into our heads. There's something broken about the way that we do certifications, and tech interviews in many cases as well.

I look back at some of the questions I used to ask people for Linux sysadmin-style jobs, and I don't remember the answer to a lot of these things. I could definitely get back into it, but if I went through one of these interviews now, I wouldn't get the job. One would argue I shouldn't because of my personality, but that's neither here nor there.

Tomasz: [laugh]. I mean, that's why you use CDK, so you don't have to remember random YAML comments. And if you [unintelligible 00:23:26] you don't have YAML anymore. [unintelligible 00:23:27].

Corey: Yes, you're quite the CDK fanboy, apparently.

Tomasz: I do like CDK, yes. I don't like, you know, mental overhead, I don't like context switching, and the way we kind of work at Stedi is everything is written in TypeScript. So, I am a front-end engineer, so I do stuff in the front-end line in TypeScript, all of our Lambda functions are written in TypeScript, and our [unintelligible 00:23:48] is written in TypeScript. So, I can, you know, open up my Visual Studio Code and jump between all of those files, and the language stays the same, the syntax stays the same, the tools stay the same.
And I think this is one of the benefits of CDK that is kind of hard to replicate otherwise.

And, you know, people have many opinions about the best way to deploy infrastructure in the cloud, you know? The best infrastructure-as-code tool is the one that you use at work or in your private projects, right? Because some people enjoy ClickOps like you do; people—

Corey: Oh yeah.

Tomasz: Enjoy CloudFormation by hand, which I don't; people are very much into Terraform or Serverless Framework. I'm very much into CDK.

Corey: Or the SAM CLI, like, three or four more, and I use—

Tomasz: Oh, yeah. [unintelligible 00:24:33]—

Corey: —all of these things in various ways in some of my [monstrous 00:24:35] projects to keep up on all these things. I did an exploration with the CDK. Incidentally, I think you just answered why I don't like it.

Tomasz: Because?

Corey: Because it is very clear that TypeScript is a first-class citizen with the CDK. My language of choice is shitty bash because, grumpy old sysadmin; it happens. And increasingly, that is switching over to terrible Python because I'm very bad at that. And the problem that I run into as I was experimenting with this is, it feels like the Python support is not fully baked, and most people who are using the CDK are using a flavor of JavaScript. And, let's be very clear here, every time I have tried to explore front-end, I have come away more confused than I was when I started. Part of me really thinks I should be learning some JavaScript just because of its versatility and utility to a whole bunch of different problems. But it does not work the way I think, on some level, that it should because of my own biases and experiences. So, if you're not a JavaScript person, I think that you have a much rockier road with the CDK.

Tomasz: I agree. Like I said, I tend to talk about my own experiences and my kind of thoughts about stuff. I'm not going to say that, you know, this tool or that tool is the best tool ever because nothing like that exists.
Apart from jQuery, which is the best thing that ever happened to the web since, you know, baked bread, honestly. But you are right about CDK. To the best of my knowledge, kind of, all the other languages that are supported by CDK are effectively transpiled down from TypeScript. So it's, like, first of all, it is written in TypeScript, and then kind of the Python, all of the other languages… kind of come second.

You know, and afterwards, I tend to enjoy CDK because as I said, I use TypeScript on a daily basis. And you know, with regards to front-end, you mentioned that every single time you try it, you end up being more confused. It never goes away. I've been doing front-end stuff for years, and it's, you know, kind of exactly the same. Fun story, I actually joined Cloudash because, well, Maciej started working on Cloudash alone, and after quite some time, he was so frustrated with the modern front-end landscape that he asked me, “Dude, you need to help me. Like, I genuinely need some help. I am tired of React. I am tired of React hooks. This is way too complex. I want to go back to doing back-end stuff. I want to go back, you know, thinking about how we're going to integrate with all those APIs. I don't want to do UI stuff anymore.”

Which was kind of like an interesting shift because I remember at the very beginning of my career, where people were talking about front-end—you know, “Front-end is not real programming. Front-end is, you know, it's easy, it's simple. I can learn CSS in an hour.” And the amount of people who say that CSS is easy, and are good at CSS is exactly zero. Literally, nobody who's actually good at CSS says that, you know, CSS, or front-end, or anything like that is easy because it's not. It's incredibly complex.
It's getting probably more and more complex because the expectations of our front-end UIs [unintelligible 00:27:44].

Corey: It's challenging, it is difficult, and one of the things I find most admirable about you is not even your technical achievements, it's the fact that you're teaching other people to do this. In fact, this gets to the last point I want to cover on our conversation today. When I was bouncing topic ideas off of you, one of the points you brought up that I'm like, “Oh, we're keeping that and saving that for the end,” is why—to your words—why speaking at tech events gets easier, but never easy. Let's dive into that. Tell me more about it.

Tomasz: Basically, I've accidentally kickstarted my career by speaking at meetups which later turned into conferences, which later turned into me publishing courses online, which later turned into me becoming an AWS Hero, and here we are, you know, talking to each other. I do enjoy, you know, going out in public and speaking and being on stage. I think, you know, if somebody has, kind of, the heart, the ability to do that, I do strongly recommend, you know, giving it a shot, not only to give, like, an honestly life-changing experience because the first time you go in front of hundreds of people, this is definitely, you know, something that's going to shake you, while at the same time acknowledging that this is absolutely, definitely not for everyone. But if you are able to do that, I think this is definitely worth your time. But as you said—by quoting me—that it gets easier, so every single time you go on stage, talk at a meetup or at a conference or online conferences—which I'm not exactly a fan of, for the record—it's—

Corey: It's too much like work, too much like meetings. There's nothing different about it.

Tomasz: Yeah, exactly. Like, there's no journey. There's no adventure in online conferences.
I know that, of course, you know, given all of that, you know, we had to kind of switch to online conferences for quite some time, whereas now I think we are pretending that Covid is not a thing anymore, so we, you know, we're effectively going back. But kind of the point I wanted to make is that I am a somewhat experienced public speaker—I'd like to say that because I've been doing that for years—but I've been, you know, talking to people who actually get paid to speak at the conferences, to actually kind of do that for a living, and they all say the same thing. It gets simpler, it gets easier, but it's never freaking easy, you know, to go out there, and you know, to share whatever you've learned.

Corey: I'm one of those people. I am a paid public speaker fairly often, even ignoring the podcast side, and I've spoken on conference stages a couple hundred times at least. And it does get easier but never easy. That's a great way of framing it. You… I get nervous before every talk I give.

There are I think two talks I've given that I did not have an adrenaline hit and nervous energy before I went onstage, and both of those were duds. Because I think that it's part of the process, at least for me. And it's like, “Oh, how do you wind up not being scared before you go on stage?” You don't. You really don't.

But if that appeals to you and you enjoy the adrenaline rush of the rest, do it. If you're one of those people who view public speaking as, “I would prefer death over that”—people are more scared of public speaking than death, in some cases—great. There are so many ways to build audiences and to reach people that fine, if you don't like doing it on stage, don't force yourself to. I'd say try it once; see how it feels. Meetups are great for this.

Tomasz: Yeah. Meetups are basically the best way to get started. I'm yet to meet a meetup, either, you know, offline or online, that is not looking for speakers. It's always quite the opposite, you know?
I was, you know, co-organizing a meetup in my city here in Poznań, Poland, and the story always goes like this: “Okay, we have a date. We have a venue. Where are the speakers?” And then you know, the tumbleweed is going to roll across the road and, “Oh, crap, we don't have any speakers.” So, we're going to try to find some, reach out to people. “Hey, I know that you did this fantastic project at your workplace. Come to us, talk about this.” “No, I don't want to. You know, I'm not an expert. I am, you know, I only have 50 years of experience as an engineer. This is not enough.” Like I said, I do strongly recommend it, but as you said, if you're more scared of public speaking than, like, literally dying, maybe this is not for you.

Corey: Yeah. It comes down to stretching your limits, finding yourself interesting. I find that there are lots of great engineers out there. The ones that I find myself drawn to are the ones who aren't just great at building something, but at storytelling around the thing that they have built. Yes, you built something awesome, but you have to convince me to care about it. You have to show me the thing that got you excited about this.

And if you can't inspire that excitement in other people, okay. Are you really excited about it? Or what is the story here? And again, it's a different skill set. It is not for everyone, but it is absolutely a significant career accelerator if it's leveraged right.

Tomasz: [crosstalk 00:32:45].

Corey: [crosstalk 00:32:46] on it.

Tomasz: Yeah, absolutely. I think that we don't talk enough about, kind of, the overlap between engineering and marketing. In the good sense of marketing, not the shady kind of marketing. The kind of marketing that you do for yourself in order to elevate yourself, your projects, your successes to others.
Because, you know, try as you might, but if you are kind of like sitting in the corner of an office, you know, just jamming on your keyboard 40 hours per week, you're not exactly likely to be promoted because nobody's going to actively reach out to you to find out about your, you know, recent successes and so on.

Which at the same time, I'm not saying that you should go @channel in Slack every single time you push a commit to the main branch, but there's definitely, you know, a way of being, kind of, kind to yourself by letting others know that, “Okay, I'm here. I do exist, I have, you know, those particular skills that you may be interested in. And I'm able to tell a story which is, you know, convincing.” So it's, you know, you can tell a story on stage, but you can also tell your story to your customers by building a feature that they're going to use. [unintelligible 00:33:50].

Corey: I really want to thank you for taking the time to speak with me today. If people want to learn more, where's the best place to find you?

Tomasz: So, the best place to find me is on Twitter. So, my Twitter handle is @tlakomy. So, it's T-L-A-K-O-M-Y. I'm assuming this is going to be in the [show notes 00:34:06] as well.

Corey: Oh, it absolutely is. You beat me to it.

Tomasz: [laugh]. So, you can find Cloudash at cloudash.dev. You can probably also find my email, but don't email me because I'm terrible, absolutely terrible at email, so the best way to kind of reach out to me is via my Twitter DMs. I'm slightly less bad at those.

Corey: Excellent. And we will, of course, put links to that in the [show notes 00:34:29]. Thank you so much for being so generous with your time. I appreciate it.

Tomasz: Thank you. Thank you for having me.

Corey: Tomasz Łakomy, Head of React at Cloudash. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud.
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, and if you're on the YouTubes, smash the like and subscribe button, as the kids say. Whereas if you've hated this episode, please do the exact same thing—five-star review, smash the buttons—but this time also leave an insulting and angry comment written in the form of a CloudWatch log entry that no one is ever able to find in the native interface.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
The Multi-Colored Brick Road to the Cloud with Rachel Dines

Mar 23, 2022 38:08


About Rachel

Rachel leads product and technical marketing for Chronosphere. Previously, Rachel wore lots of marketing hats at CloudHealth (acquired by VMware), and before that, she led product marketing for cloud-integrated storage at NetApp. She also spent many years as an analyst at Forrester Research. Outside of work, Rachel tries to keep up with her young son and hyper-active dog, and when she has time, enjoys crafting and eating out at local restaurants in Boston where she's based.

Links:
Chronosphere: https://chronosphere.io
Twitter: https://twitter.com/RachelDines
Email: rachel@chronosphere.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: The company 0x4447 builds products to increase standardization and security in AWS organizations. They do this with automated pipelines that use well-structured projects to create secure, easy-to-maintain and fail-tolerant solutions, one of which is their VPN product built on top of the popular OpenVPN project which has no license restrictions; you are only limited by the network card in the instance. To learn more visit: snark.cloud/deployandgo

Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database.
Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. A repeat guest joins me today, and instead of talking about where she works, we're going to talk about how she got there. Rachel Dines is the Head of Product and Technical Marketing at Chronosphere. Rachel, thank you for joining me.

Rachel: Thanks, Corey. It's great to be here again.

Corey: So, back in the early days of me getting started, well, I guess all this nonsense, I was an independent consultant working in the world of cloud cost management and you were over at CloudHealth, which was effectively the 800-pound gorilla in that space. I've gotten louder, and of course, that means noisier as well. You wound up going through the acquisition by VMware at CloudHealth, and now you're over at Chronosphere. We're going to get to all of that, but I'd rather start at the beginning, which, you know, when you're telling stories seems like a reasonable place to start. Your first job out of school, to my understanding, was as an analyst at Forrester. Is that correct?

Rachel: It was, yeah. Actually, I started as a research associate at Forrester and eventually became an analyst. But yes, it was Forrester. And when I was leaving school—you know, I studied art history and computer science, which is a great combination, makes a ton of sense—I can explain it another time—and I really wanted to go work at the equivalent of FAANG back then, which was just Google. I really wanted to go work at Google.

And I did the whole song-and-dance interview there and did not get the job. Best thing that's ever happened to me because the next day a Forrester recruiter called. I didn't know what Forrester was—once again, I was right out of college—I said, “This sounds kind of interesting.
I'll check it out.” Seven years later, I was a principal analyst covering, you know, cloud-to-cloud resiliency and backup to the cloud and cloud storage. And that was an amazing start to my career, and really, I credit a lot of the things I've learned and done since then to that start at Forrester.

Corey: Well, I'll admit this: I was disturbingly far into my 30s before I started to realize what it is that Forrester and its endless brethren did. I'm almost certain you can tell that story better than I can, so what is it that Forrester does? What is its place in the ecosystem?

Rachel: Forrester is one of the two or three biggest industry analyst firms. So, the people that work there—the analysts there—are basically paid to be, like, big thinkers and strategists and analysts, right? There's a reason it's called that. And so the way that we spent all of our time was, you know, talking to interesting, large, typically enterprise IT, and I was in the infrastructure and operations group, so I was speaking to infrastructure, ops, precursors to DevOps—DevOps wasn't really a thing back in ye olden times, but we were speaking to them and learning their best practices and publishing reports about the technology, the people and the process that they dealt with. And so you know, over the course of a year, I would talk to hundreds of different large enterprises, the infrastructure and ops leaders at everyone from, like, American Express to Johnson & Johnson to Monsanto, learn from them, write research and reports, and also do things like inquiries and speaking engagements and that kind of stuff.

So, the idea of industry analysts is that they're neutral, they're objective.
You can go to them for advice, and they can tell you, you know, this is the shortlist of vendors you should consider and this is what you should look for in a solution.

Corey: I love the idea of what that role is, but it took me a while as a condescending engineer to really wrap my head around it because I viewed it as, oh, it's just a cover-your-ass exercise so that when a big company makes a decision, they don't get yelled at later, and they can say, “Well, it seemed like the right thing to do. You can't blame us.” And that is an overwhelmingly cynical perspective. But the way it was explained to me, it really was put into context—of all things—by way of using the AWS bill as a lens. There's a whole bunch of tools and scripts and whatnot on GitHub that will tell you different things about your AWS environment, and if I run them in my environment, yeah, they work super well.

I run them in a client environment and the thing explodes because it's not designed to work at a scale of 10,000 instances in a single availability zone. It's not designed to do backing off so it doesn't exhaust rate limits across the board. It requires a rethinking at that scale. When you're talking about enterprise-scale, a lot of the Twitter zeitgeist, as it were, about what tools work well and what tools don't for various startups, they fail to cross over into the bowels of a regulated entity that has a bunch of other governance and management concerns that don't really apply. So, there's this idea of okay, now that we're a large, going entity with serious revenue behind this, and migrating to any of these things is a substantial lift. What is the right answer? And that is sort of how I see the role of these companies in the ecosystem playing out. Is that directionally correct?

Rachel: I would definitely agree that that is directionally correct. And it was the direction that it was going when I was there at Forrester. And by the way, I've been gone from there for, I think, eight-plus years.
So, you know, it's definitely evolved in this space—

Corey: A lifetime in tech.

Rachel: Literally feels like a lifetime. Towards the end of my time there was when we were starting to get briefings from this bookstore company—you might have heard of them—um, Amazon?

Corey: Barnes and Noble.

Rachel: Yes. And Barnes and Noble. Yes. So, we're starting to get briefings from Amazon, you know, about Amazon Web Services, and S3 had just been introduced. And I got really excited about Netflix and chaos engineering—this was 2012, right?—and so I did a bunch of research on chaos engineering and tried to figure out how it could apply to the enterprises.

And I would, like, bring it to Capital One, and they were like, “Ya crazy.” Turns out I think I was just a little bit ahead of my time, and I'm seeing a lot more of the industry analysts now today looking at like, “Okay, well, yeah, what is Uber doing? Like, what is Netflix doing?” And figure out how that can translate to the enterprise. And it's not a one-to-one, right, just because the people and the structures and the process is so different, so the technology can't just, like, make the leap on its own. But yes, I would definitely agree with that, but it hasn't necessarily always been that way.

Corey: Oh, yeah. Like, these days, we're seeing serverless adoption on some levels being driven by enterprises. I mean, Liberty Mutual is doing stuff there that is really at the avant-garde that startups are learning from. It's really neat to see that being turned on its head because you always see these big enterprises saying, “We're like a startup,” but you never see a startup saying, “We're like a big enterprise.” Because that's evocative of something that isn't generally compelling.

“Well, what does that mean, exactly?
You take forever to do expense reports, and then you get super finicky about it, and you have so much bureaucracy?” No, no, no, it's, “Now, that we're process bound, it's that we understand data sovereignty and things like that.” But you didn't stay there forever. You at some point decided, okay, talking to people who are working in this industry is all well and good, but time for you to go work in that industry yourself. And you went to, I believe, NetApp by way of Riverbed.

Rachel: Yes, yeah. So, I left Forrester and I went over to Riverbed to work on their cloud storage solution as a product marketer. And I had an amazing six months at Riverbed, but I happened to join, unfortunately, right around the time they were being taken private, and they ended up divesting their storage product line off to NetApp. And they divested some of their other product lines to some other companies as part of the whole deal going private. So, it was a short stint at Riverbed, although I've met some people that I've stayed in touch with and are still my friends, you know, many years later.

And so, yeah, ended up over at NetApp. And it wasn't necessarily what I had initially planned for, but it was a really fun opportunity to take a cloud-integrated storage product—so it was an appliance that people put in their data centers; you could send backups to it, and it shipped those backups on the back end to S3 and then to Glacier when that came out—trying to make that successful in a company that was really not overly associated with cloud. That was a really fun process and a fun journey. And now I look at NetApp and where they are today, and they've acquired Spot and they've acquired CloudCheckr, and they're, like, really going all-in in public cloud. And I like to think, like, “Hey, I was in the early days of that.” But yeah, so that was an interesting time in my life for multiple reasons.

Corey: Yeah, Spot was a fascinating product, and I was surprised to see it go to NetApp.
It was one of those acquisitions that didn't make a whole lot of sense to me at the time. NetApp has always been one of those companies I hold in relatively high regard. Back when I was coming up in the industry, a bit before the 2012s or so, it was routinely ranked as the number one tech employer on a whole bunch of surveys. And I don't think these were the kinds of surveys you can just buy your way to the top of.

People who worked there seemed genuinely happy, the technology was fantastic, and it was, for example, the one use case in which I would run a database where its data store lived on a network file system. I kept whining at the EFS people over at AWS for years that well, EFS is great and all but it's no NetApp. Then they released NetApp ONTAP on FSx as a first-party service, in which case, okay, thank you. You have now solved every last reservation I have around this. Onward.

And I still hold the system in high regard. But it has, on some level, seen an erosion. We're no longer in a world where I am hurling big money—or medium money by enterprise standards—off to NetApp for their filers. It instead is something that the cloud providers are providing, and last time I checked, no matter how much I spend on AWS they wouldn't let me shove a NetApp filer into us-east-1 without asking some very uncomfortable questions.

Rachel: Yeah. The whole storage industry is changing really quickly, and more of the traditional on-premises storage vendors have needed to adapt or… not, you know, be very successful. I think that NetApp's done a nice job of adapting in recent years. But I'd been in storage and backup for my entire career at that point, and I was like, I need to get out. I'm done with storage. I'm done with backup. I'm done with disaster recovery. I had that time; I want to go try something totally new.

And that was how I ended up leaving NetApp and joining CloudHealth. Because I'd never really done the startup thing.
I'd done a medium-sized company at Riverbed; I'd done a pretty big company at NetApp. I've always been an entrepreneur at heart. I started my first business on the playground in second grade, and it was reselling sticks of gum. Like, I would go use my allowance to buy a big pack of gum, and then I sold the sticks individually for ten cents apiece, making a killer margin. And it was a subscription, actually. [laugh].
Corey: Administrations generally—at least public schools—generally tend to turn a—have a dim view of those things, as I recall from my misspent youth.
Rachel: Yeah. I was shut down pretty quickly, but it was a brilliant business model. It was—so you had to join the club to even be able to buy into getting the sticks of gum. I was, you know, all over the subscription business [laugh] back then.
Corey: An area I want to explore here is you mentioned that you double-majored. One of those majors was computer science—art history was sort of set aside for the moment, it doesn't really align with either direction here—then you served as a research associate turned analyst, and then you went into product marketing, which is an interesting direction to go in. Why'd you do it?
Rachel: You know, with product marketing and industry analysts, there's a lot of synergy; there's a lot of things that are in common between those two. And in fact, when you see people moving back and forth from the analyst world to the vendor side, a lot of the time it is to product marketing or product management. I mean, product marketing, our whole job is to take really complex technical concepts and relate them back to business concepts and make them make sense in the broader world and tell a narrative around it. That's a lot of what an analyst is doing too. So, you know, analysts are writing, they're giving public talks, they're coming up with big ideas; that's what a great product marketer is doing also.
So, for me, that shift was actually very natural.
And by the way, like, when I graduated from school, I knew I was never going to code for a living. I had learned all I was going to learn and I knew it wasn't for me. Huge props, like, you know, all the people that do code for a living, I knew I couldn't do it. I wasn't cut out for it.Corey: I found somewhat similar discoveries on my own journey. I can configure things for a living, it's fun, but I still need to work with people, past a certain point. I know I've talked about this before on some of these shows, but for me, when starting out independently, I sort of assumed at some level, I was going to shut it down, and well, and then I'll go back to being an SRE or managing an ops team. And it was only somewhat recently that I had the revelation that if everything that I'm building here collapses out from under me or gets acquired or whatnot and I have to go get a real job again, I'll almost certainly be doing something in the marketing space as opposed to the engineering space. And that was an interesting adjustment to my self-image as I went through it.Because I've built everything that I've been doing up until this point, aligned at… a certain level of technical delivery and building things as an engineer, admittedly a mediocre one. And it took me a fair bit of time to get, I guess, over the idea of myself in that context of, “Wow, you're not really an engineer. Are you a tech worker?” Kind of. And I sort of find myself existing in the in-between spaces.Did you have similar reticence when you went down the marketing path or was it something that you had, I guess, a more mature view of it [laugh] than I did and said, “Yeah, I see the value immediately,” whereas I had to basically be dragged there kicking and screaming?Rachel: Well, first of all, Corey, congratulations for coming to terms with the fact that you are a marketer. I saw it in you from the minute I met you, and I think I've known you since before you were famous. 
That's my claim to fame is that I knew you before you were famous. But for me personally, no, I didn't actually have that stigma. But that does exist in this industry.I mean, I think people are—think they look down on marketing as kind of like ugh, you know, “The product sells itself. The product markets itself. We don't need that.” But when you're on the inside, you know you can have an amazing product and if you don't position it well and if you don't message it well, it's never going to succeed.Corey: Our consulting [sub-projects 00:14:31] are basically if you bring us in, you will turn a profit on the engaging. We are selling what basically [unintelligible 00:14:37] money. It is one of the easiest ROI calculations. And it still requires a significant amount of work on positioning even on the sales process alone. There's no such thing as an easy enterprise sale.And you're right, in fact, I think the first time we met, I was still running a DevOps team at a company and I was deploying the product that you were doing marketing for. And that was quite the experience. Honestly, it was one of the—please don't take this the wrong way at all—but you were at CloudHealth at the time and the entire point was that it was effectively positioned in such a way of, right, this winds up solving a lot of the problems that we have in the AWS bill. And looking at how some of those things were working, it was this is an annoying, obnoxious problem that I wish I could pay to make someone else's problem, just to make it go away. Well, that indirectly led to exactly where we are now.And it's really been an interesting ride, just seeing how that whole thing has evolved. How did you wind up finding yourself at CloudHealth? Because after VMware, you said it was time to go to a startup. And it's interesting because I look at where you've been now, and CloudHealth itself gets dwarfed by VMware, which is sort of the exact opposite of a startup, due to the acquisition. 
But CloudHealth was independent for years while you were there.Rachel: Yeah, it was. I was at CloudHealth for about three-plus years before we were acquired. You know, how did I end up there? It's… it's all hazy. I was looking at a lot of startups, I was looking for, like, you know, a Series B company, about 50 people, I wanted something in the public cloud space, but not storage—if I could get away from storage that was the dream—and I met the folks from CloudHealth, and obviously, I hadn't heard about—I didn't know about cloud cost management or cloud governance or FinOps, like, none of those were things back then, but I was I just was really attracted to the vision of the founders.The founders were, you know, Joe Kinsella and Dan Phillips and Dave Eicher, and I was like, “Hey, they've built startups before. They've got a great idea.” Joe had felt this pain when he was a customer of AWS in the early days, and so I was like—Corey: As have we all.Rachel: Right?Corey: I don't think you'll find anyone in this space who hasn't been a customer in that situation and realized just how painful and maddening the whole space is.Rachel: Exactly, yeah. And he was an early customer back in, I think, 2014, 2015. So yeah, I met the team, I really believed in their vision, and I jumped in. And it was really amazing journey, and I got to build a pretty big team over time. By the time we were acquired a couple of years later, I think we were maybe three or 400 people. And actually, fun story. We were acquired the same week my son was born, so that was an exciting experience. A lot of change happened in my life all at once.But during the time there, I got to, you know, work with some really, really cool large cloud-scale organizations. And that was during that time that I started to learn more about Kubernetes and Mesos at the time, and started on the journey that led me to where I am now. 
But that was one of the happiest accidents, similar to the happy accident of, like, how did I end up at Forrester? Well, I didn't get the job at Google. [laugh]. How did I end up at CloudHealth? I got connected with the founders and their story was really inspiring.Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing.Corey: It's amusing to me the idea that, oh, you're at NetApp if you want to go do something that is absolutely not storage. Great. So, you go work at CloudHealth. You're like, “All right. Things are great.” Now, to take a big sip of scalding hot coffee and see just how big AWS billing data could possibly be. Yeah, oops, you're a storage company all over again.Some of our, honestly, our largest bills these days are RDS, Athena, and of course, S3 for all of the bills storage we wind up doing for our customers. And it is… it is not small. And that has become sort of an eye-opener for me just the fact that this is, on some level, a big data problem.Rachel: Yeah.Corey: And how do you wind up even understanding all the data that lives in just the outputs of the billing system? Which I feel is sort of a good setup for the next question of after the acquisition, you stayed at VMware for a while and then matriculated out to where you are now where you're the Head of Product and Technical Marketing at Chronosphere, which is in the observability space. How did you get there from cloud bills?Rachel: Yeah. 
So, it all makes sense when I piece it together in my mind. So, when I was at CloudHealth, one of the big, big pain points I was seeing from a lot of our customers was the growth in their monitoring bills. Like, they would be like, “Okay, thanks. You helped us, you know, with our EC2 reservations, and we did right-sizing, and you help with this. But, like, can you help with our Datadog bill? Like, can you help with our New Relic bill?”And that was becoming the next biggest line item for them. And in some cases, they were spending more on monitoring and APM and like, what we now call some things observability, they were spending more on that than they were on their public cloud, which is just bananas. So, I would see them making really kind of bizarre and sometimes they'd have to make choices that were really not the best choices. Like, “I guess we're not going to monitor the lab anymore. We're just going to uninstall the agents because we can't pay this anymore.”Corey: Going down from full observability into sampling. I remember that. The New Relic shuffle is what I believe we call it at the time. Let's be clear, they have since fixed a lot of their pricing challenges, but it was the idea of great suddenly we're doing a lot more staging environments, and they come knocking asking for more money but it's a—I don't need that level of visibility in the pre-prod environments, I guess. I hate doing it that way because then you have a divergence between pre-prod and actual prod. But it was economically just a challenge. Yeah, because again, when it comes to cloud, architecture and cost are really one and the same.Rachel: Exactly. And it's not so much that, like—sure, you know, you can fix the pricing model, but there's still the underlying issue of it's not black and white, right? My pre-prod data is not the same value as my prod data, so I shouldn't have to treat it the same way, shouldn't have to pay for it the same way. 
So, seeing that trend on the one hand, and then, on the other hand, 2017, 2018, I started working on the container cost allocation products at CloudHealth, and we were—you know, this was even before that, maybe 2017, we were arguing about, like, Mesos and Kubernetes and which one was going to be, and I got kind of—got very interested in that world.
And so once again, as I was getting to the point where I was ready to leave CloudHealth, I was like, okay, there's two key things I'm seeing in the market. One is people need a change in their monitoring and observability; what they're doing now isn't working. And two, cloud-native is coming up, coming fast, and it's going to really disrupt this market. So, I went looking for someone that was at the intersection of the two. And that's when I met the team at Chronosphere, and just immediately hit it off with the founders in a similar way to how I hit it off with the founders at CloudHealth. At Chronosphere, the founders had felt pain—
Corey: Team is so important in these things.
Rachel: It's really the only thing to me. Like, you spend so much time at work. You need to love who you work with. You need to love your—not love them, but, you know, you need to work with people that you enjoy working with and people that you learn from.
Corey: You don't have to love all your coworkers, and at best you can get away with just being civil with them, but it's so much nicer when you can have a productive working relationship. And that is very far from we're going to go hang out, have beers after work because that leads to a monoculture. But the ability to really enjoy the people that you work with is so important and I wish that more folks paid attention to that.
Rachel: Yeah, that's so important to me. And so I met the team, the team was fantastic, just incredibly smart and dedicated people. And then the technology, it makes sense.
We like to joke that we're not just taking the box—the observability box—and writing Kubernetes in crayon on the outside. It was built from the ground up for cloud-native, right?
So, it's built for this speed, containers coming and going all the time, for the scale, just how much more metrics and observability data containers emit, the interdependencies between all of your microservices and your containers, like, all of that stuff. When you combine it, it makes the older… let's call them legacy. It's crazy to call, like, some of these SaaS solutions legacy but they really are; they weren't built for cloud-native, they were built for VMs and a more traditional cloud infrastructure, and they're starting to fall over. So, that's how I got involved. It's actually, as we record, it's my one-year anniversary at Chronosphere. Which is, it's been a really wild year. We've grown a lot.
Corey: Congratulations. I usually celebrate those by having a surprise meeting with my boss and someone I've never met before from HR. They don't offer you coffee. They have the manila envelope of doom in front of them and hold on, it's going to be a wild meeting. But on the plus side, you get to leave work early today.
Rachel: So, good thing you run your own business now, Corey.
Corey: Yeah, it's way harder for me to wind up getting surprise-fired. I see it coming [laugh]—
Rachel: [laugh].
Corey: —a ways away now, and it looks like an economic industry trend.
Rachel: [sigh]. Oh, man. Well, anyhow.
Corey: Selfishly, I have to ask. You spent a lot of time working in cloud cost, to a point where I learned an awful lot from you as I was exploring the space and learning as I went. And, on some level, for me at least, it's become an aspect of my identity, for better or worse. What was it like for you to leave and go into an orthogonal space?
And sure, there's significant overlap, but it's a very different problem aimed at different buyers, and honestly, I think it is a more exciting problem that you are in now, from a business strategic perspective because there's a limited amount of what you can cut off that goes up theoretically to a hundred percent of the cloud bill. But getting better observability means you can accelerate your feature velocity and that turns into something rather significant rather quickly. But what was it like?
Rachel: It's uncomfortable, for sure. And I tend to do this to myself. I get a little bit itchy the same way I wanted to get out of storage. It's not because there's anything wrong with storage; I just wanted to go try something different. I tend to, I guess, do this to myself every five years or so: I make a slightly orthogonal switch in the space that I'm in.
And I think it's because I love learning something new. The jumping into something new and having the fresh eyes is so terrifying, but it's also really fun. And so it was really hard to leave cloud cost management. I mean, I got to Chronosphere and I was like, “Show me the cloud bill.” And I was like, “Do we have Reserved Instances?” Like, “Are we doing Committed Use Discounts with Google?”
I just needed to know. And then that helped. Okay, I got a look at the cloud bill. I felt a little better. I made a few optimizations and then I got back to my actual job which was, you know, running product marketing for Chronosphere. And I still love to jump in and just make just a little recommendation here and there. Like, “Oh, I noticed the costs are creeping up on this. Did we consider this?”
Corey: Oh, I still get a kick out of that where I was talking to an Amazonian whose side project was 110 bucks a month, and he's like, yeah, I don't think you could do much over here. It's like, “Mmm, I'll bet you a drink I can.”—
Rachel: Challenge accepted.
Corey: —it's like, “All right. You're on.” Cut it to 40 bucks.
And he's like, “How did you do that?” It's because I know what I'm doing and this pattern repeats.
And it's the architectural misconfigurations, bounded by context, that turn into so much. And I still maintain that I can look at the AWS bill for most environments for last month and have a pretty good idea, based upon nothing other than that, what's going on in the environment. It turns out that maybe that's a relatively crappy observability system when all is said and done, but it tells an awful lot. I can definitely see the appeal of wanting to get away from purely cost-driven or cost-side information and into things that give a lot more context into how things are behaving, how they're performing. I think there's been something of an industry rebrand away from monitoring, alerting, and trending over time to calling it observability.
And I know that people are going to have angry opinions about that—and it's imperative that you not email me—but it all is getting down to the same thing of is my site up or down? Or in larger distributed systems, how down is it? And I still think we're learning an awful lot. I cringe at the early days of Nagios when that was what I was depending upon to tell me whether my site was up or not. And oh, yeah, turns out that when the Nagios server goes down, you have some other problems you need to think about. It became this iterative, piling up on and piling up on and piling up on until you can get sort of good at it.
But the entire ecosystem around understanding what's going on in your application has just exploded since the last time I was really running production sites of any scale, in anger. So, it really would be a different world today.
Rachel: It's changing so fast and that's part of what makes it really exciting. And the other big thing that I love about this is, like, this is a must-have. This is not table stakes. This is not optional.
Like, a great observability solution is the difference between conquering a market or being overrun.If you look at what our founders—our founders at Chronosphere came from Uber, right? They ran the observability team at Uber. And they truly believe—and I believe them, too—that this was a competitive advantage for them. The fact that you could go to Uber and it's always up and it's always running and you know you're not going to have an issue, that became an advantage to them that helped them conquer new markets. We do the same thing for our customers. Corey: The entire idea around how these things are talked about in terms of downtime and the rest is just sort of ludicrous, on some level, because we take specific cases as industry truths. Like, I still remember, when Amazon was down one day when I was trying to buy a pair of underwear. And by that theory, it was—great, I hit a 404 page and a picture of a dog. Well, according to a lot of these industry truisms, then, well, one day a week for that entire rotation of underpants, I should have just been not wearing any. But no here in reality, I went back an hour later and bought underpants.Now, counterpoint: If every third time I wound up trying to check out at Amazon, I wound up hitting that error page, I would spend a lot more money at Target. There is a point at which repeated downtime comes at a cost. But one-offs for some businesses are just fine. Counterpoint with if Uber is down when you're trying to get a ride, well, that ride [unintelligible 00:28:36] may very well be lost for them and there is a definitive cost. No one's going to go back and click on an ad as well, for example, and Amazon is increasingly an advertising company.So, there's a lot of nuance to it. I think we can generally say that across the board, in most cases, downtime bad. 
But as far as how much that is and what form that looks like and what impact that has on your company, it really becomes situationally dependent.Rachel: I'm just going to gloss over the fact that you buy your underwear on Amazon and really not make any commentary on that. But I mean—Corey: They sell everything there. And the problem, of course, is the crappy counterfeit underwear under the Amazon Basics brand that they ripped off from the good underwear brands. But that's a whole ‘nother kettle of wax for a different podcast.Rachel: Yep. Once again, not making any commentary on your—on that. Sorry, I lost my train of thought. I work in my dining room. My husband, my dog are all just—welcome to pandemic life here.Corey: No, it's fair. They live there. We don't, as a general rule.Rachel: [laugh]. Very true. Yeah. You're not usually in my dining room, all of you but—oh, so uptime downtime, also not such a simple conversation, right? It's not like all of Amazon is down or all of DoorDash is down. It might just be one individual service or one individual region or something that is—Corey: One service in one subset of one availability zone. And this is the problem. People complain about the Amazon status page, but if every time something was down, it reflected there, you'd see a never ending sea of red, and that would absolutely erode confidence in the platform. Counterpoint when things are down for you and it's not red. It's maddening. And there's no good answer.Rachel: No. There's no good answer. There's no good answer. And the [laugh] yeah, the Amazon status page. And this is something I—bringing me back to my Forrester days, availability and resiliency in the cloud was one of the areas I focused on.And, you know, this was once again, early days of public cloud, but remember when Netflix went down on Christmas Eve, and—God, what year was this? 
Maybe… 2012, and that was the worst possible time they could have had downtime because so many people are with their families watching their Doctor Who Christmas Specials, which is what I was trying to watch at the time.Corey: Yeah, now you can't watch it. You have to actually talk to those people, and none of us can stand them. And oh, dear Lord, yeah—Rachel: What a nightmare.Corey: —brutal for the family dynamic. Observability is one of those things as well that unlike you know, the AWS bill, it's very easy to explain to people who are not deep in the space where it's, “Oh, great. Okay. So, you have a website. It goes well. Then you want—it gets slow, so you put it on two computers. Great. Now, it puts on five computers. Now, it's on 100 computers, half on the East Coast, half on the West Coast. Two of those computers are down. How do you tell?”And it turns in—like, they start to understand the idea of understanding what's going on in a complex system. “All right, how many people work at your company?” “2000,” “Great. Three laptops are broken. How do you figure out which ones are broken?” If you're one of the people with a broken laptop, how do you figure out whether it's your laptop or the entire system? And it lends itself really well to analogies, whereas if I'm not careful when I describe what I do, people think I can get them a better deal on underpants. No, not that kind of Amazon bill. I'm sorry.Rachel: [laugh]. Yeah, or they started to think that you're some kind of accountant or a tax advisor, but.Corey: Which I prefer, as opposed to people at neighborhood block parties thinking that I'm the computer guy because then it's, “Oh, I'm having trouble with the printer.” It's, “Great. Have you tried [laugh] throwing away and buying a new one? That's what I do.”Rachel: This is a huge problem I have in my life of everyone thinking I'm going to fix all of their computer and cloud things. And I come from a big tech family. 
My whole family is in tech, yet somehow I'm the one at family gatherings doing, “Did you turn it off and turn it back on again?” Like, somehow that's become my job.Corey: People get really annoyed when you say that and even more annoyed when it fixes the problem.Rachel: Usually does. So, the thread I wanted to pick back up on though before I got distracted by my husband and dog wandering around—at least my son is not in the room with us because he'd have a lot to say—is that the standard industry definition of observability—so once again, people are going to write to us, I'm sure; they can write to me, not you, Corey, about observability, it's just the latest buzzword. It's just monitoring, or you know—Corey: It's hipster monitoring.Rachel: Hipster monitoring. That's what you like to call it. I don't really care what we call it. The important thing is it gets us through three phases, right? The first is knowing that something is wrong. If you don't know what's wrong, how are you supposed to ever go fix it, right? So, you need to know that those three laptops are broken.The next thing is you need to know how bad is it? Like, if those three laptops are broken is the CEO, the COO, and the CRO, that's real bad. If it's three, you know, random peons in marketing, maybe not so bad. So, you need to triage, you need to understand roughly, like, the order of magnitude of it, and then you need to fix it. [laugh].Once you fix it, you can go back and then say, all right, what was the root cause of this? How do we make sure this doesn't happen again? So, the way you go through that cycle, you're going to use metrics, you might use logs, you might use traces, but that's not the definition of observability. Observability is all about getting through that, know, then triage, then fix it, then understand.Corey: I really want to thank you for taking the time to speak with me today. 
If people do want to learn more, give you their unfiltered opinions, where's the best place to find you?Rachel: Well, you can find me on Twitter, I'm @RachelDines. You can also email me, rachel@chronosphere.io. I hope I don't regret giving out that email address. That's a good way you can come and argue with me about what is observability. I will not be giving advice on cloud bills. For that, you should go to Corey. But yeah, that's a good way to get in touch.Corey: Thank you so much for your time. I really appreciate it.Rachel: Yeah, thank you.Corey: Rachel Dines, Head of Product and Technical Marketing at Chronosphere. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, and castigate me with an angry comment telling me that I really should have followed the thread between the obvious link between art history and AWS billing, which is almost certainly a more disturbing Caravaggio.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

2.5 Admins
2.5 Admins 82: Oil, Gold, and Firstborn

Mar 17, 2022 (29:24)


Our concerns about Google's latest acquisition, the technical implications of the sanctions on Russia, more on VPNs and filesystem replacements, and using Nagios with Sanoid.
Plugs: Controlling Resource Limits with RCTL in FreeBSD; Support us on Patreon
News: Google makes second-largest acquisition ever: $5.4 billion for Mandiant; Internet backbone Cogent cuts Russia connectivity […]

Screaming in the Cloud
“Cloudash”ing onto Mac with Maciej Winnicki

Jan 13, 2022 (34:41)


About Maciej
Maciej Winnicki is a serverless enthusiast with over 6 years of experience in writing software with no servers whatsoever. Serverless Engineer at Stedi, Cloudash Founder, ex-Engineering Manager, and one of the early employees at Serverless Inc.
Links:
Cloudash: https://cloudash.dev
Maciej Winnicki Twitter: https://twitter.com/mthenw
Tomasz Łakomy Twitter: https://twitter.com/tlakomy
Cloudash email: hello@cloudash.dev
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I'm going to just guess that it's awful because it's always awful. No one loves their deployment process. What if launching new features didn't require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren't what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.
Corey: This episode is sponsored in part by our friends at Rising Cloud, which I hadn't heard of before, but they're doing something vaguely interesting here. They are using AI, which is usually where my eyes glaze over and I lose attention, but they're using it to help developers be more efficient by reducing repetitive tasks. So, the idea being that you can run stateless things without having to worry about scaling, placement, et cetera, and the rest.
They claim significant cost savings, and they're able to wind up taking what you're running as it is in AWS with no changes, and run it inside of their data centers that span multiple regions. I'm somewhat skeptical, but their customers seem to really like them, so that's one of those areas where I really have a hard time being too snarky about it because when you solve a customer's problem and they get out there in public and say, “We're solving a problem,” it's very hard to snark about that. Multus Medical, Construx.ai and Stax have seen significant results by using them. And it's worth exploring. So, if you're looking for a smarter, faster, cheaper alternative to EC2, Lambda, or Batch, consider checking them out. Visit risingcloud.com/benefits. That's risingcloud.com/benefits, and be sure to tell them that I sent you because watching people wince when you mention my name is one of the guilty pleasures of listening to this podcast.
Corey: Welcome to Screaming in the Cloud. I'm Cloud Economist Corey Quinn. And my guest today is Maciej Winnicki, who is the founder of Cloudash. Now, before I dive into the intricacies of what that is, I'm going to just stake out a position that one of the biggest painful parts of working with AWS in any meaningful sense, particularly in a serverless microservices way, is figuring out what the hell's going on in the environment. There's a bunch of tools offered to do this and they're all—yeee, they aspire to mediocrity. Maciej, thank you for joining me today.
Maciej: Thank you for having me.
Corey: So, I turned out to have accidentally blown up Cloudash, sort of before you were really ready for the attention. You, I think, tweeted about it or put it on Hacker News or something; I stumbled over it because it turns out that anything that vaguely touches cloud winds up in my filters because of awesome technology, and personality defects on my part. And I tweeted about it as I set it up and got the thing running, and apparently this led to a surge of attention on this thing that you've built. So, let me start off with an apology. Oops, I didn't realize it was supposed to be a quiet launch.
Maciej: I actually thank you for that. Like, that was great. And we got a lot of attention from your tweet thread, actually because at the end, that was the most critical part. At the end of the Twitter thread, you wrote that you're staying as a customer, so we have it on our website and this is perfect. But actually, as you said, that's correct.
Our marketing strategy for releasing Cloudash was to post it on LinkedIn. I know this is not, kind of, the best strategy, but that was our plan. Like, it was like, hey, like, me and my friend, Tomasz, who's also working on Cloudash, we thought like, let's just post it on LinkedIn and we'll see how it goes. And accidentally, I'm receiving a notification from Twitter, “Hey, Corey started tweeting about it.” And I was like, “Oh, my God, I'm having a heart attack.” But then I read the, you know—
Corey: Oops.
Maciej: [laugh]. Yeah. I read the, kind of, conclusion, and I was super happy. And again, thank you for that because this is actually when Cloudash kind of started rolling as a product and as a, kind of, business. So yeah, that was great.
Corey: To give a little backstory and context here is, I write a whole bunch of serverless nonsense.
I build API Gateways, I hook them up to Lambda functions, and then it sort of kind of works. Ish. From there, okay, I would try and track down what was going on because in a microservices land, everything becomes a murder mystery; you're trying to figure out what's broken, and things have exploded. And I became a paying customer of IOpipe. And then New Relic bought them. Well, crap.

Then I became a paying customer of Epsagon. And they got acquired by Cisco, at which point I immediately congratulated the founders, who I know on a social basis, and then closed my account because I wanted to get out before Cisco ruins it because, Cisco. Then it was, what am I going to use next? And right around that time is when I stumbled across Cloudash. And it takes a different approach than any other entity in the space that I've seen because you are a native Mac desktop app. I believe you're Mac only, but you seem to be Electron, so I could be way off base on that.

Maciej: So, we're Linux as well right now and soon we'll be Windows as well. But yeah, so, right now is Mac OS and Linux. Yeah, that's correct. So, our approach is a little bit different.

So, let me start by saying what's Cloudash? Like, Cloudash is a desktop app for, kind of, monitoring and troubleshooting serverless architectures services, like, serverless stuff in general. And the approach that we took is a little bit different because we are not web-based, we're desktop-based. And there's a couple of advantages of that approach. The first one is that, like, you don't need to share your data with us because we're not, kind of, downloading your metrics and logs to our back end and to process them, et cetera, et cetera. We are just using the credentials, the AWS profiles that you have defined on your computer, so nothing goes out of your AWS account.

And I think this is, like, considering, like, from the security perspective, this is very crucial.
You don't need to create a role that you give us access to or anything like that. You just use the stuff that you have on your desktop, and everything stays on your AWS account. So, nothing—we don't download it, we don't process it, we don't do anything from that. And that's one approach—well, that's the one advantage. The other advantage is, like, kind of, onboarding, as I kind of mentioned because we're using the AWS profiles that you have defined in your computer.

Corey: Well, you're doing significantly more than that because I have a lot of different accounts configured different ways, and when I go to one of them that uses SSO, it automatically fires me off to the SSO login page if I haven't logged in that day for a 12 hour session—

Maciej: Yes.

Corey: —for things that have credentials stored locally, it uses those; and for things that are using role-chaining to use assuming roles from the things I have credentials for, and the things that I just do role assumption in, and it works flawlessly. It just works the way that most of my command-line tools do. I've never seen a desktop app that does this.

Maciej: Yeah. So, we put a lot of effort into making sure that this works great because we know that, like, no one will use Cloudash if there's—like, not no one, but like, we're targeting, like, serverless teams, maybe, in enterprise companies, or serverless teams working on some startups. And in most cases, those teams or those engineers, they use SSO, or at least MFA, right? So, we have it covered. And as you said, like, it should be the onboarding part is really easy because you just pick your AWS profile, you just pick region, and just pick, right now, a CloudFormation stack because we get the information about your service based on CloudFormation stack.
So yeah, we put a lot of effort in making sure that this works without any issues.

Corey: There are some challenges to it that I saw while setting it up, and that's also sort of the nature of the fact you are, in fact, integrating with CloudWatch. For example, it's region specific. Well, what if I want to have an app that's multi-region? Well, you're going to have a bad time because doing [laugh] anything multi-region in AWS means you're going to have a bad time that gets particularly obnoxious and easy to get to when you're doing something like Lambda@Edge, where, oh, where do the logs live? That's going to be in a CloudFront distribution in whatever region it winds up being accessed from. So, it comes down to what distribution endpoint or point of presence did that particular request go through, and it becomes this giant game of whack-a-mole. It's frustrating, and it's obnoxious, and it's also in no way your fault.

Maciej: Yeah, I mean, we are at the beginning. Right now, it's the most straightforward, kind of pe—how people think about stacks of serverless. They think in terms of regions because I think for us, regions, or replicated stacks, or things like that are not really popular yet. Maybe they will become—like, this is how AWS works as a whole, so it's not surprising that we're kind of following this path. I think my point is that our main goal, the ultimate goal, is to make monitoring, as I said, the troubleshooting serverless app as simple as possible.

So, once we will hear from our customers, from our users that, “Hey, we would like to get a little bit better experience around regions,” we will definitely implement that because why not, right? And I think the whole point of Cloudash—and maybe we can go more deep into that later—is that we want to bring context into your metrics and logs. If you're seeing a, for example, X-Ray trace ID in your logs, you should be able, with one click, to just see that trace.
It's not yet implemented in Cloudash, but we are having it in the backlog. But my point is that, like, there should be some journey when you're debugging stuff, and you shouldn't be just, like, left alone having, like, 20 tabs, CloudWatch tabs open and trying to figure out where I was—like, where's the Lambda? Where's the API Gateway logs? Where are the CloudFront logs? And how I can kind of connect all of that? Because that's—it's an issue right now.

Corey: Even what you've done so far is incredibly helpful compared to the baseline experience that folks will often have, where I can define a service that is comprised of a number of different functions—I have one set up right now that has seven functions in it—I grab any one of those things, and I can set how far the lookback is, when I look at that function, ranging from 5 minutes to 30 days. And it shows me at the top the metrics of invocations, the duration that the function runs for, and the number of errors. And then, in the same pane down below it, it shows the CloudWatch logs. So, “Oh, okay, great. I can drag and zoom into a specific timeframe, and I see just the things inside of that.”

And I know this sounds like well, what's the hard part here? Yeah, except nothing else does it in an easy-to-use, discoverable way that just sort of hangs out here. Honestly, the biggest win for me is that I don't have to log in to the browser, navigate through some ridiculous other thing to track down what I'm talking about. It hangs out on my desktop all the time, and whether it's open or not, whenever I fire it up, it just works, basically, and I don't have to think about it. It reduces the friction from, “This thing is broken,” to, “Let me see what the logs say.”

Very often I can go from not having it open at all to staring at the logs and having to wait a minute because there's some latency before the event happens and it hits CloudWatch logs itself.
I'm pretty impressed with it, and I've been keeping an eye on what this thing is costing me. It is effectively nothing in terms of CloudWatch retrieval charges. Because it's not sitting there sucking all this data up all the time, for everything that's running. Like, we've all seen the monitoring system that winds up costing you more in ancillary fees than they charge you. This doesn't do that.

I also—while we're talking about money, I want to make very clear—because disclaiming the direction the money flows in is always important—you haven't paid me a dime, ever, to my understanding. I am a paying customer at full price for this service, and I have been since I discovered it. And that is very much an intentional choice. You did not sponsor this podcast, you are not paying me to say nice things. We're talking because I legitimately adore this thing that you've built, and I want it to exist.

Maciej: That's correct. And again, thank you for that. [laugh].

Corey: It's true. You can buy my attention, but not my opinion. Now, to be clear, when I did that tweet thread, I did get the sense that this was something that you had built as sort of a side project, as a labor of love. It does not have VC behind it, of which I'm aware, and that's always going to, on some level, shade how I approach a service and how critical I'm going to be on it. Just because it's, yeah, if you've raised a couple hundred million dollars and your user experience is trash, I'm going to call that out.

But if this is something where you just soft launched, yeah, I'm not going to be a jerk about weird usability bugs here. I might call it out as “Ooh, this is an area for improvement,” but not, “What jackwagon thought of this?” I am trying to be a kinder, gentler Corey in the new year. But at the same time, I also want to be very clear that there's room for improvement on everything.
What surprised me the most about this is how well you nailed the user experience despite not having a full team of people doing UX research.

Maciej: That was definitely a priority. So, maybe a little bit of history. So, I started working on Cloudash, I think it was April… 2019. I think? Yeah. It's 2021 right now. Or we're 2022. [unintelligible 00:11:33].

Corey: Yeah. 2022, now. I—

Maciej: I'm sorry. [laugh].

Corey: —I've been screwing that up every time I write the dates myself, I'm with you.

Maciej: [laugh]. Okay, so I started working on Cloudash, in 2020, April 2020.

Corey: There we go.

Maciej: So, after eight months, I released some beta, like, free; you could download it from GitHub. Like, you can still download on GitHub, but at that time, there was no license, you didn't have to buy a license to run it. So, it was, like, very early, like, 0.3 version that was working, but sort of, like, [unintelligible 00:12:00] working. There were some bugs.

And that was the first time that I tweeted about it on Twitter. It gets some attention, but, like, some people started using it. I get some feedback, very initial feedback. And I was like, every time I open Cloudash, I get the sense that, like, this is useful. I'm talking about my own tool, but like, [laugh] that's the thing.

So, further in the history. So, I'm kind of service engineer by my own. I am a software engineer, I started focusing on serverless, in, like, 2015, 2016. I was working for Serverless Inc. as an early employee.

I was then working as an engineering manager for a couple of companies. I work as an engineering manager right now at Stedi; we're also, like, fully serverless. So I, kind of, trying to fix my own issues with serverless, or trying to improve the whole experience around serverless in AWS. So, that's the main purpose why we're building Cloudash: Because we want to improve the experience. And one use case I'm often mentioning is that, let's say that you're kind of on duty.
Like, so in the middle of night PagerDuty is calling you, so you need to figure out what's going on with your Lambda or API Gateway.

Corey: Yes. PagerDuty, the original [Call of Duty: Nagios 00:13:04]. “It's two in the morning; who is it?” “It's PagerDuty. Wake up, jackass.” Yeah. We all had those moments.

Maciej: Exactly. So, the PagerDuty is calling you and you're, kind of, in the middle of night, you're not sure what's going on. So, the kind of thing that we want to optimize is from waking up into understanding what's going on with your serverless stuff should be minimized. And that's the purpose of Cloudash as well. So, you should just run one tool, and you should immediately see what's going on. And that's the purpose.

And probably with one or two clicks, you should see the logs responsible, for example, in your Lambda. Again, like that's exactly what we want to cover, that was the initial thing that we want to cover, to kind of minimize the time you spent on troubleshooting serverless apps. Because as we all know, kind of, the longer it's down, the less money you make, et cetera, et cetera, et cetera.

Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of "Hello, World" demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build.
With Always Free, you can do things like run small scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free. This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free. That's snark.cloud/oci-free.

Corey: One of the things that I appreciate about this is that I have something like five different microservices now that power my newsletter production pipeline every week. And periodically, I'll make a change and something breaks because testing is something that I should really get around to one of these days, but when I'm the only customer, cool. Doesn't really matter until suddenly I'm trying to write something and it doesn't work. Great. Time to go diving in, and always I'm never in my best frame of mind for that because I'm thinking about writing for humans not writing for computers. And that becomes a challenge.

And okay, how do I get to the figuring out exactly what is broken this time? Regression testing: It really should be a thing more than it has been for me.

Maciej: You should write those tests. [laugh].

Corey: Yeah. And then I fire this up, and okay, great. Which sub-service is it? Great. Okay, what happened in the last five minutes on that service? Oh, okay, it says it failed successfully in the logs. Okay, that's on me. I can't really blame you for that. But all right.

And then it's a matter of adding more [print or 00:14:54] debug statements, and understanding what the hell is going on, mostly that I'm bad at programming. And then it just sort of works from there. It's a lot easier to, I guess, to reason about this from my perspective than it is to go through the CloudWatch dashboards, where it's okay, here's a whole bunch of metrics on different graphs, most of which you don't actually care about—as opposed to unified view that you offer—and then “Oh, you want to look at logs, that's a whole separate sub-service.
That's a different service team, obviously, so go open that up in another browser.” And I'm sitting here going, “I don't know who designed this, but are there any windows in their house? My God.”

It's just the saddest thing I can possibly experience when I'm in the middle of trying to troubleshoot. Let's be clear, when I'm troubleshooting, I am in no mood to be charitable to anyone or anything, so that's probably unfair to those teams. But by the same token, it's intensely frustrating when I keep smacking into limitations that get in my way while I'm just trying to get the thing up and running again.

Maciej: As you mentioned about UX that, like, we've spent a lot of time thinking about the UX, trying different approaches, trying to understand which metrics are the most important. And as we all know, kind of, serverless simplifies a lot of stuff, and there's, like, way less metrics that you need to look into when something is happening, but we want to make sure that the stuff that we show—which is duration, errors, and p95—are probably the most important in most cases, so like, covering most of this stuff. So sorry, I didn't mention that before; it was very important from the very beginning. And also, like, literally, I spent a lot of time, like, working on the colors, which sounds funny, [laugh] but I wanted to get them right. We're not yet working on dark mode, but maybe soon.

Anyways, the visual part, it's always close to my heart, so we spent a lot of time going back to what you just said. So, definitely the experience around using CloudWatch right now, and CloudWatch logs, CloudWatch metrics, is not really tailored for any specific use case because they have to be generic, right? Because AWS has, like, I don't know, like, 300, or whatever number of services, probably half of them producing logs—maybe not half, maybe—

Corey: We shouldn't name a number because they'll release five more between now and when this publishes in 20 minutes.

Maciej: [laugh].
So, CloudWatch has to be generic. What we want to do with Cloudash is to take those generic tools—because we use, of course, CloudWatch logs, CloudWatch metrics, we fetch data from them—but make the visual part more tailored for specific use case—in our case, it's the serverless use case—and make sure that it's really, kind of—it shows only the stuff that you need to see, not everything else. So again, like that's the main purpose. And then one more thing, we—like this is also some kind of measurement of success, we want to reduce number of tabs that you need to have open in your browser when you're dealing with CloudWatch. So, we tried to put most important stuff in one view so you don't need to flip between tabs, as you usually do when you try to understand some kind of broader scope, or broader context of your, you know, error in Lambda.

Corey: What inspired you to do this as a desktop application? Because a lot of companies are doing similar things, as SaaS, as webapps. And I have to—as someone who yourself—you're a self-described serverless engineer—it seems to me that building a webapp is sort of like the common description use case of a lot of serverless stuff. And you're sitting here saying, “Nope, it's desktop app time.” Which again, I'm super glad you did. It's exactly what I was looking for. How do you get here?

Maciej: I'd been thinking about both kinds of types of apps. So like, definitely webapp was the initial idea how to build something, it was the webapp. Because as you said, like, that's the default mode. Like, we are thinking webapp; like, let's build a webapp because I'm an engineer, right? There is some inspiration coming from Dynobase, which was made by a friend [unintelligible 00:18:55] who also lives in Poland—I didn't mention that; we're based in [Poznań 00:18:58], Poland.

And when I started thinking about it, there's a lot of benefits of using this approach.
The biggest benefit, as I mentioned, is security; and the second benefit is just most, like, cost-effective because we don't need to run in the backend, right? We don't need to download all your metrics, all your logs. We I think, like, let's think about it, like, from the perspective. Listen, so everyone in the company to start working, they have to download all of your stuff from your AWS account. Like, that sounds insane because you don't need all of that stuff elsewhere.

Corey: Store multiple copies of it. Yeah I, generally when I'm looking at this, I care about the last five to ten minutes.

Maciej: Exactly.

Corey: I don't—

Maciej: Exactly.

Corey: —really care what happened three-and-a-half years ago on this function. Almost always. But occasionally I want to look back at, “Oh, this has been breaking. How long has it been that way?” But I already have that in the AWS environment unless I've done the right thing and turned on, you know, log expiry.

Maciej: Exactly. So, this is a lot of, like, I don't want to be, like, you know, mean to anyone but like, that's a lot of waste. Like, that's a lot of waste of compute power because you need to download it; of cost because you need to get this data out of AWS, which you need to pay for, you know, get metric data and stuff like this. So, you need to—

Corey: And almost all of its—what is it? Write once, read never. Because it's, you don't generally look at these things.

Maciej: Yeah, yeah. Exactly.

Corey: And so much of this, too, for every invocation I have, even though it's low traffic stuff, it's the start with a request ID and what version is running, it tells me ‘latest.' Helpful. A single line of comment in this case says ‘200.' Why it says that, I couldn't tell you. And then it says ‘End request ID.'
The end.

Now, there's no way to turn that off unless you disabled the ability to write to CloudWatch logs in the function, but ingest on that costs 50 cents a gigabyte, so okay, I guess that's AWS's money-making scam of the year. Good for them. But there's so much of that, it's like looking at—like, when things are working, it's like looking at a low traffic site that's behind a load balancer, where there's a whole—you have gigabytes, in some cases, of load balancer—of web server logs on the thing that's sitting in your auto-scaling group. And those logs are just load balancer health checks. 98% of it is just that.

Same type of problem here, I don't care about that, I don't want to pay to store it, I certainly don't want to pay to store it twice. I get it, that makes an awful lot of sense. It also makes your security job a hell of a lot easier because you're not sitting on a whole bunch of confidential data from other people. Because, “Well, it's just logs. What could possibly be confidential in there?” “Oh, my sweet summer child, have you seen some of the crap people put in logs?”

Maciej: I've seen many things in logs. I don't want to mention them. But anyways—and also, you know, like, usually when you gave access to your AWS account, it can ruin you. You know, like, there might be a lot of—like, you need to really trust the company to give access to your AWS account. Of course, in most cases, the roles are scoped to, you know, only CloudWatch stuff, actions, et cetera, et cetera, but you know, like, there are some situations in which something may not be properly provisioned. And then you give access to everything.

Corey: And you can get an awful lot of data you wouldn't necessarily want out of that stuff. Give me just the PDF printout of last month's bill for a lot of environments, and I can tell you disturbing levels of detail about what your architecture is, just because when you—you can infer an awful lot.

Maciej: Yeah.

Corey: Yeah, I hear you.
It makes your security story super straightforward.

Maciej: Yeah, exactly. So, I think just repeat my, like, the some inspiration. And then when I started thinking about Cloudash, like, definitely one of the inspiration was Dynobase, from the, kind of, GUI for, like, more powerful UI for DynamoDB. So, if you're interested in that stuff, you can also check this out.

Corey: Oh, yeah, I've been a big fan of that, too. That'll be a separate discussion on a different episode, for sure.

Maciej: [laugh]. Yeah.

Corey: But looking at all of this, looking at the approach of, the only real concern—well, not even a concern. The only real challenge I have with it for my use case is that when I'm on the road, the only thing that I bring with me for a computer is my iPad Pro. I'm not suggesting by any means that you should build this as an iPad app; that strikes me as, like, 15 levels of obnoxious. But it does mean that sometimes I still have to go diving into the CloudWatch console when I'm not home. Which, you know, without this, without Cloudash, that's what I was doing originally anyway.

Maciej: You're the only person that requested that. And we will put that into backlog, and we will get to that at some point. [laugh].

Corey: No, no, no. Smart question is to offer me a specific enterprise tier pricing—

Maciej: Oh, okay. [laugh].

Corey: —that is eye-poppingly high. It's like, “Hey, if you want a subsidize feature development, we're thrilled to empower that.” But—

Maciej: [laugh]. Yeah, yeah. To be honest, I like that would be hard to write [unintelligible 00:23:33] implement as iPad app, or iPhone app, or whatever because then, like, what's the story behind? Like, how can I get the credentials, right? It's not possible.

Corey: Yeah, you'd have to have some fun with that. There are a couple of ways I can think of offhand, but then that turns into a sandboxing issue, and it becomes something where you have to store credentials locally, regardless, even if they're ephemeral.
And that's not great. Maybe turn it into a webapp someday or something. Who knows.

What I also appreciate is that we had a conversation when you first launched, and I wound up basically going on a Zoom call with you and more or less tearing apart everything you've built—in an ideally constructive way—but looking at a lot of the things you've changed in your website, you listened to an awful lot of feedback. You doubled your pricing, for example. Used to be ten bucks a month; now you're twenty. Great. I'm a big believer in charging more.

You absolutely add that kind of value because it's, “Well, twenty bucks a month for a desktop app. That sounds crappy.” It's, “Yeah, jackwagon, what's your time worth?” I was spending seven bucks a month in serverless charges, and 120 or 130 a month for Epsagon, and I was thrilled to pieces to be doing it because the value I got from being able to quickly diagnose what the hell was going on far outstripped what the actual cost of doing these things. Don't fall into the trap of assuming that well, I shouldn't pay for software. I can just do it myself. Your time is never free. People think it is, but it's not.

Maciej: That's true. The original price of $9.99, I think that was the price was the launch promo. After some time, we've decided—and after adding more features: API Gateway support—we've decided that this is, like, solving way more problems, so like, you should probably pay a little bit more for that. But you're kind of lucky because you subscribed to it when it was 9.99, and this will be your kind of prize for the end of, you know—

Corey: Well, I'm going to argue with you after the show to raise the price on mine, just because it's true. It's the—you want to support the things that you want to exist in the world. I also like the fact that you offered an annual plan because I will go weeks without ever opening the app. And that doesn't mean it isn't adding value.
It's that oh, yeah, I will need that now that I'm hitting these issues again.

And if I'm paying on a monthly basis, and it shows up with a, “Oh, you got charged again.” “Well, I didn't use it this month; I should cancel.” And [unintelligible 00:25:44] to an awful lot of subscriber churn. But in the course of a year, if I don't have at least one instance in which case, wow, that ten minute span justified the entire $200 annual price tag, then, yeah, you built the wrong thing or it's not for me, but I can think of three incidents so far since I started using it in the past four months that have led to that being worth everything you will charge me a year, and then some, just because it made it so clear what was breaking.

Maciej: So, in that regard, we are also thinking about the team licenses, that's definitely on the roadmap. There will be some changes to that. And we definitely working on more and more features. And if we're—like, the roadmap is mostly about supporting more and more AWS services, so right now it's Lambda, API Gateway, we're definitely thinking about SQS, SNS, to get some sense how your messages are going through, probably something, like, DynamoDB metrics. And this is all kind of serverless, but why not going wider? Like, why not going to Fargate? Like, Fargate is theoretically serverless, but you know, like, it's serverless on—

Corey: It's serverless with a giant asterisk next to it.

Maciej: Yeah, [laugh] exactly. So, but why not? Like, it's exactly the same thing in terms of, there is some user flow, there is some user journey, when you want to debug something. You want to go from API Gateway, maybe to the container to see, I don't know, like, DynamoDB metric or something like that, so it should be all easy. And this is definitely something.

Later, why not EC2 metrics? Like, it would be a little bit harder.
But I'm just saying, like, first thing here is that you are not, like, at this point, we are serverless, but once we cover serverless, why not going wider? Why not supporting more and more services and just making sure that all those use cases are correctly modeled with the UI and UX, et cetera?

Corey: That's going to be an interesting challenge, just because that feels like what a lot of the SaaS monitoring and observability tooling has done. And then you fire this thing up, and it looks an awful lot like the AWS console. And it's, “Yeah, I just want to look at this one application that doesn't use any of the rest of those things.” Again, I have full faith and confidence in your ability to pull this off. You clearly have done that well based upon what we've seen so far. I just wonder how you're going to wind up tackling that challenge when you get there.

Maciej: And maybe not EC2. Maybe I went too far. [laugh].

Corey: Yeah, honestly, even EC2-land, it feels like that is more or less a solved problem. If you want to treat it as a bunch of EC2, you can use Nagios. It's fine.

Maciej: Yeah, totally.

Corey: There are tools that have solved that problem. But not much that I've seen has solved the serverless piece the way that I want it solved. You have.

Maciej: So, it's definitely a long road to make sure that the serverless—and by serverless, I mean serverless how AWS understands serverless, so including Fargate, for example. So, there's a lot of stuff that we can improve. It's a lot of stuff that can make easier with Cloudash than it is with CloudWatch, just staying inside serverless, it will take us a lot of time to make sure that is all correct. And correctly modeled, correctly designed, et cetera. So yeah, I went too far with EC2, sorry.

Corey: Exactly. That's okay. We all go too far with EC2, I assure you.

Maciej: Sorry everyone using EC2 instances.
[laugh].
Corey: If people want to kick the tires on it, where can they find it?
Maciej: They can find it on cloudash.dev.
Corey: One D in the middle. That one throws me sometimes.
Maciej: One D. Actually, after talking to you, we have a double-D domain as well, so we can also try ‘Clouddash' with double-D. [laugh].
Corey: Excellent, excellent. Okay, that is fantastic. Because I keep trying to put the double-D in when I'm typing it in my search tool on my desktop, and it doesn't show up. And it's like, “What the—oh, right.” But yeah, we'll get there one of these days.
Maciej: Only the domain. It's only the domain. You will be redirected to single-D.
Corey: Exactly.
Maciej: [laugh].
Corey: We'll have to expand later; I'll finance the feature request there. It'll go well. If people want to learn more about what you have to think about these things, where else can they find you?
Maciej: On Twitter, and my Twitter handle is @mthenw. M-then-W, which is M-T-H—mthenw. And my co-founder @tlakomy. You can probably add that to [show notes 00:29:35]. [laugh].
Corey: Oh, I certainly will. It's fine, yeah. Here's a whole bunch of letters. I hear you. My Twitter handle used to be my amateur radio callsign. It turns out most people don't think like that. And yeah, it's become an iterative learning process. Thank you so much for taking the time to speak with me today and for building this thing. I really appreciate both of them.
Maciej: Thank you for having me here. I encourage everyone to visit cloudash.dev, if you have any feature requests, any questions just send us an email at hello@cloudash.dev, or just go to GitHub repository in the issues; just create an issue, describe what you want and we can talk about it.
We are always happy to help. The main purpose, the ultimate goal of Cloudash is to make the serverless engineer's life easier, on very high level.
And on a little bit lower level, just to make, you know, troubleshooting and debugging serverless apps easier.
Corey: Well, from my perspective, you've succeeded.
Maciej: Thank you.
Corey: Thank you. Maciej Winnicki, founder of Cloudash. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment telling me exactly why I'm wrong for using an iPad to do these things, but not being able to send it because you didn't find a good way to store the credentials.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.

SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast

Revisiting BrakTooth: Two Months Later https://isc.sans.edu/forums/diary/Revisiting+BrakTooth+Two+Months+Later/27992/
Escalating XSS to Sainthood with Nagios https://blog.grimm-co.com/2021/11/escalating-xss-to-sainthood-with-nagios.html
Pentaho Business Analytics Vulnerability https://hawsec.com/publications/pentaho/HVPENT210401-Pentaho-BA-Security-Assessment-Report-v1_1.pdf

Screaming in the Cloud
The Value of Analysts and Observability with Nick Heudecker

Screaming in the Cloud

Oct 20, 2021 40:42


About Nick
Nick Heudecker leads market strategy and competitive intelligence at Cribl, the observability pipeline company. Prior to Cribl, Nick spent eight years as an industry analyst at Gartner, covering data and analytics. Before that, he led engineering and product teams at multiple startups, with a bias towards open source software and adoption, and served as a cryptologist in the US Navy. Join Corey and Nick as they discuss the differences between observability and monitoring, why organizations struggle to get value from observability data, why observability requires new data management approaches, how observability pipelines are creating opportunities for SRE and SecOps teams, the balance between budgets and insight, why goats are the world's best mammal, and more.
Links:
Cribl: https://cribl.io/
Cribl Community: https://cribl.io/community
Twitter: https://twitter.com/nheudecker
Try Cribl hosted solution: https://cribl.cloud
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked against an early version of their tool, canarytokens.org, in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use those things. It's an awesome approach. I've used something similar for years. Check them out. But wait, there's more.
They also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It's awesome. If you don't do something like this, you're likely to find out that you've gotten breached, the hard way. Take a look at this. It's one of those few things that I look at and say, “Wow, that is an amazing idea. I love it.” That's canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.
Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic light colored Excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and insight tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack, including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives.
In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft PowerPoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is a bit fun because I'm joined by someone that I have a fair bit in common with. Sure, I moonlight sometimes as an analyst because I don't really seem to know what that means, and he spent significant amounts of time as a VP analyst at Gartner. But more importantly than that, a lot of the reason that I am the way that I am is that I spent almost a decade growing up in Maine, and in Maine, there's not a lot to do other than sit inside for the nine months of winter every year and develop personality problems.
You've already seen what that looks like with me. Please welcome Nick Heudecker, who presumably will disprove that, but maybe not. He is currently a senior director of market strategy and competitive intelligence at Cribl. Nick, thanks for joining me.
Nick: Thanks for having me. Excited to be here.
Corey: So, let's start at the very beginning. I like playing with people's titles, and you certainly have a lofty one. ‘Competitive intelligence' feels an awful lot like jeopardy. What am I missing?
Nick: Well, I'm basically an internal analyst at the company. So, I spend a lot of time looking at the broader market, seeing what trends are happening out there; looking at what kind of thought leadership content that I can create to help people discover Cribl, get interested in the products and services that we offer. So, I'm mostly—you mentioned my time in Maine. I was a cryptologist in the Navy and I spent almost all of my time focused on what the bad guys do. And in this job, I focus on what our potential competitors do in the market. So, I'm very externally focused. Does that help?
Does that explain it?
Corey: No, it absolutely does. I mean, you folks have been sponsoring our nonsense for which we thank you, but the biggest problem that I have with telling the story of Cribl was that originally—initially it was, from my perspective, “What is this hokey nonsense?” And then I learned and got an answer and then finish the sentence with, “And where can I buy it?” Because it seems that the big competitive threat that you have is something crappy that some rando sysadmin has cobbled together. And I say that as the rando sysadmin, who has cobbled a lot of things like that together. And it's awful. I wasn't aware you folks had direct competitors.
Nick: Today we don't. There's a couple that it might be emerging a little bit, but in general, no, it's mostly us, and that's what I analyze every day. Are there other emerging companies in the space? Are there open-source projects? But you're right, most of the things that we compete against are DIY today. Absolutely.
Corey: In your previous role, which you were at for a very long time in tech terms—which in a lot of other cases is, “Okay, that doesn't seem that long,” but seven and a half years is a respectable stint at a company. And you were at Gartner doing a number of analyst-like activities. Let's start at the beginning because I assure you, I'm asking this purely for the audience and not because I don't know the answer myself, but what exactly is the purpose of an analyst firm, of which Gartner is the most broadly known and, follow up, why do companies care what Gartner thinks?
Nick: Yeah. It's a good question, one that I answer a lot. So, what is the purpose of an analyst firm? The purpose of an analyst firm is to get impartial information about something, whether that is supply chain technology, big data tech, human resource management technologies.
And it's often difficult if you're an end-user and you're interested in say, acquiring a new piece of technology, what really works well, what doesn't.
And so the analyst firm because in the course of a given year, I would talk to nearly a thousand companies and both end-users and vendors as well as investors about what they're doing, what challenges they're having, and I would distill that down into 30-minute conversations with everyone else. And so we provided impartial information in aggregate to people who just wanted to help. And that's the purpose of an analyst firm. Your second question, why do people care? Well, I didn't get paid by vendors.
I got paid by the company that I worked for, and so I got to be Tron; I fought for the users. And because I talk to so many different companies in different geographies, in different industries, and I share that information with my colleagues, they shared with me, we had a very robust understanding of what's actually happening in any technology market. And that's uncommon kind of insight to really have in any kind of industry. So, that's the purpose and that's why people care.
Corey: It's easy from the engineering perspective that I used to inhabit to make fun of it. It's oh, it's purely justification when you're making a big decision, so if it goes sideways—because find me a technology project that doesn't eventually go sideways—I want to be able to make sure that I'm not the one that catches heat for it because Gartner said it was good. They have an amazing credibility story going on there, and I used to have that very dismissive perspective. But the more I started talking to folks who are Gartner customers themselves and some of the analyst-style things that I do with a variety of different companies, it's turned into, “No, no.
They're after insight.”
Because it turns out, from my perspective at least, the more that you are focused on building a product that solves a problem, you sort of lose touch with the broader market because the only people you're really talking to are either in your space or have already acknowledged and been right there and become your customer and have been jaded to see things from your point of view. Getting a more objective viewpoint from an impartial third party does have value.
Nick: Absolutely. And I want you to succeed, I want you to be successful, I want to carry on a relationship with all the clients that I would speak with, and so one of the fun things I would always ask is, “Why are you asking me this question now?” Sometimes it would come in, they'd be very innocuous: “Compare these databases,” or, “Compare these cloud services.” “Well, why are you asking?” And that's when you get to, kind of like, the psychology of it.
“Oh, we just hired a new CIO and he or she hates vendor X, so we have to get rid of it.” “Well, all right. Let's figure out how we solve this problem for you.” And so it wasn't always just technology comparisons. Technology is easy, you write a check and you hope for the best.
But when you're dealing with large teams and maybe a globally distributed company, it really comes down to culture, and personality, and all the harder factors. And so it was always—those were always the most fun and certainly the most challenging conversations to have.
Corey: One challenge that I find in this space is—in my narrow niche of the world where I focus on AWS bills, where things are extraordinarily yes or no, black or white, binary choices—that I talked to companies, like during the pandemic, and they were super happy that, “Oh, yeah. Our infrastructure has auto-scaling and it works super well.” And I look at the bill and the spend graph over time is so flat you could basically play a game of pool on top of it.
And I don't believe that I'm talking to people who are lying to me. I truly don't believe that people make that decision, but what they believe versus what is evidenced in reality are not necessarily congruent. How do you disambiguate from the stories that people want to tell about themselves? And what they're actually doing?
Nick: You have to unpack it. I think you have to ask a series of questions to figure out what their motivation is. Who else is on the call, as well? I would sometimes drop into a phone call and there would be a dozen people on the line. Those inquiry calls would go the worst because everyone wants to stake a claim, everyone wants to be heard, no one's going to be honest with you or with anyone else on the call.
So, you typically need to have a pretty personal conversation about what does this person want to accomplish, what does the company want to accomplish, and what are the factors that are pushing against what those things are? It's like a novel, right? You have a character, the character wants to achieve something, and there are multiple obstacles in that person's way. And so by act five, ideally everything wraps up and it's perfect. And so my job is to get the character out of the tree that is on fire and onto the beach where the person can relax.
So, you have to unpack a lot of different questions and answers to figure out, well, are they telling me what their boss wants to hear or are they really looking for help? Sometimes you're successful, sometimes you're not. Not everyone does want to be open and honest. In other cases, you would have a team show up to a call with maybe a junior engineer and they really just want you to tell them that the junior engineer's architecture is not a good idea. And so you do a lot of couples therapy as well. I don't know if this is really answering the question for you, but there are no easy answers. And people are defensive, they have biases, companies overall are risk-averse.
I think you know this.
Corey: Oh, yeah.
Nick: And so it can be difficult to get to the bottom of what their real motivation is.
Corey: My approach has always been that if you want serious data, you go talk to Gartner. If you want [anec-data 00:09:48] and some understanding, well, maybe we can have that conversation, but they're empowering different decisions at different levels, and that's fine. To be clear, I do not consider Gartner to be a competitor to what I do in any respect. It turns out that I am not very good at drawing charts in varying shades of blue and positioning things just so with repeatable methodology, and they're not particularly good at having cartoon animals as their mascot that they put into ridiculous situations. We each have our portion of the universe, and that's working out reasonably well.
Nick: Well, and there's also something to unpack there as well because I would say that people look at Gartner and they think they have a lot of data. To a certain degree they do, but a lot of it is not quantifiable data. If you look at a firm like IDC, they specialize in—like, they are a data house; that is what they do. And so their view of the world and how they advise their clients is different. So, even within analyst firms, there is differentiation in what approach they take, how consultative they might be with their clients, one versus another. So, there certainly are differences that you could find the more exposure you get into the industry.
Corey: For a while, I've been making a recurring joke that Route 53—Amazon's managed DNS service—is in fact a database. And then at some point, I saw a post on Reddit where someone said, “Yeah, I see the joke and it's great, but why should I actually not do this?” At which point I had to jump in and say, “Okay, look.
Jokes are all well and good, but as soon as people start taking me seriously, it's very much time to come clean.” Because I think that's the only ethical and responsible thing to do in this ecosystem.
Similarly, there was another great joke once upon a time. It was an April Fool's Day prank, and Google put out a paper about this thing they called MapReduce. Hilarious prank that Yahoo fell for hook, line, and sinker, and wound up building Hadoop out of it and we're still paying the price for that, years later. You have a bit of a reputation from your time at Gartner as being—and I quote—“The man who killed Hadoop.” What happened there? What's the story? And I appreciate your finally making clear to the rest of us that it was, in fact, a joke. What happened there?
Nick: Well, one of the pieces of research that Gartner puts out every year is this thing called a Hype Cycle. And we've all seen it, it looks like a roller coaster in profile; big mountain goes up really high and then comes down steeply, drops into a valley, and then—
Corey: ‘the trough of disillusionment,' as I recall.
Nick: Yes, my favorite. And then plateaus out. And one of the profiles on that curve was Hadoop distributions. And after years of taking inquiry calls, and writing documents, and speaking with everybody about what they were doing, we realized that this really isn't taking off like everyone thinks it is. Cluster sizes weren't getting bigger, people were having a lot of challenges with the complexity, people couldn't find skills to run it themselves if they wanted to.
And then the cloud providers came in and said, “Well, we'll make a lot of this really simple for you, and we'll get rid of HDFS,” which is—was a good idea, but it didn't really scale well. I think that the challenge of having to acquire computers with compute storage and memory again, and again, and again, and again, just was not sustainable for the majority of enterprises. And so we flagged it as this will be obsolete before plateau.
And at that point, we got a lot of hate mail, but it just seemed like the right decision to make, right? Once again, we're Tron; we fight for the users.
And that seemed like the right advice and direction to provide to the end-users. And so didn't make a lot of friends, but I think I was long-term right about what happened in the Hadoop space. Certainly, some fragments of it are left over and we're still seeing—you know, Spark is going strong, there's a lot of Hive still around, but Hadoop as this amalgamation of open-source projects, I think is effectively dead.
Corey: I sure hope you're right. I think it has a long tail like most things that are there. Legacy is the condescending engineering term for ‘it makes money.' You were at Gartner for almost eight years and then you left to go work at Cribl. What triggered that? What was it that made you decide, “This is great. I've been here a long time. I've obviously made it work for me. I'm going to go work at a startup that apparently, even though it recently raised a $200 million funding round”—congratulations on that, by the way—“It still apparently can't afford to buy a vowel in its name.” That's C-R-I-B-L because, of course, it is. Maybe another consonant, while you're shopping. But okay, great. It's oddly spelled, it is hard to explain in some cases, to folks who are not already feeling pain in that space. What was it that made you decide to sit up and, “All right, this is where I want to be?”
Nick: Well, I met the co-founders when I was an analyst. They were working at Splunk and oddly enough—this is going to be an interesting transition compared to the previous thing we talked about—they were working on Hunk, which was, let's use HDFS to store Splunk data. Made a lot of sense, right? It could be much more cost-effective than high-cost infrastructure for Splunk. And so they told me about this; I was interested.
And so I met the co-founders and then I reconnected with them after they left and formed Cribl.
And I thought the story was really cool because where they're sitting is between sources and destinations of observability data. And they were solving a problem that all of my customers had, but they couldn't resolve. They would try and build it themselves. They would look at—Kafka was a popular choice, but that had some challenges for observability data—works fantastically well for application data.
And they were just—had a very pragmatic view of the world that they were inhabiting and the problem that they were looking to solve. And it looked kind of like a no-brainer of a problem to solve. But when you double-click on it, when you really look down and say, “All right, what are the challenges with doing this?” They're really insurmountable for a lot of organizations. So, even though they may try and take a DIY approach, they often run into trouble after just a few weeks because of all the protocols you have to support, all the different data formats, and all the destinations, and role-based access control, and everything else that goes along with it.
And so I really liked the team. I thought the product inhabited a unique space in the market—we've already talked about the lack of competitors in the space—and I just felt like the company was on a rocket ship—or is a rocket ship—that basically had unbounded success potential. And so when the opportunity arose to join the team and do a lot of the things I like doing as an analyst—examining the market, talking to people looking at competitive aspects—I jumped at it.
Corey: It's nice when you see those opportunities that show up in front of you, and the stars sort of align. It's like, this is not just something that I'm excited about and enthused about, but hey, they can use me. I can add something to where they're going and help them get there better, faster, sooner, et cetera, et cetera.
Nick: When you're an analyst, you look at dozens of companies a month and I'd never seen an opportunity that looked like that.
Everything kind of looked the same. There's a bunch of data integration companies, there's a bunch of companies with Spark and things like that, but this company was unique; the product was unique, and no one was really recognizing the opportunity. So, it was just a great set of things that all happen at the same time.
Corey: It's always fun to see stars align like that. So—
Nick: Yeah.
Corey: —help me understand in a way that can be articulated to folks who don't have 15 years of grumpy sysadmin experience under their belts, what does Cribl do?
Nick: So, Cribl does a couple of things. Our flagship product is called LogStream, and the easiest way to describe that is as an abstraction between sources and destinations of data. And that doesn't sound very interesting, but if you, from your sysadmin background, you're always dealing with events, logs, now there's traces, metrics are also hanging around—
Corey: Oh, and of course, the time is never synchronized with anything either, so it's sort of a giant whodunit, mystery, where half the eyewitnesses lie.
Nick: Well, there's that. There's a lot of data silos. If you got an agent deployed on a system, it's only going to talk to one destination platform. And you repeat this, maybe a dozen times per server, and you might have 100,000 or 200,000 servers, with all of these different agents running on it, each one locked into one destination. So, you might want to be able to mix and match that data; you can't. You're locked in.
One of the things LogStream does is it lets you do that exact mixing and matching. Another thing that this product does, that LogStream does, is it gives you ability to manage that data. And then what I mean by that is, you may want to reduce how much stuff you're sending into a given platform because maybe that platform charges you by your daily ingest rates or some other kind of event-based charges. And so not all that data is valuable, so why pay to store it if it's not going to be valuable?
Just dump it or reduce the amount of volume that you've got in that payload, like a Windows XML log.
And so that's another aspect that it allows you to do, better management of that stuff. You can redact sensitive fields, you can enrich the data with maybe, say, GeoIPs so you know what kind of data privacy laws you fall under and so on. And so, the story has always been, land the data in your destination platform first, then do all those things. Well, of course, because that's how they charge you; they charge you based on daily ingest. And so now the story is, make those decisions upfront in one place without having to spread this logic all over, and then send the data where you want it to go.
So, that's really, that's the core product today, LogStream. We call ourselves an observability pipeline for observability data. The other thing we've got going on is this project called AppScope, and I think this is pretty cool. AppScope is a black box instrumentation tool that basically resides between the application runtime and the kernel and any shared libraries. And so it provides—without you having to go back and instrument code—it instruments the application for you based on every call that it makes and then can send that data through something like LogStream or to another destination.
So, you don't have to go back and say, “Well, I'm going to try and find the source code for this 30-year-old C++ application.” I can simply run AppScope against the process, and find out exactly what that application is doing for me, and then relay that information to some other destination.
Corey: This episode is sponsored in part by Liquibase. If you're anything like me, you've screwed up the database part of a deployment so severely that you've been banned from touching anything that remotely sounds like SQL, at at least three different companies.
We've mostly got code deployments solved for, but when it comes to databases we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn't have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you'll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.
Corey: I have to ask because I love what you're doing, don't get me wrong. The counterargument that always comes up in this type of conversation is, “Who in their right mind looks at the state of the industry today and says, ‘You know what we need? That's right; another observability tool.'” What differentiates what you folks are building from a lot of the existing names in the space? And to be clear, a lot of the existing names in the space are treating observability simply as hipster monitoring. I'm not entirely sure they're wrong, but that's a different fight for a different time.
Nick: Yeah. I'm happy to come back and talk about that aspect of it, too. What's different about what we're doing is we don't care where the data goes. We don't have a dog in that fight. We want you to have better control over where it goes and what kind of shape it's in when it gets there.
And so I'll give an example. One of our customers wanted to deploy a new SIEM—Security Information Event Management—tool. But they didn't want to have to deploy a couple hundred-thousand new agents to go along with it. They already had the data coming in from another agent, they just couldn't get the data to it. So, they use LogStream to send that data to their new desired platform.
They were able to go from zero to a brand new platform in just a couple days, versus fighting with rolling out agents and having to update them. Did they conflict with existing agents? How much performance did it impact on the servers, and so on? So, we don't care about the destination. We like everybody. We're agnostic when it comes to where that data goes. And—Corey: Oh, it's not about the destination. It's about the journey. Everyone's been saying it, but you've turned it into a product.Nick: It's very spiritual. So, we [laugh] send, we send your observability data on a spiritual [laugh] journey to its destination, and we can do quite a bit with it on the way.Corey: So, you said you offered to go back as well and visit the, “Oh, it's monitoring, but we're going to call it observability because otherwise we get yelled out on Twitter by Charity Majors.” How do you view that?Nick: Monitoring is the things you already know. Right? You know what questions you want to ask, you get an alert if something goes out of bounds or something goes from green to red. Think about monitoring as a data warehouse. You shape your data, you get it all in just the right condition so you can ask the same question over and over again, over different time domains.That's how I think about monitoring. It's prepackaged, you know exactly what you want to do with it. Observability is more like a data lake. I have no idea what I'm going to do with this stuff. I think there's going to be some signals in here that I can use, and I'm going to go explore that data.So, if monitoring is your known knowns, observability is your unknown unknowns. So, an ideal observability solution gives you an opportunity to discover what those are. Once you discover them. Great. Now, you can talk about how to get them into your monitoring system. So, for me, it's kind of a process of discovery.Corey: Which makes an awful lot of sense. 
The problem I've always had with the monitoring approach is it falls into this terrible pattern of enumerate the badness. In other words, “Imagine all the ways that this system can fail,” and then build alerting that lets you know when any of those things happen. And what happens next is inevitable to anyone who's ever dealt with the tricksy devils known as computers, and what happens, of course, is that they find new ways to fail and you generally get to add to the list of things to check for, usually at two o'clock in the morning.

Nick: On a Sunday.

Corey: Oh, absolutely. It almost doesn't matter when. The real problem is when these things happen, it's, “What day, actually, is it?” And you have to check the calendar to figure it out because it's your third time that week being woken up in the dead of night. It's like an infant but less than endearing.

So, that has been the old school approach, and there's unfortunately still an awful lot of, we'll just call it nonsense, in the industry that still does exactly the same thing, except now they call it observability because—hearkening back to earlier in our conversation—there's a certain point in the Gartner Hype Cycle that we are all existing within. What's the deal with that?

Nick: Well, I think that there are a lot of entrenched interests in the monitoring space. And so I think you always see this when a new term comes around. Vendors will say, “All right, well, there's a lot of confusion about this. Let me back-fit my product into this term so that I can continue to look like I'm on the leading edge and I'm not going to put any of my revenues in jeopardy.” I know, that's a cynical view, but I've seen it over and over again.

And I think that's unfortunate because there's a real opportunity to have a better understanding of your systems, to better understand what's happening in all the containers you're deploying and not tearing down the way that you should, to better understand what's happening in distributed systems. 
And it's going to be a real missed opportunity if that is what happens. If we just call this ‘Monitoring 2.0’ it's going to leave a lot of unrealized potential in the market.

Corey: The big problem that I've seen in a lot of different areas is—I'll be direct—consolidation where you have a company that starts to do a thing—and that's great—and then they start doing other things that are tied to it. And in turn, they start, I guess, gathering everything in the ecosystem. If you break down observability into various constituent parts—I know, I know, the pillars thing is going to upset people; ignore that for now—and if you have an offering that's weak in a particular area, okay, instead of building it organically into the product, or saying, “Yeah, that's not what we do,” there's an instinct to acquire a company or build that functionality out. And it turns out that we're building what feels a lot to me like the SaaS equivalent of multifunction printers: they can print, they can scan, they can fax, and none of those three very well, so it winds up with something that dissatisfies everyone, rather than a best-of-breed solution that has a very clear and narrow starting and stopping point. How do you view that?

Nick: Well, what you've described is a compromise, right? A compromise is everyone can work and no one's happy. And I think that's the advantage of where LogStream comes in. The reality is best-of-breed. Most enterprises today have 30 or more different monitoring tools—call them observability tools if you want to—and you will never pry those tools from the dead hands of those sysadmins, DevOps engineers, SREs, et cetera.

They all integrate those tools into how they work and their processes. So, we're living in a best-of-breed world. It's like that in data and analytics—my former beat—and it's like that in monitoring and observability. People really gravitate towards the tools they like, they gravitate towards the tools their friends are using. 
And so you need a way to be able to mix and match that stuff.

And just because I want to stay [laugh] on message, that's really where the LogStream story kind of blends in because we do that; we allow you to mix and match all those different pieces.

Corey: Joke's on you. I use Nagios and I have no friends. I'm not convinced those two things are entirely unrelated, but here we are. So here's, I guess, the big burning question that a lot of folks—certainly not me, but other undefined folks, ‘lots of people are saying’—so you built something interesting that actually works. I want to be clear on this.

I have spoken to customers of yours. They swear by it instead of swearing at it, which happens with other companies. Awesome. You have traction, you're moving forward, things are going great. Here's $200 million is the next part of that story, and on some level, my immediate reaction—which does need updating, let's be clear here—is like, all right.

I'm trying to build a product. I can see how I could spend a few million bucks. “Well, what can you do with I don't know, 100 times that?” My easy answer is, “Something monstrous.” I don't believe that is the case here. What is the growth plan? What are you doing that makes having that kind of a war chest a useful and valuable thing to have?

Nick: Well, if you speak with the co-founders—and they've been open about this—we view ourselves as a generational company. We're not just building one product. We've been thinking about, how do we deliver on observability as this idea of discovery? What does that take? And it doesn't mean that we're going to be less agnostic to other destinations, we still think there's an incredible amount of value there and that's not going away, but we think there's maybe an interim step that we build out, potentially this idea of an observability data lake where you can explore these environments.

Certainly, there's other types of options in the space today. 
Most of them are SQL-based, which is interesting because the audience that uses monitoring and observability tools couldn't care less about SQL, right? They want search, they want regex, and so you've got to have the right tool for that audience. And so we're thinking about what that looks like going forward. We're doubling down on people.

Surprisingly, this is a very—like anything else in software, it is people-intensive. And so certainly those are other aspects that we're exploring with the recent investment, but definitely, multiproduct company is our future and continued expansion.

Corey: Expansion is always a fun one. It's the idea of, great, are you looking at going deeper into the areas you're already active within, or is it more of a, “Ah, so we've solved the, effectively, log routing problem. That's great. Let's solve other problems, too.” Or is it more of a, I guess, a doubling down and focusing on what's working? And again, that probably sounds judgmental in a way I don't intend it to at all. I just have a hard time contextualizing that level of scale coming from a small company perspective the way that I do.

Nick: Yeah. Our plan is to focus more intently on the areas that we're in. We have a huge basis of experience there. We don't want to be all things to all people; that dilutes the message down to nothing, so we want to be very specific in the audiences we talk to, the problems we're trying to solve, and how we try to solve them.

Corey: The problem I've always found with a lot of the acquisition, growth thrashing of—let me call it what I think it is: companies in decline trying to strain relevancy, it feels almost like a, “We don't see a growth strategy. So, we're going to try and acquire everything that holds still long enough, at some level, trying to add more revenue to the pile, but also thrashing in the sense of, okay. They're going to teach us how to do things in creative, awesome ways,” but it never works out that way. 
When you have a 50,000 person company acquiring a 200 person company, invariably the bigger culture is going to dominate. And I don't understand why that mistake seems to continually happen again, and again, and again.

And people think I'm effectively alluding to—or whatever the spoken word version of subtweeting is—a particular company or a particular acquisition. I'm absolutely not, there are probably 50 different companies listening right now who think, “Oh, God. He's talking about us.” It's the common repeating trend. What is that?

Nick: It's hard to say. In some cases, these acquisitions might just be talent. “We need to know how to do X. They know how to do X. Let's do it.” They may have very unique niche technology or software that another company thinks they can more broadly apply.

Also, some of these big companies, these may not be board-level or CEO-level decisions. A business unit might decide, “Oh, I like what that company is doing. I'm going to go acquire it.” And so it looks like MegaCorp bought TinyCorp, but it's really, this tiny business unit within MegaCorp bought tiny company. The reality is often different from what it looks like on the outside.

So, that's one way. Another is, you know, if they're going to teach us to be more effective with tech or something like that, you're never going to beat culture. You're never going to beat the existing culture. If it's 50,000, against 200, obviously we know who wins there. And so I don't know if that's realistic.

I don't know if the big companies are genuine when they say that, but it could just be the messaging that they use to make people happy and hopefully retain as many of those new employees for as long as they can. Does that make sense?

Corey: No, it makes perfect sense. It's the right answer. It does articulate what is happening there, and I think I keep falling prey to the same failure. And it's hard. 
It's pernicious, but companies are not monolithic entities.

There's no one person at each of these companies who is making these giant unilateral decisions. It's always some product manager or some particular person who has a vision and a strategy in the department. It is not something that the company board is agreeing on every little decision that gets made. They're distributed entities in many respects.

Nick: Absolutely. And that's only getting more pervasive as companies get larger [laugh] through acquisition. So, you're going to see more and more of that, and so it's going to look like we're going to put one label on it, one brand. Often, I think internally, that's the exact opposite of what actually happened, how that decision got made.

Corey: Nick, I want to thank you for taking so much time to speak with me about what you're up to over there, how your path has shaped, how you view the world, and also what Cribl does these days. If people want to learn more about what you're up to, how you think about the world, or even possibly going to work at Cribl which, having spoken to a number of people over there, I would endorse it. How do they find you?

Nick: Best place to find us is by joining our community: cribl.io/community, and Cribl is spelled C-R-I-B-L. You can certainly reach out there, we've got about 2300 people in our community Slack, so it's a great group. You can also reach out to me on Twitter, I'm @nheudecker, N-H-E-U-D-E-C-K-E-R. Tell me what you thought of the episode; love to hear it. And then beyond that, you can also sign up for our free cloud tier at cribl.cloud. It's a pretty generous one terabyte a day processing, so you can start to send data in and send it wherever you'd like it to be.

Corey: To be clear, this is free as in beer, not free as in AWS free tier?

Nick: This is free as in beer.

Corey: Excellent. Excellent.

Nick: I think I'm getting that right. I think it's free as in beer. 
And the other thing you can try is our hosted solution on AWS, fully managed cloud at cribl.cloud, we offer a free one terabyte per day processing, so you can start to send data into that environment and send it wherever you'd like it to go, in whatever shape that data needs to be in when it gets there.

Corey: And we will, of course, put links to that in the [show notes 00:35:21]. Thank you so much for your time today. I really appreciate it.

Nick: No, thank you for having me. This was a lot of fun.

Corey: Nick Heudecker, senior director, market strategy and competitive intelligence at Cribl. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment explaining that the only real reason a startup should raise a $200 million funding round is to pay that month's AWS bill.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Keeping the Cloudwatch with Ewere Diagboya

Oct 14, 2021 32:21


About Ewere
Cloud, DevOps Engineer, Blogger and Author

Links:
Infrastructure Monitoring with Amazon CloudWatch: https://www.amazon.com/Infrastructure-Monitoring-Amazon-CloudWatch-infrastructure-ebook/dp/B08YS2PYKJ
LinkedIn: https://www.linkedin.com/in/ewere/
Twitter: https://twitter.com/nimboya
Medium: https://medium.com/@nimboya
My Cloud Series: https://mycloudseries.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Honeycomb. When production is running slow, it's hard to know where problems originate: is it your application code, users, or the underlying systems? I've got five bucks on DNS, personally. Why scroll through endless dashboards, while dealing with alert floods, going from tool to tool to tool that you employ, guessing at which puzzle pieces matter? Context switching and tool sprawl are slowly killing both your team and your business. You should care more about one of those than the other, which one is up to you. Drop the separate pillars and enter a world of getting one unified understanding of the one thing driving your business: production. With Honeycomb, you guess less and know more. Try it for free at Honeycomb.io/screaminginthecloud. Observability, it's more than just hipster monitoring.

Corey: This episode is sponsored in part by Liquibase. If you're anything like me, you've screwed up the database part of a deployment so severely that you've been banned from touching anything that remotely sounds like SQL, at at least three different companies. 
We've mostly got code deployments solved for, but when it comes to databases we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn't have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you'll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I periodically make observations that monitoring cloud resources has changed somewhat since I first got started in the world of monitoring. My experience goes back to the original Call of Duty. That's right: Nagios.

When you set instances up, it would theoretically tell you when they were unreachable or certain thresholds didn't work. It was janky but it kind of worked, and that was sort of the best we had. The world has progressed as cloud has become more complicated, as technologies have become more sophisticated, and here today to talk about this is the first AWS Hero from Africa and author of a brand new book, Ewere Diagboya. Thank you for joining me.

Ewere: Thanks for the opportunity.

Corey: So, you recently published a book on CloudWatch. To my understanding, it is the first such book that goes in-depth with not just how to wind up using it, but how to contextualize it as well. How did it come to be, I guess is my first question?

Ewere: Yes, thanks a lot, Corey. 
The name of the book is Infrastructure Monitoring with Amazon CloudWatch, and the book came to be from the concept of looking at the ecosystem of AWS cloud computing and we saw that a lot of the things around cloud—I mostly talked about—most of this is [unintelligible 00:01:49] compute part of AWS, which is EC2, the containers, and all that, you find books on all those topics. They are all proliferated all over the internet, you know, and videos and all that.

But there is a core behind each of these services that no one actually talks about and amplifies, which is the monitoring part, which helps you to understand what is going on with the system. I mean, knowing what is going on with the system helps you to understand failures, helps you to predict issues, helps you to also envisage when a failure is going to happen so that you can remedy it and also [unintelligible 00:02:19], and in some cases, even give you a historical view of the system to help you understand how a system has behaved over a period of time.

Corey: One of the articles that I put out that first really put me on AWS's radar, for better or worse, was something that I was commissioned to write for Linux Journal, back when that was a print publication. And I accidentally wound up getting the cover of it with my article, “CloudWatch is of the devil, but I must use it.” And it was a painful problem that people generally found resonated with them because no one felt they really understood CloudWatch; it was incredibly expensive; it didn't really seem like it was at all intuitive, or that there was any good way to opt out of it, it was just simply there, and if you were going to be monitoring your system in a cloud environment—which of course you should be—it was just sort of the cost of doing business that you then have to pay for a third-party tool to wind up using the CloudWatch metrics that it was gathering, and it was just expensive and unpleasant all around. 
Now, a lot of the criticisms I put about CloudWatch's limitations in those days, about four years ago, have largely been resolved or at least mitigated in different ways. But is CloudWatch still crappy, I guess, is my question?

Ewere: Um, yeah. So, at the moment, I think, like you said, CloudWatch has really evolved over time. I personally also had that issue with CloudWatch when I started using CloudWatch; I had the challenge of usability, I had the challenge of proper integration, and I will talk about my first experience with CloudWatch here. So, when I started my infrastructure work, one of the things I was doing a lot was EC2, basically. I mean, everyone always starts with EC2 at the first time.

And then we had a downtime. And then my CTO says, “Okay, [Ewere 00:04:00], check what's going on.” And I'm like, “How do I check?” [laugh]. I mean, I had no idea of what to do.

And he says, “Okay, there's a tool called CloudWatch. You should be able to monitor.” And I'm like, “Okay.” I dive into CloudWatch, and boom, I'm confused again. And you look at the console, you see, it shows you certain metrics, and yet [people 00:04:18] don't understand what the CPU metric talks about, what network bandwidth talks about.

And here I am trying to dig, and dig, and dig deeper, and I still don't get [laugh] a sense of what is actually going on. But what I needed to find out was, I mean, what was wrong with the memory of the system, so I delved into trying to install the CloudWatch agent, get metrics and all that. But the truth of the matter was that I couldn't really solve my problem very well, but I had [unintelligible 00:04:43] of knowing that I don't have memory out of the box; it's something that has to be set up differently. And trust me, after that I didn't touch CloudWatch [laugh] again. 
Because, like you said, it was a problem, it was a bit difficult to work with.

But fast forward a couple of years later, I could actually see someone use CloudWatch for a lot of beautiful stuff, you know? It creates beautiful dashboards, creates some very well-aggregated metrics. And also with the aggregated alarms that CloudWatch comes with, [unintelligible 00:05:12] easy for you to avoid what I call incident fatigue. And then also, the dashboards. I mean, there are so many dashboards that are simplified to work with, and it makes it easy and straightforward to configure.

So, the bootstrapping and the changes and the improvements on CloudWatch over time have made CloudWatch a go-to tool, and most especially the integration with containers and Kubernetes. I mean, CloudWatch is one of the easiest tools to integrate with EKS, Kubernetes, or other container services that run in AWS; it's just, more or less, one or two lines of setup, and here you go with a lot of beautiful, interesting, and insightful metrics that you will not get out of the box, and if you look at other monitoring tools, it takes a lot of time for you to set up, for you to configure, for you to consistently maintain and to give you those consistent metrics you need to know what's going on with your system from time to time.

Corey: The problem I always ran into was that the traditional tools that I was used to using in data centers worked pretty well because you didn't have a whole lot of variability on an hour-to-hour basis. Sure, when you installed new servers or brought up new virtual machines, you had to update the monitoring system. But then you started getting into this world of ephemerality with auto-scaling originally, and later containers, and—God help us all—Lambda now, where it becomes this very strange back-and-forth story of, you need to be able to build something that, I guess, is responsive to that. 
And there's no good way to get access to some of the things that CloudWatch provides, just because we didn't have access into AWS's systems the way that they do. The inverse, though, is that they don't have access into things running inside of the hypervisor; a classic example has always been memory: memory usage is an example of something that hasn't been able to be displayed traditionally without installing some sort of agent inside of it. Is that still the case? Are there better ways of addressing those things now?

Ewere: So, that's still the case, I mean, for EC2 instances. So, before now, we had an agent called the CloudWatch agent. Now, there's a new agent called the unified CloudWatch agent which is, I mean, a notch above the CloudWatch agent. So, at the moment, basically, that's what happens on the EC2 layer. But the good thing is when you're working with containers, or more or less Kubernetes kind of applications or systems, everything comes out of the box.

So, with containers, we're talking about a [laugh] lot of moving parts. The containers themselves with their own CPU, memory, disk, all the metrics, and then the nodes—or the EC2 instances or the virtual machines running behind them—also having their own unique metrics. So, within the container world, these things are just a click of a button. Everything happens at the same time as a single entity, but within the EC2 instance and ecosystem, you still find this there, although the setup process has been a bit easier and much faster. But in the container world, that problem has totally been eliminated.

Corey: When you take a look at someone who's just starting to get a glimmer of awareness around what CloudWatch is and how to contextualize it, what are the most common mistakes people make early on?

Ewere: I also talked about this in my book, and one of the mistakes people make in terms of CloudWatch, and monitoring in general, is not asking, “What am I trying to figure out?” [laugh]. 
If you don't have that answer clearly stated, you're going to run into a lot of problems. You need to answer that question of, “What am I trying to figure out?” I mean, monitoring is so broad, monitoring is so large that if you do not have the answer to that question, you're going to get yourself into a lot of trouble, you're going to get yourself into a lot of confusion, and like I said, if you don't understand what you're trying to figure out in the first place, then you're going to get a lot of data, you're going to get a lot of information, and that can get you confused.

And I also talked about what I call alarm fatigue or incident fatigue. This happens when you configure so many alarms, so many metrics, and you're getting a lot of alarms hitting notification services—whether it's Slack, whether it's an email—and it causes fatigue. What happens here is the person who should know what is going on with the system gets a ton of messages and in that scenario can miss something very important because there's so many messages coming in, so many integrations coming in. So, you should be able to optimize appropriately, to be able to, like you said, conceptualize what you're trying to figure out, what problems are you trying to solve? Most times you really don't figure this out for a start, but there are certain bare minimums you need to know about, and that's part of what I talked about in the book.

One of the things that I highlighted in the book when I talked about monitoring of different layers is, when you're talking about monitoring of infrastructure, say compute services, such as virtual machines, or EC2 instances, there are certain baseline metrics you need to take note of that are core to the reliability, the scalability, and the efficiency of your system. And if you focus on these things, you can have a baseline starting point before you start going deeper into things like observability and knowing what's going on entirely with your system. 
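To make the alarm-fatigue point above concrete, here is a minimal sketch, assuming a hypothetical in-memory model of alarms rather than any real CloudWatch API. The names `ChildAlarm` and `summarize`, the `service` grouping key, and the sample alarm names are all illustrative assumptions. The idea is only that many individual firing alarms get collapsed into one message per service, so the person on call reads one line instead of a flood of notifications.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical alarm states for this sketch; not tied to any real service API.
OK = "OK"
ALARM = "ALARM"


@dataclass(frozen=True)
class ChildAlarm:
    name: str     # hypothetical alarm name, e.g. "web-1-cpu-high"
    service: str  # hypothetical grouping key, e.g. "web"
    state: str    # either OK or ALARM


def summarize(alarms: List[ChildAlarm]) -> List[str]:
    """Return one message per service that has at least one firing alarm."""
    firing: Dict[str, List[str]] = {}
    for alarm in alarms:
        if alarm.state == ALARM:
            firing.setdefault(alarm.service, []).append(alarm.name)
    # One line per service, in a stable order, instead of one per alarm.
    return [
        f"{service}: {len(names)} alarm(s) firing ({', '.join(sorted(names))})"
        for service, names in sorted(firing.items())
    ]


if __name__ == "__main__":
    sample = [
        ChildAlarm("web-1-cpu-high", "web", ALARM),
        ChildAlarm("web-2-cpu-high", "web", ALARM),
        ChildAlarm("db-1-disk-low", "db", OK),
        ChildAlarm("web-1-mem-high", "web", ALARM),
    ]
    for line in summarize(sample):
        print(line)
```

In this sketch three firing alarms on the same service produce a single summary line and the healthy database alarm produces none. Which metrics deserve an alarm in the first place is the baseline question raised above, and the sketch does not try to answer it.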
So, baseline understanding of—baseline metrics, and baseline of what you need to check in terms of different kinds of services you're trying to monitor is your starting point. And the mistake people make is that they don't have a baseline. So, they do not have a baseline; they just install a monitoring tool, configure a CloudWatch, and they don't know the problem they're trying to solve [laugh] and that can lead to a lot of confusion.

Corey: So, what inspired you from, I guess, kicking the tires on CloudWatch—the way that we all do—and being frustrated and confused by it, all the way to the other side of writing a book on it? What was it that got you to that point? Were you an expert on CloudWatch before you started writing the book, or was it, “Well, by the time this book is done, I will certainly know [laugh] more about the service than I did when I started.”

Ewere: Yeah, I think it's a double-edged sword. [laugh]. So, it's a combination of the things you just said. So, first of all, I have experience with other monitoring tools; I have love for reliability and scalability of a system. I started Kubernetes at some of the early times Kubernetes came out, when it was very difficult to deploy, when it was very difficult to set up.

Because I'm looking at how I can make systems a little bit more efficient, a little bit more reliable than having to handle a lot of things like auto-scaling, having to go through the process of understanding how to scale. I mean, that's a school of its own that you need to prepare yourself for. So, first of all, I have a love for making sure systems are reliable and efficient, and second of all, I also want to make sure that I know what is going on with my system per time, as much as possible. The level of visibility of a system gives you the level of control and understanding of what your system is doing per time. 
So, those two things are very core to me.

And then thirdly, I had a plan of a streak of books I want to write based on AWS, and just like monitoring is something that is just new. I mean, if you go to the Packt website, this is the first book on infrastructure monitoring on AWS with CloudWatch; it's not a very common topic to talk about. And I have other topics in my head, and I really want to talk about things like networking, and other topics that you really need to go deep inside to be able to appreciate the value of what you see in there with all those scenarios because in this book, every chapter, I created a scenario of what a real-life monitoring system or what you need to do looks like. So, being that I have those premonitions, I know that whenever it came to, you know, to share with the world what I know in monitoring, what I've learned in monitoring, I took a [unintelligible 00:12:26]. And then secondly, this was an opportunity for me to start telling the world about the things I learned, and then I also learned while writing the book because there are certain topics in the book that I'm not so much of an expert in, things like big data and all that.

I had to also learn; I had to take some time to do more research, to do more understanding. So, I use CloudWatch, okay? I'm kind of good in CloudWatch, and also, I also had to do more learning to be able to disseminate this information. And also, hopefully, X-Ray some parts of monitoring and different services that people do not really pay so much attention to.

Corey: What do you find that is still the most, I guess, confusing to you as you take a look across the ecosystem of the entire CloudWatch space? I mean, every time I play with it, I take a look, and I get lost in, “Oh, they have contributor analyses, and logs, and metrics.” And it's confusing, and every time I wind up, I guess, spiraling out of control. 
What do you find that, after all of this, is a lot easier for you, and what do you find that's a lot more understandable?

Ewere: I'm still going to go back to the containers part. I'm sorry, I'm in love with containers. [laugh].

Corey: No, no, it's fair. Containers are very popular. Everyone loves them. I'm just basically anti-container based upon no better reason than I'm just stubborn and bloody-minded most of the time.

Ewere: [laugh]. So, pretty much like I said, I kind of had experience with other monitoring tools. Trust me, if you want to configure proper container monitoring for other tools, trust me, it's going to take you at least a week or two to get it properly, from the dashboards, to the logging configurations, to the piping of the data to the proper storage engine. These are things I talked about in the book because I took monitoring from the ground up. I mean, if you've never done monitoring before, when you take my book, you will understand the basic principles of monitoring.

And [funny 00:14:15], you know, monitoring has some big data process, like an ETL process: extraction, transformation, and writing of data into an analytic system. So, first of all, you have to battle that. You have to talk about the availability of your storage engine. What are you using? An Elasticsearch? Are you using an InfluxDB? Where do you want to store your data? And then you have to answer the question of how do I visualize the data? By what method do I visualize this data? What kind of dashboards do I want to use? What methods of representation do I need to represent this data so that it makes sense to whoever I'm sharing this data with? Because in monitoring, you definitely have to share data with either yourself or with someone else, so the way you present the data needs to make sense. I've seen graphs that do not make sense. So, it requires some level of skill. Like I said, I've [unintelligible 00:15:01] where I spent a week or two having to set up dashboards. 
And then after setting up the dashboard, someone was like, “I don't understand, and we just need, like, two.” And I'm like, “Really?” [laugh]. You know? Because you spend so much time. And secondly, you discover that repeatability of that process is a problem. Because some of these tools are click and drag; some of them don't have JSON configuration. Some do, some don't. So, you discover that scalability of this kind of system becomes a problem. You can't repeat the dashboards: if you make a change to the system, you need to go back to your dashboard, you need to make some changes, you need to update your logging, too, you need to make some changes across the layer. So, all these things are a lot of overhead [laugh] that you can cut off when you use things like Container Insights, which is a feature of CloudWatch. So, for me, that's a part that you can really, really suck so much juice out of in a very short time, quickly and very efficiently.

On the flip side, when you talk about monitoring for big data services, and monitoring for a little bit of serverless, there might be a little steepness in the learning curve there, because if you do not have a good foundation in serverless, when you get into [laugh] Lambda Insights in CloudWatch, trust me, you're going to be put off by that; you're going to get a little bit confused. And then there's also multifunction insights at the moment. So, you need to have a very good, solid foundation in some of those topics before you can get in there and understand some of the data and the metrics that CloudWatch is presenting to you. And then lastly, with things like big data, too, there are things where monitoring is still being properly fleshed out.
I think that in the coming months and years, they will become more proper and more presentable than they are at the moment.

Corey: This episode is sponsored by our friends at Oracle. HeatWave is a new high-performance accelerator for the Oracle MySQL Database Service. Although I insist on calling it “my squirrel.” While MySQL has long been the world's most popular open source database, shifting from transacting to analytics required way too much overhead and, ya know, work. With HeatWave you can run your OLTP and OLAP (don't ask me to ever say those acronyms again) workloads directly from your MySQL database and eliminate the time-consuming data movement and integration work, while also performing 1100X faster than Amazon Aurora, and 2.5X faster than Amazon Redshift, at a third of the cost. My thanks again to Oracle Cloud for sponsoring this ridiculous nonsense.

Corey: The problem I've always had with dashboards is it seems like managers always want them (“More dashboards, more dashboards”), then you check the usage statistics of who's actually been viewing the dashboards and the answer is, no one since you demoed it to the execs eight months ago. But they always claim to want more. How do you square that? I guess, slicing between what people asked for and what they actually use.

Ewere: [laugh]. So yeah, one of the interesting things about dashboards, most especially in infrastructure monitoring, is that the dashboards people really want are revenue dashboards. Trust me, that's what they want to see; they want to see the money going up, up, up, [laugh] you know? So, when it comes to...

Corey: Oh, yes. Up and to the right, then everyone's happy. But CloudWatch tends to give you just very, very granular, low-level metrics of things; it's hard to turn that into something executives care about.

Ewere: Yeah, what people really care about.
But my own take on that is, the dashboards are actually for you and your team to watch, to know what's going on from time to time. But what is key is setting up events across very specific and sensitive data. For example, when any kind of sensitive data is flowing across your system and you need to check that out, then you tie a metric to that, and in turn an alarm to it. That is actually the most important thing for anybody.

I mean, the dashboards are just for you and your team, like I said, for your personal consumption. “Oh, I can see all the RDS connections are getting too high, we need to upgrade.” “Oh, we can see there was a memory spike in the last two hours.” That's for you and your team to consume; not for the executive team. But what is really good is being able to do things like aggregate data that you can share.

I think that is what the executive team would love to see. When you go back to the core principles of DevOps in terms of the DevOps Handbook, you see things like mean time to recover, and change failure rate, and all that. The most interesting thing is that all these metrics can be measured only by monitoring. You cannot know your change failure rate if you don't have a monitoring system that tells you when there was a failure. You cannot know your release frequency when you don't have a metric that measures the number of deployments you have and is audited in a particular metric or a particular aggregator system.

So, we discovered that the four major things you measure in DevOps are all tied back to monitoring and metrics, at minimum, to understand your system from time to time. So, what the executive team actually needs is to get a summary of what's going on. And one of the things I usually do for almost any company I work for is to share some kind of uptime system with them. And that's where CloudWatch Synthetics canaries come in.
So, a Synthetics canary is a service that helps you check for uptime of the system.

So, it's a very simple service. It does a ping, but it is so efficient, and it is so powerful. How is it powerful? It does a ping to a system and it gets feedback. Now, if the status code of your service is not 200 or 300, it considers it downtime.

Now, when you aggregate this data within a period of time, say a month or two, you can actually use that data to calculate the uptime of your system. And that uptime [unintelligible 00:19:50] is something you can actually share with your customers and say, “Okay, we have an SLA of 99.9%. We have an SLA of 99.8%.” That data should not be doctored data; it should not be data you just cook up out of your head; it should be based on your system that you have used, worked with, and monitored over a period of time, so that the information you share with your customers is genuine, truthful, and something that they can also see for themselves.

Hence companies are using [unintelligible 00:20:19] like status page to know what's going on from time to time whenever there is an incident and report back to their customers. So, these are things that executives will be more interested in than just dashboards, [laugh] dashboards, and more dashboards. So, it's more or less not about what they really ask for, but what you know and what you believe you are going to draw value from. I mean, an executive is in a meeting with a client and says, “Hey, we got a system that has 99.9% uptime.”

He opens the dashboard or he opens the uptime system and says, “You see our uptime? For the past three months, this has been our metric.” Boom. [snaps fingers]. That's it. That's value, instantly. I'm not showing [laugh] the clients and point of graphs, you know?
“Can you explain the memory metric?” That's not going to pass the message, send the message forward.

Corey: Since your book came out, I believe, or if not, certainly by the time it was finished being written and in review phase, they came out with Managed Prometheus and Managed Grafana. It looks almost like they're trying to do a completely separate standalone monitoring stack of AWS tooling. Is that a misunderstanding of what the tools look like, or is there something to that?

Ewere: Yeah. So, I mean, by the time those were announced at re:Invent, I'm like, “Oh, snap.” I almost told my publisher, “You know what? We need to add three more chapters.” [laugh]. But unfortunately, we're still in review, in preview.

I mean, as a Hero, I kind of have some privilege to be able to request that, but I'm like, okay, I think it's going to change the narrative of what the book is talking about. I think I'm going to pause on that and make sure this finishes with the [unintelligible 00:21:52], and then maybe in a second edition, I can always attach that. But hey, I think there's trying to be a galvanization between Prometheus, Grafana, and what CloudWatch stands for. Because at the moment, I think it's currently on pre-release, it's not fully GA at the moment, so you can actually use it. So, if you go to Container Insights, you can see that you can still get how Prometheus and Grafana are presenting the data.

So, it's more or less a different view of what you're trying to see. It's trying to give you another perspective of how your data is presented. So, you're going to have CloudWatch: it's going to have CloudWatch dashboards, it's going to have CloudWatch metrics, but hey, these different tools, Prometheus, Grafana, and all that, they all have their unique ways of presenting the data.
And part of the reason I believe AWS has Prometheus and Grafana there is, I mean, Prometheus is a huge cloud-native open-source monitoring, presentation, analytics tool; it packs a lot of heat, and a lot of people are so used to it. Everybody's like, “Why can't I have Prometheus in CloudWatch?”

I mean, so instead of CloudWatch just being a simple monitoring tool, [unintelligible 00:22:54] CloudWatch has become an ecosystem of monitoring tools. So, we got... we're not going to see cloud [unintelligible 00:23:00], or just [unintelligible 00:23:00] log, analytics, metrics, dashboards, no. We're going to see it as an ecosystem where we can plug in other services, and then integrate and work together to give us better performance options, and also different perspectives on the data that is being collected.

Corey: What do you think is next, as you take a look across the ecosystem, as far as how people are thinking about monitoring and observability in a cloud context? What are they missing? Where does the next evolution lead?

Ewere: Yeah, I think the biggest problem with monitoring, which is part of the introduction of the book, where I talked about the basic types of monitoring (proactive and reactive monitoring), is how do we make sure we know before things happen? [laugh]. And one of the things that can help with that is machine learning. There is a small ecosystem that is not so popular at the moment, which talks about how we can do a lot of machine learning in DevOps monitoring and observability. And that means looking at historic data and being able to predict on the basic level.

Looking at history, [then are 00:24:06] being able to predict. At the moment, there are very few tools that have models running at the back of the data being collected for monitoring and metrics, which could actually revolutionize monitoring and observability as we see it right now. I mean, even the topic of observability is still new at the moment. It's still very integrated.
Observability just came into Cloud, I think, like, two years ago, so it's still maturing.

But one thing that has been missing is seeing the value AI can bring into monitoring. I mean, this much [unintelligible 00:24:40] practically tell us, “Hey, by 9 p.m. I'm going to go down. I think your CPU or memory is going down. I think line 14 of your code [laugh] is a problem causing the bug. Please, you need to fix it by 2 p.m. so that by 6 p.m., things can run perfectly.” That is going to revolutionize monitoring. That's going to revolutionize observability and bring a whole new level to how we understand and monitor the systems.

Corey: I hope you're right. If you take a look right now, I guess, at the schism between monitoring and observability (which I consider to be hipster monitoring, but they get mad when I say that), is there a difference? Is it just new phrasing to describe the same concepts, or is there something really new here?

Ewere: In my book, I said, monitoring is looking at it from the outside in, observability is looking at it from the inside out. So, what monitoring does not see underneath, basically, observability sees. So, they are children of the same mom. That's how I put it. One actually needs the other and both of them cannot be separated from each other.

What we've been working with is just understanding the system from the surface. When there's an issue, we go to the aggregated results that come out of the issue. Very basic example: you're in a Java application, and we all know Java is very memory intensive, on the very basic layer. And there's a memory issue. Most times, infrastructure is the first hit with the result of that.

But the problem is not the infrastructure, it's maybe the code.
Maybe garbage collection was not well managed; maybe they have a lot of variables in the code that are not used, and they're just filling up memory locations unnecessarily; maybe there's a loop that's not properly managed and properly optimized; maybe there's a resource or object that has been initialized and has not been closed, which will cause a heap in the memory. So, those are the things observability can help you track. Those are the things that we can help you see. Because observability runs from within the system and sends metrics out, while basic monitoring is about understanding what is going on on the surface of the system: memory, CPU, pushing out logs to know what's going on and all that.

So, on the basic level, observability helps give you, kind of, a deeper insight into what monitoring is actually telling you. It's just like the result of what happened. I mean, we are told that the symptoms of COVID are coughing, sneezing, and all that. That's monitoring. [laugh].

But before we know that you actually have COVID, we need to go for a test, and that's observability. Telling us what is causing the sneezing, what is causing the coughing, what is causing the nausea, all the symptoms that come out of what monitoring is saying. Monitoring is saying, “You have a cough, you have a runny nose, you're sneezing.” That is monitoring. Observability says, “There is a COVID virus in the bloodstream. We need to fix it.” So, that's how both of them act.

Corey: I think that is probably the most concise and clear definition I've ever gotten on the topic. If people want to learn more about what you're up to, how you think about these things (and of course, if they want to buy your book, we will include a link to that in the [show notes 00:27:40]), where can they find you?

Ewere: I'm on LinkedIn; I'm very active on LinkedIn, and I also shared the LinkedIn link. I'm very active on Twitter, too.
I tweet once in a while, but definitely, when you send me a message on Twitter, I'm also going to be very active.

I also write blogs on Medium, I write a couple of blogs on Medium, and that was part of why AWS recognized me as a Hero, because I talk a lot about different services, I help with comparing services for you so you can choose better. I also talk about basic concepts, too; if you just want to get your feet wet with some stuff and you need something very summarized, not AWS documentation per se, something that you can just look at and know what you need to do with the service, I talk about them also in my blogs. So yeah, those are the two basic places I'm in: LinkedIn and Twitter.

Corey: And we will, of course, put links to that in the [show notes 00:28:27]. Thank you so much for taking the time to speak with me. I appreciate it.

Ewere: Thanks a lot.

Corey: Ewere Diagboya, head of cloud at My Cloud Series. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment telling me how many more dashboards you would like me to build that you will never look at.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
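
The uptime calculation Ewere describes in the transcript above (ping an endpoint on a schedule, treat any status code outside the 200 and 300 ranges as downtime, then aggregate over a month or more) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not how CloudWatch Synthetics is implemented: the names `CheckResult`, `is_up`, and `uptime_percent` are hypothetical, the sample data is made up, and the checks are assumed to be evenly spaced so that the share of successful checks approximates the share of time the service was reachable.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CheckResult:
    """One synthetic check: the HTTP status code returned by a ping (hypothetical type)."""
    status_code: int


def is_up(result: CheckResult) -> bool:
    # Treat 2xx and 3xx responses as "up"; anything else counts as downtime.
    return 200 <= result.status_code < 400


def uptime_percent(results: Iterable[CheckResult]) -> float:
    """Return the share of successful checks as a percentage.

    Assumes evenly spaced checks, so the share of successful checks
    approximates the share of time the service was reachable.
    """
    results = list(results)
    if not results:
        raise ValueError("no check results to aggregate")
    up_count = sum(1 for r in results if is_up(r))
    return 100.0 * up_count / len(results)


if __name__ == "__main__":
    # Made-up sample: 1,000 checks, one of which returned a 503.
    checks = [CheckResult(200)] * 999 + [CheckResult(503)]
    print(f"uptime over the period: {uptime_percent(checks):.1f}%")
```

A figure computed this way comes from recorded checks rather than an estimate, which is the point made in the interview about sharing uptime numbers that customers can verify for themselves.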

Paul's Security Weekly TV
Exchange's Great Leak, RCE in VMware, IoT Bug in MQTT, & Chrome's Memory Safety Nets - ASW #167

Paul's Security Weekly TV

Sep 28, 2021 34:08


This week in the AppSec News: The Great Leak flaw in Exchange's Autodiscover feature, common flaws in VMware and Nagios, memory issues and SSRF in Apache's HTTP server, Chrome's plans for memory safety, State of DevOps report, OWASP's 20th anniversary, & more! Visit https://www.securityweekly.com/asw for all the latest episodes! Show Notes: https://securityweekly.com/asw167

Application Security Weekly (Audio)
Skills & Knowledge - ASW #167

Application Security Weekly (Audio)

Sep 28, 2021 71:48


This week, we welcome Anita D'Amico, VP, Market Development at Synopsys, and Patrick Carey, Senior Director of Product Marketing at Synopsys, to discuss AppSec Orchestration/Correlation & DevSecOps Efficiency! In the AppSec News: The Great Leak flaw in Exchange's Autodiscover feature, common flaws in VMware and Nagios, memory issues and SSRF in Apache's HTTP server, Chrome's plans for memory safety, State of DevOps report, OWASP's 20th anniversary, & more! Show Notes: https://securityweekly.com/asw167 Visit https://securityweekly.com/synopsys to learn more about them! Visit https://www.securityweekly.com/asw for all the latest episodes! Follow us on Twitter: https://www.twitter.com/securityweekly Like us on Facebook: https://www.facebook.com/secweekly

Paul's Security Weekly
Skills & Knowledge - ASW #167

Paul's Security Weekly

Sep 28, 2021 71:48


This week, we welcome Anita D'Amico, VP, Market Development at Synopsys, and Patrick Carey, Senior Director of Product Marketing at Synopsys, to discuss AppSec Orchestration/Correlation & DevSecOps Efficiency! In the AppSec News: The Great Leak flaw in Exchange's Autodiscover feature, common flaws in VMware and Nagios, memory issues and SSRF in Apache's HTTP server, Chrome's plans for memory safety, State of DevOps report, OWASP's 20th anniversary, & more! Show Notes: https://securityweekly.com/asw167 Visit https://securityweekly.com/synopsys to learn more about them! Visit https://www.securityweekly.com/asw for all the latest episodes! Follow us on Twitter: https://www.twitter.com/securityweekly Like us on Facebook: https://www.facebook.com/secweekly

Application Security Weekly (Video)
Exchange's Great Leak, RCE in VMware, IoT Bug in MQTT, & Chrome's Memory Safety Nets - ASW #167

Application Security Weekly (Video)

Sep 27, 2021 34:08


This week in the AppSec News: The Great Leak flaw in Exchange's Autodiscover feature, common flaws in VMware and Nagios, memory issues and SSRF in Apache's HTTP server, Chrome's plans for memory safety, State of DevOps report, OWASP's 20th anniversary, & more! Visit https://www.securityweekly.com/asw for all the latest episodes! Show Notes: https://securityweekly.com/asw167

K12 Tech Talk
Episode 44 - Stolen Devices, Filter Changes, and a PrintNightmare Patch

K12 Tech Talk

Sep 26, 2021 40:48


In this episode, Chris talks about applying a Windows update that broke all of his printing. Josh talks about a student who sent him an email asking for filter changes to be applied, and he talks about an experience with a stolen Chromebook. Cory talks about his new visitor management system. The newly released critical patches for Nagios and vCenter are also discussed. Article we discuss: https://www.bleepingcomputer.com/news/security/new-windows-security-updates-break-network-printing/ Check out our sponsor ClassLink | Single Sign-On for Education. Tweet us at @k12techtalkpod or email us at k12techtalk@gmail.com. BUY A SHIRT AND/OR HOODIE! https://tinyurl.com/k12techtalkGOTSHIRTS Visit our sponsors at: somethingcool.com provisionds.com arubanetworks.com classlink.com

IT-Keller
ITK055 Online-Garnelen!

IT-Keller

Aug 6, 2021 102:20


QR codes; EU Digital COVID Certificates (EUDCC); Handy-Signatur (mobile phone signature); Bürgerkarte (citizen card); eID; Fairphone 3+; /e/ Android alternative; The Manga Guide to Microprocessors; Es war einmal … das Leben (Once Upon a Time… Life); CDU-Connect; jö Bonus Club profiling; Ulrich's shrimp; WireGuard; Monit monitoring; Dart programming language; OpenNMS; Zabbix; Nagios; Icinga; syncing and backup; Goodbye, Twitter Fleets; Twitter Spaces; Twitter dislike up/down vote; throat microphone; CAT5 lunge line; Apple Glasses; social media 2.0; silent disco; Apple and left and right; Apple Magic Keyboard; a bit of spaceflight; My Little Pony. Guests: Bernhard, Sindre, Stefan, Ulrich

Screaming in the Cloud
All Along the Shoreline.io of Automation with Anurag Gupta

Screaming in the Cloud

Jul 20, 2021 39:15


This week Corey is joined by Anurag Gupta, founder and CEO of Shoreline.io. Anurag guides us through the large variety of services he helped launch, including RDS, Aurora, EMR, Redshift, and others. The result? Running things almost like a start-up, but with some distinct differences. Eventually Anurag ended up back in the testy waters of start-ups. He and Corey discuss the nature of that transition to get back to solving holistic problems, tapping into conveying those stories, and what Anurag was able to bring to his team at Shoreline.io, where automation is king. Anurag goes into the details of what Shoreline is and what they do. Stay tuned for more.

Links:
Shoreline.io: https://shoreline.io
LinkedIn: https://www.linkedin.com/in/awgupta/
Email: anurag@Shoreline.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: Your company might be stuck in the middle of a DevOps revolution without even realizing it. Lucky you! Does your company culture discourage risk? Are you willing to admit it? Does your team have clear responsibilities? Depends on who you ask. Are you struggling to get buy-in on DevOps practices? Well, download the 2021 State of DevOps report, brought to you annually by Puppet since 2011, to explore the trends and blockers keeping firms stuck in the middle of their DevOps evolution. Because they fail to evolve or die like dinosaurs. The significance of organizational buy-in, and oh it is significant indeed, and why team identities and interaction models matter. Not to mention whether the use of automation and the cloud translates to DevOps success. All that and more awaits you.
Visit www.puppet.com to download your copy of the report now!

Corey: If you're familiar with Cloud Custodian, you'll love Stacklet, which is made by the same people who made Cloud Custodian, but put something useful on top of it so you don't need to be a YAML expert to work with it. They're hosting a webinar called “Governance as Code: The Guardrails for Cloud at Scale” because it's a new paradigm that enables organizations to use code to manage and automate various aspects of governance. If you're interested in exploring this, you should absolutely make it a point to sign up, because they're going to have people who know what they're talking about. Just kidding, they're going to have me talking about this. It's going to be on Thursday, July 22nd at 1 p.m. Eastern. To sign up, visit snark.cloud/stackletwebinar and I'll talk to you on Thursday, July 22nd.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is brought to you by Shoreline, and I'm certain that we're going to get there, but first, I'm notorious for telling the story about how Route 53 is in fact a database, and anyone who disagrees with me is wrong. Now, AWS today is extraordinarily tight-lipped about whether that's accurate or not, so the next best thing, of course, is to talk to the person who used to run all of AWS's database offerings and start off there and get it from the source. Today, of course, he is not at Amazon, which means he's allowed to speak with me. My guest is Anurag Gupta, the founder and CEO of Shoreline.io. Anurag, thank you for joining me.

Anurag: Thanks for having me on the show, Corey. It's great to be on, and I've followed you for a long time. I think of you as AWS marketing, frankly.

Corey: The running gag has been that I am the de facto head of AWS marketing as a part-time gag because I wandered past and saw an empty seat and sat down and then got stuck with the role.
I mostly kid, but there does seem to be, at times, a bit of a challenge as far as expressing stories and telling those stories in useful ways. And some mistakes just sort of persist stubbornly forever. One of them is that in the list of services, Route 53 shows up as ‘networking and content delivery,’ which I think, regardless of the answer, doesn't really fit there. I maintain it's a database, but did you have oversight of that along with Glue, Athena, all the RDS options, and managed blockchain, for some reason, as well? Was it considered a database internally, or was that not really how they viewed it?

Anurag: It's not really how they view it. I mean, certainly there's a long IP table, right, and routing tables, but I think we characterized it in a whole different org. So, I had responsibility for Analytics, Redshift, Glue, EMR, et cetera, and transactional databases: Aurora, RDS, stuff like that.

Corey: Very often when you have someone who was working at a very large company (and yes, Amazon has a bunch of small teams internally, but let's face it, they're creeping up on $2 trillion in valuation at the time of this recording), it's fairly common to see that startups are, “Oh, this person was at Amazon for ages.” As if it's some sort of amazing selling point, because a company with, what is it, 1.2 million people give or take is absolutely like a relatively small just-founded startup culturally, in terms of resources, all the rest. Conversely, when you're working at scales like that, where the edge case becomes the common case, and the corner case becomes something that happens 18 times an hour, it informs the way you think about things radically differently.
And your reputation does precede you, so I'm going to opt for assuming that, rather than this being the story about, “Oh, we're just going to try and turn this company into the second coming of Amazon,” there's something that you saw while you were at AWS that you thought was an unmet need in the ecosystem, and that's what Shoreline is setting out to build. Is that slightly accurate? Or no, you're just basically there as a figurehead because the Amazon name is great for getting investors.

Anurag: No, that's very astute. So, when I joined AWS, they gave me eight people and they asked me to go disrupt data warehousing and transaction processing. So, those turned into Redshift and Aurora, respectively, and gradually I added on more services. But in that sense, Amazon does operate like a startup. They really believe in restricting the number of resources you get so that you have time and you're forced to think and be creative.

That said, you don't really wake up at night sweating about whether you're going to hit payroll. This is, sort of, my fourth startup at this point and there are sleepless nights at a startup and it's different. I'd go launch a service at AWS and there'd be 1000 people signed up to the beta the next day, and that's not the way startups work. But there are advantages as well.

Corey: I can definitely empathize with that. My last job before I started this place was at a small scrappy startup, which was great for three months, and then BlackRock bought us, and then, oh, large regulated finance company combined with my personality ended about the way you think it would. And so, instead of having the fears and the challenges that I dealt with then, I'm going to go start my own company and have different challenges. And yeah, they are definitely different.
I never lay awake at night worrying about how I was going to make payroll, for example.

There's also the freedom, in some ways, at large companies where whatever function needs to get done, whatever problem you have, there is some department somewhere that handles that almost exclusively, whereas in scrappy startup land, it's, well, whatever problem needs to get done today, that is your job right now. And your job description can easily fill six pages by the end of month two. It's a question of trade-offs and the rest. What did you see that gave you the idea to go for startup number four?

Anurag: So, when I joined AWS thinking I was going to build a bunch of database engines (and I've done that before), what I learned is that building services is different than building products. And in particular, nobody cares about your performance or features if your service isn't up. Inside AWS, we used to talk about utility computing, you know, metering and providing compute, storage, and database the way, you know, my local utility provider, PG&E, provides power and gas. And if I call up PG&E and say that the power is out at my house, I don't really want to hear, “Oh, did you know that we have six nines power availability in the state of California?” I mean, the power is still out; go come over here and fix it. And I don't really care about fancy new features they're doing back at the plant. Really, all I care about is cost and availability.

Corey: The idea of utility computing got into that direction, too, in a lot of ways, in some strange nuances, too. The idea that when I flip the light switch, I don't stop and wonder, is the light going to turn on? You know, until I installed IoT switches and then everything's a gamble in the wild times again. And if the light doesn't come on, I assume that the fuse is out, or the light bulb is blown. “Did PG&E wind up dropping service to my neighborhood?” is sort of the last question that I have on that list.
It took a while for cloud to get there, but at this point, if I can't access something in AWS, my default assumption is that it is my local internet, not the cloud provider. That was hard-won.

Anurag: That's right. And so I think a lot of other SaaS companies, or anybody operating in the cloud, are now working and struggling to get that same degree of availability and confidence to supply to their customers. And so that's really the reason for Shoreline.

Corey: There's been a lot of discussion around the idea of availability and what that means for a business outcome where, I still tell the story from time to time that back in 2012 or so, I was going to buy a pair of underpants on amazon.com, where I buy everything, and instead of completing the purchase, it threw one of the great pictures of staff dogs up. Now, if you listen to a lot of reports on availability, then for one day out of the week, I would just not wear underwear. In practice, I waited an hour, tried it again, the purchase went through and it was fine. However, if that happened every third time I tried to make a purchase, I would spend a lot more money at Target.

There has to be a baseline level of availability. That doesn't mean that your site is never down, period, because that is, in many cases, an unrealistic aspiration and it turns every outage that winds up coming up down the road into an all-hands-on-deck five-alarm fire, which may not be warranted. But you do need to have a certain level of availability that meets or exceeds your customer's expectations of same. At least that's the way that I've always viewed it.

Anurag: I think that's exactly right. I also think it's important to look at it from a customer perspective, not a fleet perspective. So, a lot of people do inward-facing SRE measurements of fleet-wide availability. Now, your customer really cares about the region they're in, or perhaps even the particular host they're on. And that's even more true if they've got data.
So, for example, an individual database failing, it'll take a long time for it to come back up elsewhere. That's different than something more ephemeral, like an instance, which you can move more easily.
Corey: Part of the challenge that I've noticed as well when dealing with large cloud providers, a recurring joke has been the AWS status page: it is the purest possible expression of a static site because it never changes. And people get upset when things go down and the status page isn't updated, but the challenge is when you're talking about something that is effectively global scale, it stops being a question of is it up or is it down and transitions long before then into how up or how down is it? And things that impact one customer may very well completely miss another. If you're being an absolutist, it will always be a sea of red, which doesn't tell people anything useful. Whereas if a customer is down and their site is off, they don't really care that most other customers aren't affected.
I mean, on some level, you kind of want everyone to be down because that defers headline risk, as well as if my site is having a problem, it could be days before someone gets around to fixing a small bug, whereas if everything is down, oh, this will be getting attention very rapidly.
Anurag: That's exactly right. Sounds like you've done ops before.
Corey: Oh, yes. You can tell that because I'm cynical and bitter about everything.
Anurag: [laugh].
Corey: It doesn't take long working in operationally-focused roles to get there. I appreciate your saying that though. Usually, people say, “Let me guess. You used to be an ops person.” “How can you tell?” “Because your code is garbage,” is the other way that people go down that path.
And yeah, credit where due; they're not wrong. You mentioned that back when you were in Amazon, you were given a team of eight people and told to disrupt the data warehouse.
Yeah, I've disrupted the data warehouse as a single person before so it doesn't seem that hard. But I'm guessing you mean something beyond causing an outage. It's more about disrupting the space, presumably.Anurag: [crosstalk 00:10:57].Corey: And I think, looking back from 2021, it's hard to argue that Amazon hasn't disrupted the data warehouse space and fifteen other spaces besides.Anurag: Yeah, so that's what we were all about, sort of trying to find areas of non-consumption. So clearly, data was growing; data warehousing was not growing at the same rate. We figured that had to do with either a cost problem, or it had to do with a simplicity problem, or something else. Why aren't people analyzing the data that they're collecting? So, that led to Redshift. A similar problem in transaction processing led to Aurora and various other things.Corey: You also said a couple of minutes ago that Amazon tends to talk more about features than they do about products, and building a product at a startup is a foundationally different experience. I think you're absolutely on to something there. Historically, Amazon has folks get on stage at re:Invent and talk about this new thing that got released, and it feels an awful lot like a company saying, “Yeah, here's some great bricks you can use to build a house.” “Well, okay. What kind of house can I build with those bricks?” “Here to talk about the house that they built as our guest customer speaker from Netflix.”And it seems like they sort of abdicated, in many respects, the storytelling portion to a number of their customers. It is a very rare startup that has the luxury of being able to just punt on building a product and its product story that goes along with it. Have you found that your time at Amazon made storytelling something that you wound up missing a bit more, or retelling stories internally that we just don't get to see from the outside, or is, “Oh, wow. 
I never learned to tell a story before because at Amazon, no one does that, and I have to learn how to do that now that I'm at a startup again?”Anurag: No, I think it really is a storytelling experience. I mean, it's a narrative-based culture there, which is, in many ways, a storytelling experience. So, we were trying to provide a set of capabilities so that people could build their own things, you know, much as Kindle allows people to self-publish books; we're not really writing books of our own. And so I think that was the experience there. Outside, you are trying to solve more holistic problems, but you're still only a puzzle piece in the experience that any given customer has, right? You don't satisfy all of their needs, you know, soup to nuts.Corey: And part of the challenge too, is that if I'm a small, scrappy startup, trying to get something out the door for the first time, the problems that I'm experiencing and the challenges that I have are radically different than something that has attained hyperscale and now has whole optimization stories or series of stories going on. It's, will this thing even work at all is my initial focus. And in some ways, it feels like conference-ware cuts against a lot of that because it's hard not to look at the aspirational version of events that people tell on stage at every event I've ever seen, and not come away with a takeaway of, “Oh. What I've built is actually terrible, and depressing, and sad.” One of the things that I find that resonates about what you're building over at Shoreline is, it's not just about the build things from scratch and get them provisioned for the first time. It's about the ongoing operationalization, I think—if that's a word—about that experience, and how to wind up handling the care and feeding of something that exists and is running, but is also subject to change because all things are continually being iterated on.Anurag: That's right. 
I feel like operation is sort of an increasingly important but underappreciated part of the service delivery experience much as, maybe, QA was a couple of decades ago. And over time we've gone and we built pipelines to automate our test infrastructure, we have deployment tools to deploy it, to configure it, but what's weird is that there are two parts of the puzzle that are still highly manual: developing software and operating that software in production. And the other thing that's interesting about that is that you can decide when you are working on developing a piece of code, or testing it, or deploying it, or configuring it. You don't get to decide when the disk goes down or something breaks. That's why you have 24/7 on-call.And so the whole point of Shoreline is to break that into two problems: the things that are automatable, and make it easy, as trivial to automate those things away so you don't wake up to do something for the tenth time; and then for the remaining things that are novel, to make diagnosing and repairing your fleet, as simple and straightforward as diagnosing and repairing a single box. And we do a lot of distributed systems [techs 00:16:01] underneath the covers to make that the case. But those are the two things that we do, and so hopefully that reduces people's downtime and it also brings back a lot of time for the operators so they can focus on higher-value things, like working with you to reduce their AWS bill.Corey: Yeah, for better or worse, working on the AWS bill is always sort of a backseat function, or a backburner function, it's never the burning priority unless things have gone seriously awry. It's a good governance thing; it's the idea of where, let's optimize this fixed unit economics. It is rarely the number one most pressing area of business for a company. Nor should it be; I think people are sometimes surprised to hear me say that. 
You want to be reasonable stewards of the money entrusted to you and you obviously want to continue to remain in business by not losing money on everything you sell, but trying to make it up in volume. But at some point, it's time to stop cutting and focus instead on revenue growth. That is usually the path to success for almost every company I've ever spoken to, unless they are either very out of kilter, or in a very strange spot in the industry.Anurag: That's true, but it does belong, I think, in the ops function to do optimization of your experience, whether—and, you know, improving your resources, improving your security posture, all of those sorts of things fall into production ops landscape, from my perspective. But people just don't have time for it because their fleets are growing far, far faster than their headcount is. So, the only solution to that is automation.Corey: And I want to talk to you about that. Historically, the idea has been that you have monitoring—or observability these days, which I consider to be hipster monitoring—figuring out what's going on in your environment. Then you wind up with incidents being declared when certain things wind up triggering, which presumably are things that actually matter and not, you're waking someone up for vague reasons like ‘load average is high on these nodes,' which tells you nothing in isolation whatsoever. So, you have the incident management portion of that [next 00:18:03], and that handles a lot of the waking folks up and getting everyone onto the call. You're focusing on, I guess, a third tranche here, which is the idea of incident automation. Tell me about that.Anurag: That's exactly right. So, having been in the trenches, I never got excited about one more dashboard to look at, or someone routing a ticket to the right person, per se, because it'll get there, right?Corey: Oh, yeah. 
Like, one of the most depressing things you'll ever see in a company is the utilization numbers from the analytics on the dashboards you build for people. They look at them the day you build them and hand it off, and then the next person visiting it is you while running this report to make sure the dashboard is still there.Anurag: Yeah. I mean, they are important things. I mean, you get this huge sinking feeling something is wrong and your observability tool is also down like CloudWatch was in some large-scale events. Or if your ticketing system is down and you don't even notify somebody and you don't even know to wake up. But what did excite me—so you need those things; they're necessary, but they're not sufficient.What I think is also needed is something that actually reduces the number of tickets, not just lets you observe them or find the right person to act upon it. So, automation is the path to reducing tickets, which is when I got excited because that was one less thing to wake up on that gave me more time back to wo—do things, and most importantly, it improved my customer availability because any individual issue handled manually is going to take an hour or two or three to deal with. The issue being done by a computer is going to take a few seconds or a few minutes. It's a whole different thing. It's the difference between a glitch and having to go out on an apology tour to your customers.Corey: I really love installing, upgrading, and fixing security agents in my cloud estate! Why do I say that? Because I sell things, because I sell things for a company that deploys an agent, there's no other reason. Because let's face it. Agents can be a real headache. Well, now Orca Security gives you a single tool that detects basically every risk in your cloud environment -- and that's as easy to install and maintain as a smartphone app. 
It is agentless, or my intro would've gotten me into trouble here, but it can still see deep into your AWS workloads, while guaranteeing 100% coverage. With Orca Security, there are no overlooked assets, no DevOps headaches (and believe me, you will hear from those people if you cause them headaches), and no performance hits on live environments. Connect your first cloud account in minutes and see for yourself at orca.security. That's “Orca” as in whale, “dot” security as in that thing your company claims to care about but doesn't until right after it really should have.
Corey: Oh, yes. I feel like those of us who have been in the ops world for long enough, we always have a horror story or two of automation around incidents run amok. A classic thing that we learned by doing this, for example, is if you have a primary and a secondary, failover should be automated. Failing back should not be, or you wind up in these wonderful states of things thrashing back and forth. And in many cases in data center land, if you have a phantom router ready to step in, if the primary router goes offline, more outages are caused by a heartbeat failure between those two devices, and they both start vying for power.
And that becomes a problem. Same story with a lot of automation approaches. For example, if oh, every time a disk winds up getting full, all right, we're going to fire off something to automatically expand the volume. Well, without something to stop that feedback loop, you're going to potentially wind up with an unbounded growth problem and then you wind up with having no more disks to expand the volume to, being the way that winds up smacking into things. This is clearly something you've thought about, given that you have built a company out of this, and this is not your first rodeo by a long stretch. How do you think about those things?
Anurag: So, I think you're exactly right there, again.
So, the key here is to have the operator, or the SRE, define what needs to happen on an individual box, but then provide guardrails around them so that you can decide, oh, a lot of these things have happened at the same time; I'm going to put a rate limiter or a circuit breaker on it and then send it off to somebody else to look at manually. As you said, like failover, but don't flap back and forth, or limit the number of times that something is allowed to fail before you send it [unintelligible 00:21:44]. Finally, everything grounds at a human being looking at something, but that's not a reason not to do the simple stuff automatically because wasting human intelligence and time on doing just manual stuff again, and again, and again, is pointless, and also increases the likelihood that they're going to cause errors because they're doing something mundane rather than something that requires their intelligence. And so that also is worse than handing it off to be automated.
But there are a lot of guardrails that can be put around this—that we put around it—that is the distributed systems part of it that we provide. In some sense, we're an orchestration system for automation, production ops, the same way that other people provide an orchestration system for deployments, and automated rollback, and so forth.
Corey: What technical stacks do you wind up supporting for stuff like this? Is it anything you can effectively SSH into? Does it integrate better with certain cloud providers than others? Is it only for cloud and not for folks with data center environments? Where do you start? Where do you stop?
Anurag: So, we have started with AWS, and with VMs and Kubernetes on AWS. We're going to expand to the other major cloud providers later this year and likely go to VMware on-prem next year. But finally, customers tell us what to do.
Corey: Oh, yeah.
Looking for things that have no customer usage is—that's great and all, but talking to folks who are like, “Yeah, it'd be nice if it had this.” “Will you buy it if it does?” “No.” “Yeah, let's maybe put that one on the backlog.”Anurag: And you've done startups, too, I see that.Corey: Oh, once or twice. Talk to customers; I find that's one of those things that absolutely is the most effective use of your time you can do. Looking at your site—Shoreline.io for those who want to follow along at home—it lists a few different remediations that you give as examples. And one of them is expanding disk volumes as they tend to run out of space. I'm assuming from that perspective alone, that you are almost certainly running some form of Agent.Anurag: We are running an Agent. So, part of that is because that way, we don't need credentials so that you can just run inside the customer environment directly and without your having to pass credentials to some third party. Part of it is also so you can do things quickly. So, every second, we'll scrape thousands of metrics from the Prometheus exporter ecosystem, calculate thousands more, compare them against hundreds of alarms, and then take action when necessary. And so if you run on-box, that can be done far faster than if you go on off-box.And also, a lot of the problems that happen in the production environment are related to networking, and it's not like the box isn't accessible, but it may be that the monitoring path is not accessible. So, you really want to make sure that the box can protect itself even if there's some issues somewhere in the fleet. And that really becomes an important thing because that's the only time that you need incident automation: when something's gone wrong.Corey: I assume that Agent then has specific commands or tasks it's able to do, or does it accept arbitrary command execution?Anurag: Arbitrary command execution. 
Whatever you can type in at the Linux command prompt, whether it's a call to the AWS CLI, kubectl, Linux commands like top, or even shell scripts, you can automate using Shoreline.
Corey: Yeah. That was one of the ways that Nagios got it wrong, once upon a time, with their NRPE, their Nagios Remote Plugin Executor, where you would only be allowed to run explicit things that had been pre-approved and pushed out to things in advance. And it's one of the reasons, I suspect, why remediation in those days never took off. Now, we've learned a lot about observability and monitoring, and keeping an eye on things that have grown well beyond host-based stuff, so it's nice to see that there is growth in that. I'm much more optimistic about it this time around, based upon what you're saying.
Anurag: I hope you're right because I think the key thing also is that I think a lot of these tools vendors think of themselves as the center of the universe, whereas I think Shoreline works the best if it's entirely invisible. That's what you want from a feedback control system, from an automation system: that it just gives you time back and issues are just getting fixed behind the scenes. That's actually what a lot of AWS is doing behind the scenes. You're not seeing something whenever some rack goes down.
Corey: The thing that has always taken me aback—and I don't know how many times I'm going to have to learn this lesson before it sticks—I fall into the common trap of take any one of the big internationally renowned tech companies, and it's easy to believe that oh, everything inside is far future wizardry of, everything works super well, the automation is flawless, everything is pristine, and your environment compared to that is relative garbage. It turns out that every company I've ever spoken with and taken SREs from those companies out to have way too many drinks until they hit honesty levels, they always talk about it being a sad dumpster fire in a bunch of different ways.
And we're talking some of the companies that people laud as the aspirational, your infrastructure should be like these companies. And I find it really important to continue to socialize that point, just because the failure mode otherwise is people think that their company just employs terrible engineers and if people were any good, it would be seamless, just like they say on conference stages. It's like comparing your dating life to a romantic comedy; it's not an accurate depiction of how the world works.Anurag: Yeah, that's true. That said, I'd say that, like, the average DBA working on-prem may be managing a hundred databases; the average DBA in RDS—or somebody on call—might be managing a hundred thousand.Corey: At that point, automation is no longer optional.Anurag: Yeah. And the way you get there is, every week you squash and extinguish one thing forever, and then you start seeing less and less frequent things because one in a million is actually occurring to you. But if it was one in a hundred, that would just crush you. And so you just need to, you know, very diligently every week, every day, remove something. Yeah, Shoreline is in many ways the product I wish I had had at AWS because it makes automating that stuff easy, a matter of minutes, rather than months. And so that gives you the capability to do automation. Everyone wants automation, but the question is, why don't they do it? And it's just because it takes so much time and we're so busy, as operators.Corey: Absolutely. I don't mean to say that these large companies working at hyperscale have not solved for these problems and done truly impressive things, but there's always sharp edges, there's always things that are challenging and tricky. On this show, we had Dr. Christina Maslach recently as an expert on burnout, given that she spent her entire career studying occupational burnout as an academic. 
And it turns out that it's not—to equate this to the operations world—it's not waking up at two in the morning to have to fix a problem—generally—that burns people out. It's being woken up to fix a problem at 2 a.m. consistently, and it's always the same problem and nothing ever seems to change. The worst ops jobs I've ever seen are the ones where you have to wake up to fix a thing, but you're not empowered to actually fix the cause, just the symptom.
Anurag: I couldn't agree more and that's the other aspect of Shoreline is to allow the operators or SREs to build the remediations rather than just put a ticket into some queue for some developer to get prioritized alongside everything else. Because you're on the sharp edge when you're doing ops, right, to deal with all the consequences of the issues that are raised. And so it's fine that you say, “Okay, there's this memory leak. I'll create a ticket back to dev to go and fix it.” But I need something that helps me actually fix it here and now. Or if there's a log that's filling up my disk, it's fine to tell somebody about it, but you have to grow your disk or move that log off the disk. And you don't want to have to wake up for those things.
Corey: No. And the idea that everything like this gets fixed is a bit of a misnomer. One of my hobbies is whenever a site goes down and it is uncovered—sometimes very publicly, sometimes in RCAs—that the actual reason everything broke was due to an expired certificate.
Anurag: Yep.
Corey: I like to go and schedule out a couple of calendar reminders on that one for myself, of check it in 90 days, in case they're using a refresh from Let's Encrypt, and let's check it as well in one year and see if there's another outage just like that.
It has a non-zero success rate because as much as we want to convince ourselves that, oh, that bit me once, and I'll never get bitten like that again, that doesn't always hold true.Anurag: Certificates are a very common source of very widespread outages. And it's actually one of the remediations we provide out of the box. So, alongside making it possible for people to create these things quickly, we also provide what we call Op Packs, which are basically getting started things which have the metrics, alarms, actions, bots, so they can just fix it forever without actually having to do very much other than review what we have done.Corey: And that's, on some level, I think, part of the magic is abstracting away the toil so that people are left to solve interesting problems and think about these things, and guiding them down a path where, okay, what should I do on an automatic basis if the disk fills up? Well, I should extend the volume. Yeah. But maybe you should alert after the fifth time in an hour that you have to extend the same volume because—just spitballing here—maybe there's a different problem here that putting a bandaid on isn't going to necessarily solve. It forces people to think about what are those triggers that should absolutely result in human intervention because you don't necessarily want to solve things like memory leaks, for example, oh our application leaks memory so we have to restart it once a day.Now, in practice, the right way to solve that is to fix the application. In practice, there are so many cron jobs out there that are set to restart things specifically for that reason because cron jobs are quick and easy and application developer time is absolutely not easy to come by in many of these shops. It just comes down to something that helps enforce more of a process, more of a rigor. I like the idea quite a bit; it aligns both with where people are and how a better tomorrow starts to look. 
I really do think you're onto something here.
Anurag: I mean, I think it's one of these things where you just have to understand it's not either-or, that it's not a question of operator pain or developer pain. It's, let's go and address it in the here and now and also provide the information, also through an automated ticket generation, to where someone can look to fix it forever, at source.
Corey: Oh, yeah. It's always great for the user experience, too. Having those tickets created automatically is also sometimes handy because the worst way to tell someone you don't care about their problem when they come to you in a panic is, “Have you opened a ticket?” And yes, of course, you need a ticket to track these things, but maybe when someone is ghost pale and scared to death about what they think just broke the data, maybe have a little more empathy there. And yeah, the process is important, but there should be automatic ways to do that. These things all have APIs. I really like your vision of operational maturity and managing remediation, in many cases, on an automatic basis.
Anurag: I think it's going to be so much more important in a world where deployments are more frequent. You have microservices, you have multiple clouds, you have containers that give a 10x increase in the number of things you have to manage. There's a lot for operators to have to keep in their heads. And things are just changing constantly with containers. Every minute, someone comes and one goes. So, you just really need to—even if you're just doing it for diagnosis, it needs to be collecting it and putting it aside, is really critical.
Corey: If people want to learn more about what you're building and how you think about these things, where can they find you?
Anurag: They can reach out to me on LinkedIn at awgupta, or of course, they can go to Shoreline.io and reach out there, where I'm also anurag@Shoreline.io if they want to reach out directly.
And we'd love to get people demos; we know there's a lot of pain out there. Our mission is to reduce it.
Corey: Thank you so much for taking the time to speak with me today. I really appreciate it.
Anurag: Yeah. This was a great privilege to talk to you.
Corey: Anurag Gupta, CEO and founder of Shoreline.io. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with a comment telling me that I'm wrong and that Amazonians are the best at being on call because they carry six pagers.
Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
Announcer: This has been a HumblePod production. Stay humble.
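The guardrail pattern discussed in this episode (automate the routine fix, such as expanding a full volume, but stop and involve a human once the same fix keeps firing) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Shoreline's implementation: the class name `GuardedRemediation`, its parameters, and the thresholds are all hypothetical, chosen so that the fifth trigger within an hour escalates instead of remediating.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque


@dataclass
class GuardedRemediation:
    """Run an automated fix, but hand off to a human if it fires too often.

    Every name here is hypothetical; it illustrates the rate-limit /
    circuit-breaker idea from the conversation, not any vendor's API.
    """

    action: Callable[[], None]        # the automated fix, e.g. expand a volume
    escalate: Callable[[str], None]   # page or ticket a human
    max_runs: int = 4                 # automatic runs allowed per window
    window_seconds: float = 3600.0    # sliding window length (one hour)
    _runs: Deque[float] = field(default_factory=deque, init=False)

    def trigger(self, now: float) -> str:
        # Forget runs that have aged out of the sliding window.
        while self._runs and now - self._runs[0] >= self.window_seconds:
            self._runs.popleft()

        if len(self._runs) >= self.max_runs:
            # Circuit open: break the feedback loop and involve a person.
            self.escalate(
                f"remediation already ran {len(self._runs)} times in the window"
            )
            return "escalated"

        self._runs.append(now)
        self.action()
        return "remediated"


if __name__ == "__main__":
    expansions, pages = [], []
    guard = GuardedRemediation(
        action=lambda: expansions.append("expand volume"),
        escalate=pages.append,
    )
    # Six disk-full alarms, one minute apart (timestamps in seconds).
    for i in range(6):
        print(i, guard.trigger(now=60.0 * i))
```

Passing the timestamp in as `now` rather than reading the clock inside `trigger` is a deliberate choice for the sketch: it keeps the guard deterministic and easy to test. A real system would also need to decide whether the limit applies per host, per volume, or fleet-wide.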

Screaming in the Cloud
Keep on Rockin' in the Server-Free World

Screaming in the Cloud

Play Episode Listen Later Jul 15, 2021 36:01


About Michael
Michael Garski is the Director of Platform Engineering at Fender Musical Instruments, where he leads the teams responsible for service development & testing, devops, and data. He's been with Fender for over 5 years and prior to that worked as a software engineer & architect on back-end systems at Viant, MySpace, Countrywide Home Loans & Fandango. He is passionate about application reliability and observability and their impact on customer satisfaction.
Links:
LinkedIn: https://www.linkedin.com/in/mgarski/
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.
Corey: Your company might be stuck in the middle of a DevOps revolution without even realizing it. Lucky you! Does your company culture discourage risk? Are you willing to admit it? Does your team have clear responsibilities? Depends on who you ask. Are you struggling to get buy-in on DevOps practices? Well, download the 2021 State of DevOps report, brought to you annually by Puppet since 2011, to explore the trends and blockers keeping firms stuck in the middle of their DevOps evolution. Because they fail to evolve or die like dinosaurs. The significance of organizational buy-in, and oh it is significant indeed, and why team identities and interaction models matter. Not to mention whether the use of automation and the cloud translates to DevOps success. All that and more awaits you. Visit www.puppet.com to download your copy of the report now!
Corey: If you're familiar with Cloud Custodian, you'll love Stacklet.
Which is made by the same people who made Cloud Custodian, but put something useful on top of it so you don't need to be a YAML expert to work with it. They're hosting a webinar called “Governance as Code: The Guardrails for Cloud at Scale” because it's a new paradigm that enables organizations to use code to manage and automate various aspects of governance. If you're interested in exploring this you should absolutely make it a point to sign up, because they're going to have people who know what they're talking about—just kidding, they're going to have me talking about this. It's going to be on Thursday, July 22nd at 1pm Eastern. To sign up visit snark.cloud/stackletwebinar and I'll talk to you on Thursday, July 22nd.
Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. We talk to a lot of people here on this show who are deep in the weeds of SaaS companies, or cloud vendors, or cloud vendors cosplaying as SaaS companies. Today, we're taking a bit of a different direction. My guest is Michael Garski, Director of Platform Engineering at Fender Musical Instruments. They make guitars among many other things. Michael, thank you for joining me.
Michael: Oh, thanks for having me on, Corey.
Corey: So, one of the things that I really appreciate about what you do as a company is I can, at least presumably, explain it to someone who is not super deep in technical weeds without 45 minutes of explainer first. The easy answer is, “Oh, Fender. You folks make guitars.” These days, no one just does one thing, I have to imagine. How do you describe what the company does?
Michael: Oh, well, to quote Leo Fender, his view was that artists are angels and it's our job to give them wings. So, in addition to actually making and developing guitars and amplifiers, we've branched off into consumer-facing products to actually teach people how to play those instruments.
Corey: You folks have been relatively outspoken about the various things you're doing at different AWS events.
I mean, my approach to that tends to be that if AWS is great at making bricks that you can use to build amazing things with, “Well, great, can you draw a picture of the house that you can build with this?” “No, we're going to have a customer come out and talk about that stuff instead.” You folks have been focusing on a lot of serverless work, and you've been very public about the fact that you are almost entirely serverless-driven in terms of architecture if I'm not mistaken.Michael: That is true.Corey: Tell me about that. How did you get there and what brought it about?Michael: So, I work in the digital division in Fender. We started, let's see, we're coming up on five years I've been there. So, what we did was, initially, we started building services that could run within a container, or on an EC2 instance, but we started looking at Lambda functions. We had need to ingest a product catalog, so the IT team was able to drop us off a product catalog into an S3 bucket, and the easiest thing to do then was just trigger a Lambda function to then process that file. And it just kind of snowballed in from there.Corey: I think the common problem when people hear ‘serverless' is they think, “Oh, great. More discussions about Lambda functions.” And Lambda is almost getting something of a tarred reputation in some circles because when we can build amazing things with it ourselves, we love it, but when we ask AWS how to wind up integrating two services, or about a feature gap, their response is, “Oh, use a Lambda function for it,” It starts to feel like they're using it as spackle and the spackle has become load-bearing. Do you view serverless as being purely function-driven or is it broader than that?Michael: It's much broader than that. Serverless is a mindset where you're looking beyond just Lambda functions to using a lot of third-party services so that you can actually focus on your core business. 
Like, we use Zuora as a subscription provider for web-based subscriptions; we use Algolia for full-text search; we use a variety of other services so that we can just focus on the core business.

Corey: One thing that's been on everyone's mind, somewhat recently, has been the idea of dramatic changes as far as user behavior goes. And in the more traditional environments where you see things like EC2 instances or on-premises data centers, back when the pandemic first hit, companies that were very focused on a model of business that aligned directly with people behaving in certain ways that they suddenly didn't, saw 80% drop-offs or more in their user traffic, but their infrastructure spend just kept hanging out exactly where it was, in a straight line. So, at some level, it feels like yes, the whole point of cloud is that it can be elastic, except no one builds it that way for a variety of reasons. When COVID hit, what changed for your business?

Michael: The change for our business was that we launched a program called Playthrough. We did this about a year ago; we started it, we gave away three months of Fender Play for free. It was a single-use code that a user would redeem and no credit card required, and over a period of five days, we saw our traffic increase by more than ten times. And we had very few changes we needed to make. Everything scaled up, we had no issue with—we used a lot of Lambda functions, DynamoDB, everything just scaled up fine. The only point that became a bottleneck was our Elasticsearch cluster. However, beefing up the nodes and adding a few more nodes resolved that issue immediately.

Corey: So, I'm going to go out on a limb and postulate that you folks increased pickup when the lockdowns hit, if for no other reason than, “Well, I'm trapped at home and I'm tired of staring at the guitar on the wall. I may as well learn to play it.” I would guess. I could be way off base on that.

Michael: No, no, that's very true. 
Even since then, even after that program has expired—of course, not everyone then converts and sticks around—but many, many did, many more than we thought would, did stick around, and our usage and our goals were exceeded for this last year, and we're in a healthy place, and looking at continuing to grow and expand in the future.

Corey: So, one of the applications that I think gets a fair bit of attention—rightfully so—lately, is something called Fender Play, and as best I can tell, that is an app that works in web, it works on mobile, and it's a video-based instruction tool for guitar at least, but some other instruments as well. How did that come to be? Did that exist before COVID hit? Has that been something that's been in the works for a while? Or was it, “Well, we're going to do a two-week sprint and build this thing from scratch?”

Michael: No, we launched that—this June we're coming up on the fourth anniversary since it's been launched, so we launched this in summer of 2017.

Corey: One of the problems I've always found is that it's challenging to learn to do something that is as, I guess, physical and intricate, et cetera, as playing an instrument without having someone in the room looking at you and smacking you with a stick whenever you do things that are wrong. “Nope, that's a bad habit. If you keep doing that it's going to hurt you.” How do you approach that as a company from a non-interactive perspective of someone who's going to watch a video and do things and maybe it'll work, maybe it won't? Particularly in light of things like, well, the competition is YouTube, which, you know, I'm going to roll the dice and sometimes I'll see a great tutorial, sometimes I'll see one that I don't realize is teaching me terrible things, and then it's going to recommend some baseless conspiracy theory because YouTube. How do you differentiate that? What makes Fender Play different?

Michael: So currently, you're right; it's just a video-based instruction app. 
There's not any way to, like, provide direct feedback to students within the web and mobile applications. However, we do have an online community, and our Fender Play instructors do an office hours feature, where they'll actually answer questions live and talk to students. We are investigating and doing some early research into, possibly, being able to provide that type of feedback to users, but it's a very challenging problem, just due to the nature of you're playing an instrument that has multiple strings, so you're trying to pick out the chord that they're playing in, and the timing. But it's something we definitely need to add.

Corey: There's something to be said as well for the kind of care and attention that you folks wind up putting into your media where, “This is how you finger a chord,” and someone on the YouTube video will do it for two-tenths of a second, and they're filming it with a potato that isn't focused properly and pointing at the wrong part of the guitar. You folks have a high bar for quality on this. Is that done in-house? Do you wind up just going through a bunch of random folks that you just wind up offering a bunch of gift cards to, or free guitars to do this? How does the program work on the back end?

Michael: So, we have an in-house curriculum team that puts together the lesson plans to really help people learn in small bite-sized lessons so that it's not too overwhelming at once. And that curriculum then is shot and filmed by an in-house video team that put that together; they upload the data into S3 for the final cut, then that gets transcoded via MediaConvert, and we serve it up via CloudFront.

Corey: It's rare to wind up talking to a company that is something of a household name about something that they're doing, and hear the AWS services that they're using not trend toward a baseline mean if I can be so bold. Normally, you'll see some of the case studies, like, “Oh, this is an online bank. 
What services are they using?” “Oh, they're using EC2, and S3, and load balancing because did you miss the part where it's a bank?” They're not going to use these far-future services due to regulatory risk, among other things, in many cases.

You're using Elemental MediaConvert, which is one of those relatively high-up-the-stack offerings that isn't broadly known. It's one of those services that is focused on specific use cases and specific industry verticals in a way that a baseline primitive service isn't. What does MediaConvert do?

Michael: What it does is it takes the final edit of the video, and we have several different presets so that it will put it into an HLS format with different bitrates so that the user is getting the best quality video depending on their bandwidth.

Corey: When I looked into it in the early days when it was first launching, I found that it looked an awful lot like Elastic Transcoder, which is a service that they've had for a while, only they changed up some of the capabilities. It's obviously far more capable as a service, but they also added something that felt like 15 different billing dimensions to it, “So, what is this going to cost me?” “Well, we're going to run it for a month and find out if we're still in business.” And it seemed like it was one of those very difficult to get started with and run experiments with service. Now, obviously, services evolve over time. When you started looking into it was that experience roughly akin to what you felt, or am I completely and unfairly slandering the product?

Michael: We actually started out using Elastic Transcoder and then moved over to MediaConvert, I believe it was last year. We found it to be a little bit easier to use, and the pricing overall in transcoding the videos for us is really a drop in the bucket as compared to actually hosting them and serving them up via CloudFront. 
And when we switched over to MediaConvert, we adjusted our settings to lower the maximum bitrate for a given video; we found that after a certain point, the quality to the user just doesn't really improve, and yet we're paying to serve the larger video.

Corey: One statistic that I found was that in March of 2020—you know, which I believe we're still in at this point; just, it's the Endless September model, applied to March—you wound up seeing over an order of magnitude in traffic increase within five days, and looking at that through a lens of traditional architecture, that means that nobody sleeps a whole heck of a lot. Given that you're in on the serverless story, and you have been since before that hit, what was that scaling experience like for you?

Michael: Scaling experience was completely seamless. We use a lot of Lambda, DynamoDB, Kinesis, SNS, to glue things together, and no problems whatsoever. Just had to bump up our Elasticsearch cluster a bit, that was really the only thing because we saw some latency starting to rise on some of our APIs.

Corey: Let me ask the uncomfortable question then because whenever I tried to scale things up quickly in a cloud environment, what was your experience with smacking into various AWS service limits as the traffic grew?

Michael: Initially, we actually requested some service limit increases to make sure we weren't hitting the concurrent Lambda invocation limit, and same thing with Cognito, making sure that we weren't going to hit any limits as far as sign-ins and things like that. So, we were able to just put in requests, and they gave us a pretty quick turnaround time on that, as well.

Corey: It really does seem like there's a strong benefit on the serverless space, but I had to double-check before we started recording that you do, in fact, work at Fender because you are a staunch advocate for observability. 
And usually, when someone is that passionate about observability, you can guess that they work at an observability-slash-monitoring company. It's akin to the idea of someone selling mattresses telling you that mattresses are great and you should have four of them. You're on the customer side of that and still very passionate about it. Where'd that come from?Michael: Came from my time years ago, when I worked at MySpace—if anyone can still remember that—working on the search systems there. And as the company started winding down, to laying people off, and being one of the only people left working on those systems, being able to know and understand them, you just have to, so you have to continue to monitor and find ways to monitor, and that really ingrained how important instrumentation is and being able to really understand the health of your application as it's running so that you can see, yes, everything is good, and then when something doesn't look right so that you can know where to start looking, and you can be alerted of a problem.Corey: So, I tend to view the world in olden terms where monitoring was what we did, and we use something like Nagios, which was the second-worst option out there because everything else felt like it was tied for first. I also take a somewhat regressive view that observability is to monitoring as DevOps is to being a systems administrator. It's the same thing, but by using the more modern terminology, you can charge more for it. I'm going to go out on a limb and guess that you take a somewhat contrarian [laugh] view to that.Michael: Yes, yes, I do. It's about really understanding how your applications is running. It's not just looking at, oh, how many HTTP 500s am I serving up per hour, if I hit a threshold for the last hour? It's a lot more than that. It's really being able to really dig in and see what the issue is or what's working really well.And to that end, we rely on two services for this. We use Honeycomb and Epsagon. 
Honeycomb, kind of, acts as our top layer because it gives us the really good high-cardinality metrics where I can punch in a user ID and I can see all the API traffic that this user has performed. As well as, even just like when we launched Playthrough and our traffic rose, the reason we discovered that our latency was degrading was a service-level objective on latency being triggered in Honeycomb. And we were able to respond to that before customers really noticed anything at all.

Corey: As an Epsagon customer myself, I'm always conflicted when I find myself going into their service and using it to figure out what the heck's going on with my giant pile of Lambda functions, and API gateways, and whatnot, wired together because the experience is uniformly excellent, but I'm also frustrated in that it needs a third-party to even begin to allude to what's going on. It feels, on some level, like the vendor that is providing this service to me should be reasonably effective at telling me what it's doing, and when it's breaking. I understand that how I wish the world is and how it actually is are two radically different things but does that ever strike you as well?

Michael: Whether or not AWS should be providing that type of level, that seems… that seems like more of a service that you can have competition and other vendors that really specialize and get in the weeds on it. I don't think AWS needs to provide every service you could possibly use for your application. That's not something I'm too concerned about. I don't really even think it's their place, frankly.

Corey: No, no, I understand. The problem I keep running into, on some level, whenever I try and diagnose it natively is, I look at CloudWatch and it's difficult to understand that is this—in my case because again, I'm still early days with a lot of these things—is it the API gateway that's having the problem? Is it the CloudFront distribution that is tied to that? 
Is it the Lambda function? Where's the handoff?

Trying to understand where in a complicated application the failure is occurring is a challenge. And let's be clear, most of that is a problem of my own making because I didn't have the good sense to instrument this thing in a reliable repeatable way when I built it. It feels like everything is tied together with duct tape, and baling wire, and spit, and a bit of luck. As a counterpoint, the more companies I talk to, the more I realize that no, no, this is actually how most people feel [laugh] when they look at things that are working. It's, yeah, it's terrible. It's a trash fire, but it makes money so we're going to roll with it.

And there's always, on some level, a sense of what we've built is very far from the platonic ideal of what we should have built. Does that resonate with you, or do you take a step back and look at what you've achieved with a perspective of, “This is awesome. More people should do it exactly like this.” And honestly, if it's that one, I'd love to take a look at what you've built.

Michael: I think there's always room for us to improve on what we're doing because we're constantly learning and evolving to improve both, even at such a low level of like, “Okay, how do we lay out the files in our service repository to make the best organization to make sense?” All the way up to, “Okay, how are we going to do tracing? And what kind of information do we need to get from that so that we can find problems when they occur?” We're always looking to learn what others are doing, and talking to others in this space. No one will ever be a hundred percent right. There's always room for improvement everywhere.

Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I'm going to just guess that it's awful because it's always awful. No one loves their deployment process. 
What if launching new features didn't require you to do a full-on code and possibly infrastructure deploy? What if you could test on a small subset of users and then roll it back immediately if results aren't what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.

Corey: One thing that you folks have done that I think was really interesting and didn't get as much play as I think it really deserved, was that, especially in the early days of the pandemic, you wound up seeing that massive increase due to giving out almost a million free three-month subscriptions to Playthrough. Additionally, you also worked closely with LAUSD, the Los Angeles Unified School District, to add Fender Play to their middle school music program's curriculum to help supplement their remote learning programs. First, was that all in the same timeframe? Or—and, two, what has it been like, I guess, working with an organization that is, I guess, on some level, not particularly cloud-first. I would imagine. When I lived in Los Angeles, I never got the sense that LAUSD was full-on serverless, full on-board with cloud, full on-board with remote learning. And then the pandemic of course exacerbates all of that.

Michael: Yeah, so those were really two different projects. So, the Playthrough project started in March, and we started working with Los Angeles Unified School District last year during their summer school program; started out with 1500 students and we put it together very quickly. Essentially, we use the same three-month codes that we used for that Playthrough promotion so that we could set things up very quickly for students and gave out, through our nonprofit arm of Fender, the Fender Play Foundation, gave out 1500 instruments to these students to use during the summer school program. 
And that program became so successful, we continued on with them in the fall, and now in the current semester, and we will be again this summer. I believe there's 7000 students in the program now.

And working with their IT team has actually been quite nice. And in dealing with partners, you wouldn't think much of, “Oh, it's a school district, what do they have?” But as far as just ease of working with them, we actually hooked into their SAML provider in Cognito so that LAUSD students could authenticate when they come in through the remote learning systems. And they were great to work with and very helpful and cooperative.

Corey: One of the arguments that you'll see that comes up against serverless, from time to time, is that you are now indelibly linked to your provider, but you can't take what you've built with all of these services and just move it over to Azure or GCP on a moment's whim. Now, in practice, people who tend to build for that, just build everything on top of EC2 and very little else, and then run it entirely in AWS and never move it to any of those other places. But was there friction with making that, I guess, architectural commitment to a single vendor?

Michael: Oh, you're bringing up the vendor lock-in Boogeyman.

Corey: Oh, I absolutely am. Most people who bring that—when I bring it up as a straw man so you can attack it, most people who bring up the vendor lock-in Boogeyman, “Oh, you have to go multi-cloud,” are either trying to sell you something that is required if you want to go multi-cloud, or they're a cloud provider themselves who know that if you go all-in on one provider, it will certainly not be theirs.

Michael: I think if you properly architect your applications with separations of concerns that you could move to, say—okay, say Lambda wasn't working out for us anymore, and we needed to take our applications and, where, we're going to put them into a container, but we're going to stay in AWS. 
Our applications are set up in such a way that Lambda is basically a deployment pattern. We could easily convert those individual function handlers into route handlers with a minimal effort because the business logic and then the underlying data storage are separated. So, it would be feasible for us if we wanted to, say, move to Azure and use Azure Functions and whatever comparable service they have to DynamoDB. I'm not too familiar with a lot of their offerings.

But that would certainly be possible to do it with, obviously, some effort and really, at the end of the day, the resources you have working on the applications are going to end up costing you much more than any, sort of like, software licensing or specific savings you're going to get from a cloud vendor, so might as well go ahead and just use those services that they're providing, so that you can just focus on the business.

Corey: My approach has almost universally been that looking at an awful lot of companies and their AWS bills, it is a challenge to find an environment where the resources in the environment cost more than the people who are operating them. In the context of business, AWS bills seemed giant and enormous, right up until you look at payroll and then it's, “Oh, okay.” That's counterintuitive for folks who are learning this, and I fall prey to it myself is, when I'm playing around as a hobbyist trying to build something I value, my time is free because I'm learning as this goes, and then in that context, especially when I was starting out as a student, it was, “Oh, great. So, this winds up costing me $7 a month. Oh, that's a lot of money. That's my ramen budget, so I'm instead going to wind up spending eight hours avoiding it charging me anything.” It's the exact opposite from the direction you want staff that you're paying to work on these things to go in. 
How do you approach the idea of increasing the cloud cost if it will save time for your team?

Michael: It's a balance between, where do we need to build this ourselves? And then not only build it, you have to operate it and maintain it? Or what is the cost of getting this third-party service? And that's really what it comes down to in all of them. And do we actually want to spend time working on this piece of infrastructure that these other people are specializing in and do so well? I've got better things I can have people doing than that.

Corey: Speaking of people, one thing that you talk about, as you self-describe, is that you wind up not writing a whole lot of code anymore, but you're something of a stickler for observability and enforcing consistency between services, so you'll periodically do things like submit a PR to tweak a log message to put your mind at ease, was one example that you gave. Given that you're a director, which is generally manager of managers style approaches, how do you avoid having those PRs come across to your team as either micromanagement or a condemnation of what they've built? Because I get it; when I see something that's easy and small to tweak, I want to go ahead and get it fixed immediately. I don't want to go back and forth and play those games; I just want it done. But I'm also always weighing that against, I don't want to have people think that I'm judging them somehow for something I'm very much not.

Michael: That's a very good point. The larger technical decisions on how things are laid out, I generally just try to—I don't insert myself into. I let the team go ahead, and make those decisions, and lead that direction, and let them take the charge on that, and I take the approach of looking at it as more of a guiding, and mentoring and teaching to really hone and instill that discipline in really being able to understand what the applications are doing. 
And as our team is growing, I have less and less time to even do those things, but I can go through the systems and go, “Hey, how come we're not tracing this call to the reCAPTCHA servers? Let's add that in there.” And at this point now, I mainly just write Jira tickets to have someone else actually do the work.

Corey: The more I do this, the more I realize that as complicated as the technology is, the people are in many ways, far more complicated. And let's be fair here, non-deterministic things that work super well on one person one month could work entirely differently a following month, or even with the same person, or between teams. It's a constant balancing act, on some level. And giving people a sense of psychological safety has always been the biggest challenge. The thing that surprised me about management, back when I was running ops teams was the more, I guess, responsibility you accrue as you rise from individual contributor into the management—or ‘rise' is sort of a wrong term; it's an orthogonal transition—is that you spend a lot more time on the people problems, and your ability to directly control or affect change diminishes because you have to do everything via influence. You get a lot more responsibility with a lot less direct power [laugh] over the outcome in some ways. Does that align with how you see it, or am I just—do I have very strange approaches on management? Which may be true, and why I got out of it as fast as I could.

Michael: No, that is a good point because you are having to [unintelligible 00:27:05], like, influence, and guide, and more take a higher-level view, as opposed to really getting into the weeds of like, “Okay, what methods are we going to put on this interface? How are we going to, say, architect the internals of an application?” Those are details I just really don't have time for anymore. 
But larger things as to making sure that we're okay, it's like, “What's the performance of this?” And, “Overall, is something that can be adapted as the business needs change, and as we change? And as we learn, what can we do to modify it?” And more just things like guiding, and mentoring, and really taking a higher-level view of that.

Corey: I'm going to selfishly ask about something that I struggle with myself. That goes a bit more into the technical area, but you talk about enforcing consistency across all of your different services. What does that mean? Similar coding style? Similar instrumentation?

Because I look at the things I built and microservices that power my internal nonsense, and each one of those is very different than all the rest. So, whatever your version of consistency is, I know I'm not doing it. But how do you view it?

Michael: So, there's really two types of consistency. The one I really refer to the most is in observability. So that, if you've got a thousand Lambda functions out there, and each one is logging things slightly differently, that's just a pain to deal with, and realistically, dealing with a thousand unicorns is a real pain. So, through that observability, at least in Lambda, we use an internally developed middleware to make sure that the logging is consistent, and it's easy enough to use. And then other consistency, like, just within projects of how we lay things out.

That's something that's been consistently evolving. What's the folder structure in how we organize the code? And we've kind of been evolving that over the last three years. And within about the last six months, we've come up with a really good pattern and a template for the future. And it's not much different from what we started out with, but it's a little bit easier, really, to comprehend as a new engineer coming in. 
It makes more sense.

Corey: I have to ask—and I understand if you don't want to give a particular endorsement in any direction—but do you go through Serverless Framework, SAM CLI, the CDK, using the console and then lying about it? What is the template that you wind up using for that uniformity? Because even internally, I use three or four of those different things and professional advice: don't do that.

Michael: Let's see. So, in our development, QA, production environments, infrastructure is all managed with Terraform. Each engineer has their own personal AWS account so that they can work on things there—

Corey: Oh, that makes billing granularity super easy.

Michael: Oh, yes. You can tell who's got EC2 instances running up for too long. But for the most part, we'll use Serverless Framework in that regard so the engineer can just deploy into their own environment. Although we are working on ways to reuse the Terraform infrastructure and deploy that. But we have our own build and deployment pipeline that we built using CircleCI, and all of our Lambda functions are in Go.

And so having to compile, say, 20 binaries in a service, that gets kind of slow, one of our DevOps engineers actually came up with a way to use Lambda to build the Lambdas, so that we can build them all in a distributed parallel fashion during the build process.

Corey: One thing that I do love about the whole serverless approach—and it is a neat part about Lambda—is no two people ever seem to do it quite the same way. You can tie things together in so many different and exciting ways, and it's fun. It's almost like a modern version of playing with Lego. And I know that if Jeff Barr is listening, he just perked up at that. But I love the concept that you can take so many different ways to achieve similar outcomes. And it almost gives a bigger sense of creativity in how you approach problems. Has that been your experience?

Michael: Oh, definitely. 
It's not only the creativity; it's also the flexibility in how you solve it, and the ability to adapt and evolve as services evolve, or change, or new ones are added. And to the point of using AWS, kind of, saying, “Oh, using a Lambda function to do this.” Like, using Lambda functions for customizing behavior of Cognito with the Cognito triggers, is to me, I think, a perfect way to customize the service to do exactly what you need to do.

Corey: I want to thank you so much for taking the time to speak with me today. It's always appreciated. If people want to hear more about what you have to say and how you view these things or even, possibly, decide to work with you, where can they find you?

Michael: I'm somewhat active on LinkedIn. LinkedIn is the best place to find me. Please go ahead and connect to me; tell me you heard me on the podcast here.

And yes, we are hiring. We have, all within our technical organization, from client, to web, and mobile engineers, data engineers, DevOps, API, we're always hiring and if we don't have something right now that fits your experience, let me know that you're interested and I'll put you on the list so that when we do have an opening, we'll reach out right away.

Corey: And we will, of course, include links to that in the [show notes 00:32:20]. Thank you so much for being so generous with your time. I appreciate it.

Michael: Thanks for having me on, Corey. It was nice talking to you.

Corey: Michael Garski, Director of Platform Engineering at Fender Musical Instruments. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. 
If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment telling me that I'm almost certainly doing that chord incorrectly.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
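
Two points from the conversation above lend themselves to a small illustration: the product catalog dropped into an S3 bucket that triggers a Lambda function, and Michael's remark that Lambda is "basically a deployment pattern" because the business logic is kept separate. The following is only a sketch under stated assumptions, not Fender's code: Michael says their functions are written in Go, whereas this uses Python for brevity, and the CSV columns (`sku`, `name`) and the helper names `parse_catalog`, `s3_objects`, `make_handler`, `fetch`, and `store` are all hypothetical. The only external fact relied on is the shape of the S3 event notification (`Records[].s3.bucket.name` and the URL-encoded `Records[].s3.object.key`).

```python
import csv
import io
from typing import Callable, Dict, List, Tuple
from urllib.parse import unquote_plus


def parse_catalog(csv_text: str) -> List[Dict[str, str]]:
    """Business logic: turn catalog CSV text into product records.

    Knows nothing about Lambda or S3. The 'sku' and 'name' columns are
    hypothetical; rows without a SKU are skipped.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        {"sku": row["sku"].strip(), "name": (row.get("name") or "").strip()}
        for row in reader
        if (row.get("sku") or "").strip()
    ]


def s3_objects(event: dict) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an S3 event notification.

    Object keys arrive URL-encoded in the event, so they are decoded here.
    """
    pairs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        pairs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return pairs


def make_handler(
    fetch: Callable[[str, str], str],
    store: Callable[[List[Dict[str, str]]], None],
):
    """Build a thin Lambda-style adapter around the business logic.

    fetch(bucket, key) returns the object body as text and store(products)
    persists the records; both are injected (hypothetical) dependencies, so
    the same parse_catalog could sit behind an HTTP route handler instead.
    """
    def handler(event, context):
        ingested = 0
        for bucket, key in s3_objects(event):
            products = parse_catalog(fetch(bucket, key))
            store(products)
            ingested += len(products)
        return {"ingested": ingested}

    return handler
```

In a real deployment `fetch` would wrap an S3 client call and `store` a database write; keeping them as parameters is what would make the move from a function handler to a route handler in a container a small change, which is the property Michael describes.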

Screaming in the Cloud
Open Core, Real-Time Observability Born in the Cloud with Martin Mao

Jun 22, 2021 41:41


About Martin

Martin Mao is the co-founder and CEO of Chronosphere. He was previously at Uber, where he led the development and SRE teams that created and operated M3. Prior to that, he was a technical lead on the EC2 team at AWS and has also worked for Microsoft and Google. He and his family are based in our Seattle hub and he enjoys playing soccer and eating meat pies in his spare time.

Links:
Chronosphere: https://chronosphere.io/
Email: contact@chronosphere.io

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked against an early version of their tool, canarytokens.org in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use those things. It's an awesome approach. I've used something similar for years. Check them out. But wait, there's more. They also have an enterprise option that you should be very much aware of: canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It's awesome. 
If you don't do something like this, you're likely to find out that you've gotten breached, the hard way. Take a look at this. It's one of those few things that I look at and say, “Wow, that is an amazing idea. I love it.” That's canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.

Corey: If your mean time to WTF for a security alert is more than a minute, it's time to look at Lacework. Lacework will help you get your security act together for everything from compliance service configurations to container app relationships, all without the need for PhDs in AWS to write the rules. If you're building a secure business on AWS with compliance requirements, you don't really have time to choose between antivirus or firewall companies to help you secure your stack. That's why Lacework is built from the ground up for the Cloud: low effort, high visibility and detection. To learn more, visit lacework.com.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I've often talked about observability, or as I tend to think of it when people aren't listening, hipster monitoring. Today, we have a promoted episode from a company called Chronosphere, and I'm joined today by Martin Mao, their CEO and co-founder. Martin, thank you for coming on the show and suffering my slings and arrows.

Martin: Thanks for having me on the show, Corey, and looking forward to our conversation today.

Corey: So, before we dive into what you're doing now, I'm always a big sucker for origin stories. Historically, you worked at Microsoft and Google, but then you really sort of entered my sphere of things that I find myself having to care about when I'm lying awake at night and the power goes out by working on the EC2 team over at AWS. Tell me a little bit about that. You've hit the big three cloud providers at this point.
What was that like?

Martin: Yeah, it was an amazing experience. I was a technical lead on one of the EC2 teams, and I think when an opportunity like that comes up on such a core foundational project for the cloud, you take it. So, it was an amazing opportunity to be a part of leading that team at a fairly early stage of AWS and also helping them create a brand new service from scratch, which was AWS Systems Manager, which was targeted at fleet-wide management of EC2 instances, so—

Corey: I'm a tremendous fan of Systems Manager, but I'm still looking for the person who named Systems Manager Session Manager because, at this point, I'm about to put a bounty out on them. Wonderful service; terrible name.

Martin: That was not me. So, yes. But yeah, no, it was a great experience, for sure, and I think just seeing how AWS operated from the inside was an amazing learning experience for me. And being able to create foundational pieces for the cloud was also an amazing experience. So, only good things to say about my time at AWS.

Corey: And then after that, you left and you went to Uber where you led development and SRE teams that created and operated something called M3. Alternately, I'm misreading your bio, and you bought an M3 from BMW and went to drive for Uber. Which is it?

Martin: I wish it was the second one, but unfortunately, it is the first one. So yes, I did leave AWS and joined Uber in 2015 to lead a core part of their monitoring and eventually larger observability team. And that team did go on to build open-source projects such as M3—which perhaps we should have thought about the name and the conflict with the car when we named it at the time—and other projects such as Jaeger for distributed tracing as well, and a logging backend system, too. So, yeah, definitely spent many years there building out their observability stack.

Corey: We're going to tie a theme together here.
You were at Microsoft, you were at Google, you were at AWS, you were at Uber, and you look at all of this and decide, “All right. My entire career has been spent in large companies doing massive globally scaled things. I'm going to go build a small startup.” What made you decide that, all right, this is something I'm going to pursue?

Martin: So, definitely never part of the plan. As you mentioned, a lot of big tech companies, and I think I always got a lot of joy building large distributed systems, handling lots of load, and solving problems at a really grand scale. And I think the reason for doing a startup was really the situation that we were in. So, at Uber as I mentioned, myself and my co-founder led the core part of the observability team there, and we were lucky to happen to solve the problem, not just for Uber but for the broader community, especially the community adopting cloud-native architecture. And it just so happened that we were solving the problem of Uber in 2015, but the rest of the industry has similar problems today.

So, it was almost the perfect opportunity to solve this now for a broader range of companies out there. And we already had a lot of the core technology built in open-source as well. So, it was more of an opportunity rather than a long-term plan or anything of that sort, Corey.

Corey: So, before we dive into the intricacies of what you've built, I always like to ask people this question because it turns out that the only thing that everyone agrees on is that everyone else is wrong. What is the dividing line, if any, between monitoring and observability?

Martin: That's a great question, and I don't know if there's an easy answer.

Corey: I mean, my cynical approach is that, “Well, if you call it monitoring, you don't get to bring in SRE-style salaries. Call it observability and no one knows what the hell we're talking about, so sure, it's a blank check at that point.” It's cynical, and probably not entirely correct.
So, I'm curious to get your take on it.

Martin: Yeah, for sure. So, you know, there's definitely a lot of overlap there, and there's not really two separate things. In my mind at least, monitoring, which has been around for a very long time, has always been around notification and having visibility into your systems. And then as the systems got more complex over time, being able to understand that and not just have visibility into it but understand it a little bit more required, perhaps, additional new data types to go and solve those problems. And that's how, in my mind, monitoring sort of morphed into observability. So, perhaps one is a subset of the other, and they're not competing concepts there. But at least that's my opinion. I'm sure there are plenty out there that would, perhaps, disagree with that.

Corey: On some level, it almost hits to the adage of, past a certain point of scale with distributed systems, it's never a question of is the app up or down, it's more a question of how down is it? At least that's how it was explained to me at one point, and it was someone who was incredibly convincing, so I smiled and nodded and never really thought to question it any deeper than that. But I look back at the large-scale environments I've been in, and yeah, things are always on fire, on some level, and ideally, there are ways to handle and mitigate that. Past a certain point, the approach of small-scale systems stops working at large scale. I mean, I see that over in the costing world where people will put tools up on GitHub of, “Hey, I ran this script, and it works super well on my 10 instances.”

And then you try and run the thing on 10,000 instances, and the thing melts into the floor, hits rate limits left and right because people don't think in terms of those scales. So, it seems like you're sort of going from the opposite end. Well, this is how we know things work at large scale; let's go ahead and build that out as an initially smaller team.
Because I'm going to assume, not knowing much about Chronosphere yet, that it's the sort of thing that will help a company before they get to the hyperscaler stage.

Martin: A hundred percent, and you're spot on there, Corey. And it's not even just a company going from small-stage, small-scale simple systems to more complicated ones. Actually, if you think about this shift in the cloud right now, it's really going from cloud to cloud-native. So, going from VMs to containers on the infrastructure tier, and going from monoliths to microservices. So, it's not even the growth of the company, necessarily, or the growth of the load that the system has to handle, but this shift to containers and microservices heavily accelerates the growth of the amount of data that gets produced, and that is causing a lot of these problems.

Corey: So, Uber was famous for disrupting, effectively, the taxi market. What made you folks decide, “I know. We're going to reinvent observability slash monitoring while we're at it, too.” What was it about existing approaches that fell down and, I guess, necessitated you folks to build your own?

Martin: Yeah, great question, Corey. And actually, it goes to the first part; we were disrupting the taxi industry, and I think the ability for Uber to iterate extremely fast and respond as a business to changing market conditions was key to that disruption. So, monitoring and observability was a key part of that because you can imagine it was providing all of the real-time visibility to not only what was happening in our infrastructure and applications, but the business as well. So, it really came out of a necessity more than anything else.
We found that in order to be more competitive, we had to adopt what is probably today known as cloud-native architecture, adopt running on containers and microservices so that we can move faster, and along with that, we found that all of the existing monitoring tools we were using weren't really built for this type of environment. And it was that that was the forcing function for us to create our own technologies that were really purpose-built for this modern type of environment that gave us the visibility we needed to be competitive as a company and a business.

Corey: So, talk to me a little bit more about what observability is. I hear people talking about it in terms of having three pillars; I hear people talking about it, to be frank, in a bunch of ways so that they're trying to, I guess, appropriate the term to cover what they already are doing or selling because changing vocabulary is easier than changing an entire product philosophy. What is it?

Martin: Yeah, we actually had a very similar view on observability, and originally we thought that it is a combination of metrics, logs, and traces, and that's a very common view. You have the three pillars, it's almost like three checkboxes; you tick them off, and you have, quote-unquote, “Observability.” And that's actually how we looked at the problem at Uber, and we built solutions for each one of those and we checked all three boxes. What we've come to realize since then is perhaps that was not the best way to look at it because we had all three, but what we realized is that actually just having all three doesn't really help you with the ultimate goal of what you want from this platform, and having more of each of the types of data didn't really help us with that, either.
So, taking a step back from there and when we really looked at it, the lesson that we learned in our view on observability is really more from an end-user perspective, rather than a data type or data input perspective.

And really, from an end-user perspective, if you think about why you want to use your monitoring tool or your observability tool, you really want to be notified of issues and remediate them as quickly as possible. And to do that, it really just comes down to answering three questions. “Can I get notified when something is wrong? Yes or no? Do I even know something is wrong?”

The second question is, “Can I triage it quickly to know what the impact is? Do I know if it's impacting all of my customers or just a subset of them, and how bad is the issue? Can I go back to sleep if I'm being paged at two o'clock in the morning?”

And the third one is, “Can I figure out the underlying root cause to the problem and go and actually fix it?” So, this is how we think about the problem now, is from the end-user perspective. And it's not that you don't need metrics, logs, or distributed traces to solve the problem, but we are now orienting our solution around solving the problem for the end-user, as opposed to just orienting our solution around the three data types, per se.

Corey: I'm going to self-admit to a fun billing experience I had once with a different monitoring vendor whom I will not name because it turns out, you can tell stories, you can name names, but doing both gets you in trouble. It was a more traditional approach in a simpler time, and they wound up sending me a message saying, “Oh, we're hitting rate limits on CloudWatch.
Go ahead and open a ticket asking for them to raise it.” And in a rare display of foresight, AWS responded to my ticket with a, “We can do this, but understand at this level of concurrency, it will cost something like $90,000 a month on increased charges, with that frequency, for that many metrics.” And that was roughly twice what our AWS bill was in those days, and, “Oh.” So, I'm curious as to how you can offer predictable pricing when you can have things that emit so much data so quickly. I believe you when you say you can do it; I'm just trying to understand the philosophy of how that works.

Martin: As I said earlier, we started to approach this by trying to solve it in a very engineering fashion where we just wanted to create more efficient backend technology so that it would be cheaper for the increased amount of data. What we realized over time is that no matter how much cheaper we make it, the amount of data being produced, especially from monitoring and observability, kept increasing, and not even in a linear fashion but in an exponential fashion. And because of that, it really switched the problem not to how efficiently can we store this, it really changed our focus of the problem to how are our users using this data, and do they even understand the data that's being produced? So, in addition to the couple of properties I mentioned earlier, around cost accounting and rate-limiting—those are definitely required—the other things we try to make available for our end-users is introspection tools such that they understand the type of data that's being produced.
It's actually very easy in the monitoring and observability world to write a single line of code that actually produces a lot of data, and most developers don't understand that that single line of code produces so much data.

So, our approach to this is to provide a tool so that developers can introspect and understand what is produced on the backend side, not what is being inputted from their code, and then not only have an understanding of that but also dynamic ways to deal with it. So that again, when they hit the rate limit, they don't just have to monitor it less, they understand that, “Oh, I inserted this particular label and now I have 20 times the amount of data that I needed before. Do I really need that particular label in there? And if not, perhaps dropping it dynamically on the server-side is a much better way of dealing with that problem than having to roll back your code and change your metric instrumentation.” So, for us, the way to deal with it is not to just make the backend even more efficient, but really to have end-users understand the data that they're producing, and make decisions on which parts of it are really useful and which parts of it they perhaps do not want or perhaps want to retain for shorter periods of time, for example, and then allow them to actually implement those changes on that data on the backend. And that is really how the end-users control the bills and the cost themselves.

Corey: So, there are a number of different companies in the observability space that have different approaches to what they solve for. In some cases, to be very honest, it seems like, well, I have 15 different observability and monitoring tools. Which ones do you replace? And the answer is, “Oh, we're number 16.” And it's easy to be cynical and down on that entire approach, but then you start digging into it and they're actually right.

I didn't expect that to be the case.
What was your perspective that made you look around the, let's be honest, fairly crowded landscape of observability companies' tools that gave insight into the health status and well being of various applications in different ways, and say, “You know, no one's quite gotten this right, yet. I have a better idea.”

Martin: Yeah, you're completely correct, and perhaps the previous environments that everybody was operating in, there were a lot of different tools for different purposes. A company would purchase an infrastructure monitoring tool, or perhaps even a network monitoring tool, and then they would have, perhaps, an APM solution for the applications, and then perhaps BI tools for the business. So, there was always historically a collection of different tools to go and solve this problem. And I think, again, what has really happened recently with this shift to cloud-native is that the need for a lot of this data to be in a single tool has become more important than ever. So, you think about your microservices running on a single container today, if a single container dies in isolation without knowing, perhaps, which microservice was running on it doesn't mean very much, and just having that visibility is not going to be enough, just like if you don't know which business use case that microservice was serving, that's not going to be very useful for you, either.

So, with cloud-native architecture, there is more of a need to have all of this data and visibility in a single tool, which hasn't historically happened. And also, none of the existing tools today—so if you think about both the existing APM solutions out there and the existing hosted solutions that exist in the world today, none of them were really built for a cloud-native environment because you can think about even the timing that these companies were created at, you know, back in the early 2010s, Kubernetes and containers weren't really a thing.
So, a lot of these tools weren't really built for the modern architecture that we see most companies shifting towards. So, the opportunity was really to build something for where we think the industry and everyone's technology stack was going to be as opposed to where the technology stack has been in the past. And that was really the opportunity there, and it just so happened that we had built a lot of these solutions for a similar type environment for Uber many years before. So, leveraging a lot of our lessons learned there put us in a good spot to build a new solution that we believe is fairly different from everything else that exists today in the market, and it's going to be a good fit for companies moving forward.

Corey: So, on your website, one of the things that you, I assume, put up there just to pick a fight—because if there's one thing these people love, it's fighting—is a use case: outgrowing Prometheus. The entire story behind Prometheus is, “Oh, it scales forever. It's what the hyperscalers would use. This came out of the way that Google does things.” And everyone talks about Google as if it's this mythical Valhalla place where everything is amazing and nothing ever goes wrong. I've seen the conference talks. And that's great. What does outgrowing Prometheus look like?

Martin: Yeah, that's a great question, Corey. So, if you look at Prometheus—and it is the graduated and the recommended monitoring tool for cloud-native environments—if you look at it and the way it scales, actually, it's a single binary solution, which is great because it's really easy to get started. You deploy a single instance, and you have ingestion, storage, and visibility, and dashboarding, and alerting, all packaged together into one solution, and that's definitely great.
And it can scale by itself to a certain point and is definitely the recommended starting point, but as you really start to grow your business, increase your cluster sizes, increase the number of applications you have, it actually isn't a great fit for horizontal scale. So, by default, there isn't really high availability and horizontal scale built into Prometheus, and that's why other projects in the CNCF, such as Cortex and Thanos, were created to solve some of these problems.

So, we looked at the problem in a similar fashion, and when we created M3, the open-source metrics platform that came out of Uber, it was also approaching it from this different perspective where we built it to be horizontally scalable, and highly reliable from the beginning, but yet, we don't really want it to be a, let's say, competing project with Prometheus. So, it is actually something that works in tandem with Prometheus, in the sense that it can ingest Prometheus metrics and you can issue Prometheus query language queries against it, and it will fulfill those. But it is really built for a more scalable environment. And I would say that once a company starts to grow and they run into some of these pain points and these pain points are surrounding how reliable a Prometheus instance is, how you can scale it up beyond just giving it more resources on the VM that it runs on, vertical scale runs out at a certain point. Those are some of the pain points that a lot of companies do run into and need to solve eventually. And there are various solutions out there, both in open-source and in the commercial world, that are designed to solve those pain points. M3 being one of the open-source ones and, of course, Chronosphere being one of the commercial ones.

Corey: This episode is sponsored in part by Salesforce. Salesforce invites you to “Salesforce and AWS: What's Ahead for Architects, Admins and Developers” on June 24th at 10 AM Pacific Time.
It's a virtual event where you'll get a first look at the latest innovations of the Salesforce and AWS partnership, and have an opportunity to have your questions answered. Plus you'll get to enjoy an exclusive performance from Grammy Award-winning artist The Roots! I think they're talking about a band, not people with super user access to a system. Registration is free at salesforce.com/whatsahead.

Corey: Now, you've also gone ahead and more or less dangled raw meat in front of a tiger in some respects here because one of the things that you wind up saying on your site of why people would go with Chronosphere is, “Ah, this doesn't allow for bill spike overages as far as what the Chronosphere bill is.” And that's awesome. I love predictable pricing. It's sort of the antithesis of cloud bills. But there is the counterargument, too, which is with many approaches to monitoring, I don't actually care what my monitoring vendor is going to charge me because they wind up costing me five times more, just in terms of CloudWatch charges. How does your billing work? And how do you avoid causing problems for me on the AWS side, or other cloud providers? I mean, again, GCP and Azure are not immune from this.

Martin: So, if you look at the built-in solutions by the cloud providers, a lot of those metrics and monitoring you get from those like CloudWatch or Stackdriver, a lot of it you get included for free with your AWS bill already. It's only if you want additional data and additional retention, do you choose to pay more there. So, I think a lot of companies do use those solutions for the default set of monitoring that they want, especially for the AWS services, but generally, a lot of companies have custom monitoring requirements outside of that in the application tier, or even more detailed monitoring in the infrastructure that is required, especially if you think about Kubernetes.

Corey: Oh, yeah.
And then I see people using CloudWatch as basically a monitoring, or metric, or log router, which at its price point, don't do that. [laugh]. It doesn't end well for anyone involved.

Martin: A hundred percent. So, our solution and our approach is a little bit different. So, it doesn't actually go through CloudWatch or any of these other inbuilt cloud-hosted solutions as a router because, to your point, there's a lot of cost there as well. It actually goes and collects the data from the infrastructure tier or the applications. And what we have found is that not only does the bill for monitoring climb exponentially—and not just as you grow; especially as you shift towards cloud-native architecture—our very first take of solving that problem is to make the backend a lot more efficient than before so it just is cheaper overall.

And we approached it that way at Uber, and we had great results there. So, when we created an—originally before M3, 8% of Uber's infrastructure bill was spent on monitoring all the infrastructure and the application. And by the time we were done with M3, the cost was a little over 1%. So, the very first solution was just make it more efficient. And that worked for a while, but what we saw is that over time, this grew again.

And there wasn't any more efficiency we could crank out of the backend storage system. There's only so much optimization you can do to the compression algorithms in the backend and how much you can get there. So, what we realized the problem shifted towards was not, can we store this data more efficiently because we're already reaching limitations there, and what we noticed was more towards getting the users of this data—so individual developers themselves—to start to understand what data is being produced, how they're using it, whether it's even useful, and then taking control from that perspective.
And this is not a problem isolated to the SRE team or the observability team anymore; if you think about modern DevOps practices, every developer needs to take control of monitoring their own applications. So, this responsibility is really in the hands of the developers.

And the way we approached this from a Chronosphere perspective is really in four steps. The first one is that we have cost accounting so that every developer, and every team, and the central observability team know how much data is being produced. Because it's actually a hard thing to measure, especially in the monitoring world. It's—

Corey: Oh, yeah. Even AWS bills get this wrong. Like if you're sending data from one availability zone to another in the same region, it charges a penny to leave an AZ and a penny to enter an AZ in that scenario. And the way that they reflect this on the bill is they double it. So, if you're sending one gigabyte across an AZ link in a month, you'll see two gigabytes on the bill and that's how it's reflected. And that is just a glimpse of the monstrosity that is the AWS billing system. But yeah, exposing that to folks so they can understand how much data their application is spitting off? Forget it. That never happens.

Martin: Right. Right. And it's not even exposing it to the company as a whole, it's to each use case, to each developer so they know how much data they are producing themselves. They know how much of the bill is being consumed. And then the second step in that is to put up bumper lanes to that so that once you hit the limit, you don't just get a surprise bill at the end of the month.

When each developer hits that limit, they rate-limit themselves and they only impact their own data; there is no impact to the other developers or to the other teams, or to the rest of the company.
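The first two steps Martin describes, per-team cost accounting and a limit that only affects the team that hit it, can be illustrated with a small sketch. This is not Chronosphere's or M3's implementation; the `TeamQuota` class, the team names, and the choice of "distinct time series" as the unit of cost are all assumptions made for illustration.

```python
from collections import defaultdict


class TeamQuota:
    """Hypothetical helper: counts distinct series per team and caps each team."""

    def __init__(self, limits):
        self.limits = dict(limits)      # team -> max distinct series (assumed unit of cost)
        self.series = defaultdict(set)  # team -> set of series already accepted

    def accept(self, team, metric, labels):
        """Return True if the sample is stored, False if this team is over its limit."""
        key = (metric, tuple(sorted(labels.items())))
        seen = self.series[team]
        if key in seen:
            return True                 # existing series: no additional cost
        if len(seen) >= self.limits.get(team, 0):
            return False                # over quota: only this team's new data is dropped
        seen.add(key)
        return True

    def usage(self):
        """Cost accounting: distinct series currently attributed to each team."""
        return {team: len(keys) for team, keys in self.series.items()}


quota = TeamQuota({"payments": 2, "search": 2})
results = [
    quota.accept("payments", "http_requests_total", {"route": "/pay"}),
    quota.accept("payments", "http_requests_total", {"route": "/refund"}),
    quota.accept("payments", "http_requests_total", {"route": "/status"}),  # third series for payments
    quota.accept("search", "http_requests_total", {"route": "/q"}),
]
print(results)        # [True, True, False, True]
print(quota.usage())  # {'payments': 2, 'search': 1}
```

The third `payments` series is rejected because that team is already at its cap, while `search` is unaffected, which is the isolation property described in the conversation.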
So, we found that those two were necessary initial steps, and then there were additional steps beyond that, to help deal with this problem.

Corey: So, in order for this to work within a multi-day lag, in some cases, it's a near certainty that you're looking at what is happening and the expense that is being incurred in real-time, not waiting for it to pass its way through the AWS billing system and then do some tag attribution back.

Martin: A hundred percent. It's in real-time for the stream of data. And as I mentioned earlier, for the monitoring data we are collecting, it goes straight from the customer environment to our backend so we're not waiting for it to be routed through the cloud providers because, rightly so, there is a multi-day or multi-hour delay there. So, as the data is coming straight to our backend, we are actively in real-time measuring that and cost accounting it to each individual team. And in real-time, if the usage goes above what is allocated, we will actually limit that particular team or that particular developer, and prevent them by default from using more. And with that mechanism, you can imagine that's how the bill is controlled and controlled in real-time.

Corey: So, help me understand, on some level; is your architecture then agent-based? Is it a library that gets included in the application code itself? All of the above and more? Something else entirely? Or is this just such a ridiculous question that you can't believe that no one has ever asked it before?

Martin: No, it's a great question, Corey, and would love to give some more insight there. So, it is an agent that runs in the customer environment because it does need to be something there that goes and collects all the data we're interested in to send it to the backend. This agent is unlike a lot of APM agents out there where it does, sort of, introspection, things like that.
We really believe in the power of the open-source community, and in particular, open-source standards like the Prometheus format for metrics. So, what this agent does is it actually goes and discovers Prometheus endpoints exposed by the infrastructure and applications, and scrapes those endpoints to collect the monitoring data to send to the backend.

And that is the only piece of software that runs in our customer environments. And then from that point on, all of the data is in our backend, and that's where we go and process it and give visibility to the end-users as well as store it and make it available for alerting and dashboarding purposes as well.

Corey: So, when did you found Chronosphere? I know that you folks recently raised a Series B—congratulations on that, by the way; that generally means, at least if I understand the VC world correctly, that you've established product-market fit and now we're talking about let's scale this thing. My experience in startup land was, “Oh, we've raised a Series B, that means it's probably time to bring in the first DevOps hire.” And that was invariably me, and I wound up screaming and freaking out for three months, and then things were better. So, that was my exposure to Series B.

But it seems like, given what you do, you probably had a few SRE folks kicking around, even on the product team because everything you're saying so far absolutely resonates with the experiences of someone who has run these large-scale things in production. No big surprise there. Is that where you are? I mean, how long have you been around?

Martin: Yeah, so we've been around for a couple of years thus far—so still a relatively new company, for sure. A lot of the core team were the team that both built the underlying technology and also ran it in production for many years at Uber, and that team is now here at Chronosphere. So, you can imagine from the very beginning, we had DevOps and SREs running this hosted platform for us.
And it's the folks that actually built the technology and ran it for years running it again, outside of Uber now. And then to your first question, yes, we did establish product-market fit fairly early on, and I think that is also because we could leverage a lot of the technology that we had built at Uber, and it sort of gave us a boost to have a product ready for the market much faster.

And what we're seeing in the industry right now is the adoption of cloud-native is so fast that it's sort of accelerating a need of a new monitoring solution that historical solutions, perhaps, cannot handle a lot of the use cases there. It's a new architecture, it's a new technology stack, and we have the solution purpose-built for that particular stack. So, we are seeing fairly fast acceleration and adoption of our product right now.

Corey: One problem that an awful lot of monitoring slash observability companies have gotten into in the last few years—at least it feels this way, and maybe I'm wildly incorrect—is that it seems that the target market is the Ubers of the world, the hyperscalers where once you're at that scale, then you need a tool like this, but if you're just building a standard three-tier web app, oh, you're nowhere near that level of scale. And the problem with go-to-market in those stories inherently seems that by the time you are a hyperscaler, you have already built a somewhat significant observability apparatus, otherwise you would not have survived or stayed up long enough to become a hyperscaler. How do you find that the on-ramp looks? I mean, your website does talk about, “When you outgrow Prometheus.” Is there a certain point of scale that customers should be at before they start looking at things like Chronosphere?

Martin: I think if you think about the companies that are born in the cloud today and how quickly they are running and they are iterating their technology stack, monitoring is so critical to that.
It's the real-time visibility of these changes that are going out multiple times a day that is critical to the success and growth of a lot of new companies. And because of how critical that piece is, we're finding that you don't have to be a giant hyperscaler like Uber to need technology like this. And as you rightly pointed out, you need technology like this as you scale up. And what we're finding is that while a lot of large tech companies can invest a lot of resources into hiring these teams and building out custom software themselves, generally, it's not a great investment on their behalf because those are not companies that are selling monitoring technology as their core business.

So generally, what we find is that it is better for companies to perhaps outsource or purchase, or at least use open-source solutions to solve some of these problems rather than custom-build in-house. And we're finding that earlier and earlier on in a company's lifecycle, they're needing technology like this.

Corey: Part of the problem I always ran into was (again, I come from the old world of grumpy Unix sysadmins) for me, using Nagios was my approach to monitoring. And that's great when you have a persistent stateful, single node or a couple of single nodes. And then you outgrow it because well, now everything's ephemeral and by the time you realize that there's an outage or an issue with a container, the container hasn't existed for 20 minutes. And you better have good telemetry into what's going on and how your application behaves, especially at scale because at that point, edge cases, one-in-a-million events happen multiple times a second, depending upon scale, and that's a different way of thinking. I've been somewhat fortunate in that, in my experience at least, I've not usually had to go through those transformative leaps.

I've worked with Prometheus, I've worked with Nagios, but never in the same shop. That's the joy of being a consultant.
You go into one environment, you see what they're doing and you take notes on what works and what doesn't, you move on to the next one. And it's clear that there's a definite defined benefit to approaching observability in a more modern way. But I despair the idea of trying to go from one to the other. And maybe that just speaks to a lack of vision for me.

Martin: No, I don't think that's the case at all, Corey. I think we are seeing a lot of companies do this transition. I don't think a lot of companies go and ditch everything that they've done. And things that they put years of investment into, there's definitely a gradual migration process here. And what we're seeing is that a lot of the newer projects, newer environments, newer efforts that have been kicked off are being monitored and observed using modern technology like Prometheus.

And then there's also a lot of legacy systems which are still going to be around and legacy processes which are still going to be around for a very long time. It's actually something we had to deal with at Uber as well; we were actually using Nagios and a StatsD Graphite stack for a very long time before switching over to a more modern tag-like system like Prometheus. So...

Corey: Oh, modern Nagios. What was it, uh… that's right, Icinga. That's what it was.

Martin: Yes, yes. It was actually the system that we were using at Uber. And I think for us, it's not just about ditching all of that investment; it's really about supporting this migration as well. And this is why both in the open-source technology M3, we actually support both the more legacy data types, like StatsD and the Graphite query language, as well as the more modern types like Prometheus and PromQL.
And having support for both allows for a migration and a transition.

And not even a complete transition; I'm sure there will always be StatsD, Graphite data in a lot of these companies because they're just legacy applications that nobody owns or touches anymore, and they're just going to be lying around for a long time. So, it's actually something that we proactively get ahead of and ensure that we can support both use cases even though we see a lot of companies trending towards the modern technology solutions, for sure.

Corey: The last point I want to raise has always been a personal, I guess, area of focus for me. I allude to it, sometimes; I've done a Twitter thread or two on it, but on your website, you say something that completely resonates with my entire philosophy, and to be blunt is why in many cases, I'm down on an awful lot of vendor tooling across a wide variety of disciplines. On the open-source page on your site, near the bottom, you say, and I quote, “We want our end-users to build transferable skills that are not vendor or product-specific.” And I don't think I've ever seen a vendor come out and say something like that. Where did that come from?

Martin: Yeah. If you look at the core of the company, it is built on top of open-source technology. So, it is a very open core company here at Chronosphere, and we really believe in the power of the open-source community and in particular, perhaps not even individual projects, but industry standards and open standards. So, this is why we don't have a proprietary protocol, or proprietary agent, or proprietary query language in our product because we truly believe in allowing our end-users to build these transferable skills and industry-standard skills. And right now that is using Prometheus as the client library for monitoring and PromQL as the query language.

And I think it's not just a transferable skill that you can bring with you across multiple companies, it is also the power of that broader community.
So, you can imagine now that there is a lot more sharing of, “Hey, I am monitoring, for example, MongoDB. How should I best do that?” Those skills can be shared because the common language that they're all speaking, the queries that everybody is sharing with each other, the dashboards everybody is sharing with each other, are all, sort of, open-source standards now. And we really believe in the power of that and we really do everything we can to promote that. And that is why in our product, there isn't any proprietary query language, or definitions of dashboarding, or [learning 00:35:39] or anything like that. So yeah, it is definitely just a core tenet of the company, I would say.

Corey: It's really something that I think is admirable. I've known too many people who wind up, I guess, stuck in various environments where the thing that they work on is an internal application to the company, and nothing else like it exists anywhere else, so if they ever want to change jobs, they effectively have a black hole on their resume for a number of years. This speaks directly to the opposite. It seems like it's not built on a lock-in story; it's built around actually solving problems. And I'm a little ashamed to say how refreshing that is [laugh] just based upon what that says about our industry.

Martin: Yeah, Corey. And I think what we're seeing is actually the power of these open-source standards, let's say. Prometheus is actually having effects on the broader industry, which I think is great for everybody. So, while a company like Chronosphere is supporting these from day one, you see how pervasive the Prometheus protocol and the query language are that actually all of these probably more traditional vendors providing proprietary protocols and proprietary query languages all actually have to have Prometheus, or not ‘have to have,' but we're seeing that more and more of them are having Prometheus compatibility as well.
And I think that just speaks to the power of the industry, and it really benefits all of the end-users and the industry as a whole, as opposed to the vendors, which we are really happy to be supporters of.

Corey: Thank you so much for taking the time to speak with me today. If people want to learn more about what you're up to, how you're thinking about these things, where can they find you? And I'm going to go out on a limb and assume you're also hiring.

Martin: We're definitely hiring right now. And you can find us on our website at chronosphere.io or feel free to shoot me an email directly. My email is martin@chronosphere.io. Definitely massively hiring right now, and also, if you do have problems trying to monitor your cloud-native environment, please come check out our website and our product.

Corey: And we will, of course, include links to that in the [show notes 00:37:41]. Thank you so much for taking the time to speak with me today. I really appreciate it.

Martin: Thanks a lot for having me, Corey. I really enjoyed this.

Corey: Martin Mao, CEO and co-founder of Chronosphere. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an insulting comment speculating about how long it took to convince Martin not to name the company ‘Observability Manager Chronosphere Manager.'

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Cyber and Technology with Mike
25 May 2021 Cyber and Tech News

Cyber and Technology with Mike

Play Episode Listen Later May 25, 2021 13:42


In today's podcast we cover four crucial cyber and technology topics, including:
1. Nagios open IT monitoring software potentially vulnerable
2. Former FEMA IT specialist admits to theft of UPMC data in 2014
3. Bose says data potentially stolen in ransomware attack
4. Florida signs bill to ban social media companies from banning political candidates

I'd love feedback; feel free to send your comments and feedback to cyberandtechwithmike@gmail.com

Hacker Public Radio
HPR3326: HPR Community News for April 2021

Hacker Public Radio

Play Episode Listen Later May 3, 2021


New hosts

Welcome to our new hosts: Anonymous Host, Trey.

Last Month's Shows

Id | Day | Date | Title | Host
3304 | Thu | 2021-04-01 | Newsflash 21/01/04 | Anonymous Host
3305 | Fri | 2021-04-02 | Nagios part 2 | norrist
3306 | Mon | 2021-04-05 | HPR Community News for March 2021 | HPR Volunteers
3307 | Tue | 2021-04-06 | Git worktree | klaatu
3308 | Wed | 2021-04-07 | let's talk about Thunderbird | Some Guy On The Internet
3309 | Thu | 2021-04-08 | Linux Inlaws S01E27: The Big Uncertainties in Life and beyond | monochromec
3310 | Fri | 2021-04-09 | Layer Masks | Ahuka
3311 | Mon | 2021-04-12 | Bradley M. Kuhn's article from 2019 on Richard M. Stallman | Anonymous Host
3312 | Tue | 2021-04-13 | COVID Doldrums | Dave Morriss
3313 | Wed | 2021-04-14 | Zoom Update | ToeJet
3314 | Thu | 2021-04-15 | Introduction... A little bit about me | Trey
3315 | Fri | 2021-04-16 | tesseract optical character recognition | Ken Fallon
3316 | Mon | 2021-04-19 | FSF and RMS on election of Richard Stallman | Anonymous Host
3317 | Tue | 2021-04-20 | Reading a manifesto: Towards A Cooperative Technology Movement | clacke
3318 | Wed | 2021-04-21 | Modding a Wii classic with a DNS exploit | Archer72
3319 | Thu | 2021-04-22 | Linux Inlaws S01E28: Politicians and artificial intelligence part 1 | monochromec
3320 | Fri | 2021-04-23 | YouTube Channels for Learning Spanish, Part 2 | Ahuka
3321 | Mon | 2021-04-26 | DNS66 URANDOM RANDOM | operat0r
3322 | Tue | 2021-04-27 | Tune system performance with tuned | klaatu
3323 | Wed | 2021-04-28 | The alternate Internet you never knew existed | klaatu
3324 | Thu | 2021-04-29 | Infosec Podcasts Part 1 News and Current Events | Trey
3325 | Fri | 2021-04-30 | Games and rules | klaatu

Comments this month

These are comments which have been made during the past month, either to shows released during the month or to past shows. There are 25 comments in total.

Past shows

There are 4 comments on 4 previous shows:

hpr3291 (2021-03-15) "The New Audacity and Batch Processing Macros" by Ahuka.
  Comment 4: brother mouse on 2021-04-04: "audacity batch"
hpr3296 (2021-03-22) "Spam Bot Honey Pot" by Rho`n.
  Comment 2: rtsn on 2021-04-13: "nice"
hpr3301 (2021-03-29) "K S P Kerbal Space Program! (Game)" by operat0r.
  Comment 1: Aaronb on 2021-04-17: "Have you seen xkcd about Kerbal"
hpr3303 (2021-03-31) "Slackware on RaspberryPi" by Brian in Ohio.
  Comment 1: Zen_floater2 on 2021-04-01: "Bravo"

This month's shows

There are 21 comments on 9 of this month's shows:

hpr3305 (2021-04-02) "Nagios part 2" by norrist.
  Comment 1: Kevin O'Brien on 2021-04-05: "Adding my endorsement"
hpr3308 (2021-04-07) "let's talk about Thunderbird" by Some Guy On The Internet.
  Comment 1: Henry on 2021-04-09: "Got some good tips ... thank you"
hpr3311 (2021-04-12) "Bradley M. Kuhn's article from 2019 on Richard M. Stallman" by Anonymous Host.
  Comment 1: brian-in-ohio on 2021-04-12: "bravery"
  Comment 2: Ken Fallon on 2021-04-13: "Long history of supporting anonymous posts"
  Comment 3: Reto on 2021-04-13: "RMS"
  Comment 4: Ken Fallon on 2021-04-14: "Interview with RMS/FSF ?"
  Comment 5: Ken Fallon on 2021-04-14: "Interview with RMS/FSF ? - links"
  Comment 6: Beeza on 2021-04-14: "Richard Stallman"
  Comment 7: Torao on 2021-04-16: "Stallman"
  Comment 8: Cfish on 2021-04-19: "The responsibility of leadership"
hpr3317 (2021-04-20) "Reading a manifesto: Towards A Cooperative Technology Movement" by clacke.
  Comment 1: clacke on 2021-04-02: "Errata"
  Comment 2: Cfish on 2021-04-20: "Great show"
  Comment 3: Kevin O'Brien on 2021-04-22: "Fantastic show!!!"
  Comment 4: clacke on 2021-04-28: "A season for manifestos"
  Comment 5: Kevin O'Brien on 2021-04-29: "Keep it up"
hpr3318 (2021-04-21) "Modding a Wii classic with a DNS exploit" by Archer72.
  Comment 1: Ken Fallon on 2021-04-21: "sdcard and a usbstick"
  Comment 2: archer72 on 2021-04-22: "Re: sdcard and a usbstick"
hpr3319 (2021-04-22) "Linux Inlaws S01E28: Politicians and artificial intelligence part 1" by monochromec.
  Comment 1: Ken Fallon on 2021-04-23: "AI is misleading AP would be better"
hpr3320 (2021-04-23) "YouTube Channels for Learning Spanish, Part 2" by Ahuka.
  Comment 1: Aaronb on 2021-04-26: "Listening to Twit podcasts"
hpr3321 (2021-04-26) "DNS66 URANDOM RANDOM" by operat0r.
  Comment 1: Thaj on 2021-04-26: "Mission accomplished"
hpr3324 (2021-04-29) "Infosec Podcasts Part 1 News and Current Events " by Trey.
  Comment 1: Xoke on 2021-04-29: "GRRRRR!"

Mailing List discussions

Policy decisions surrounding HPR are taken by the community as a whole. This discussion takes place on the Mail List which is open to all HPR listeners and contributors. The discussions are open and available on the HPR server under Mailman.

The threaded discussions this month can be found here: http://hackerpublicradio.org/pipermail/hpr_hackerpublicradio.org/2021-April/thread.html

Events Calendar

With the kind permission of LWN.net we are linking to The LWN.net Community Calendar. Quoting the site: This is the LWN.net community event calendar, where we track events of interest to people using and developing Linux and free software. Clicking on individual events will take you to the appropriate web page.

Any other business

Booth kit update

The FSF Europe have sent us a selection of stickers and leaflets to add to our booth kit(s).

Posting Anonymously

HPR supports anonymous posting:

  • Use Tor browser
  • Sign up on a temporary email address https://en.wikipedia.org/wiki/Disposable_email_address
  • Select your slot and check for the confirmation email
  • Post your show as normal - but add a note to say it's an anonymous user.

Older HPR shows on archive.org

The project to upload the older HPR shows to archive.org has been quiescent for a few years. The shows between 1 and 870 had not been uploaded prior to April 2021 (though some shows had been uploaded in batches in the early days, without notes etc). However, this old show project has been restarted this month, after new software had been written to help manage the process.
At the time of writing 65 shows in the range 1-870 have been uploaded, with notes, and with the same range of audio formats used for current shows. Since we don't want to upload shows without summaries or tags the two projects are now tied together. So we will be all the more welcoming of tag and summary updates sent to the address referenced in the next section! The plan is to report the numbers uploaded each month in the AOB section of the Community News show notes.

HPR shows on archive.org have the URL https://archive.org/details/hprXXXX where XXXX is the show number with leading zeroes. So for example, show 840's URL is: https://archive.org/details/hpr0840.

Tags and Summaries

Thanks to the following contributors for sending in updates in the past month: Dave Morriss, Windigo

Over the period tags and/or summaries have been added to 23 shows which were without them. There are now 384 shows which need a summary and/or tags.

If you would like to contribute to the tag/summary project visit the summary page at https://hackerpublicradio.org/report_missing_tags.php and follow the instructions there.

Podcast xavi74
Nagios y Grafana ?

Podcast xavi74

Play Episode Listen Later May 3, 2021 10:37


The post "Nagios y Grafana ?" was first published on Blog Xavi74.

Hacker Public Radio
HPR3305: Nagios part 2

Hacker Public Radio

Play Episode Listen Later Apr 2, 2021


I did not get any feedback on my first nagios episode, so I can only assume that I perfectly explained what nagios is. And my installation instructions were so good that no one had any questions. So I will move on to some additional nagios topics.

Why use nagios

One thing I meant to talk about but forgot in the intro is why you may want to run nagios as a hobbyist.

  • Education, learning a new technology for fun
  • Network monitoring is a valuable skill and can benefit your career if you work in IT
  • Early warning for failing hardware
  • Monitoring self-hosted applications
  • Notification for home security devices
  • IP cameras

Most of the benefits of nagios are not specific to nagios. There are plenty of other options for monitoring, and all of them are worth exploring.

Notification Options

Email

I had planned on discussing how to set up postfix to send emails. But that is such a big topic that I will have to skip it. I will instead talk about what I do to send email, and maybe you can do something similar.

Spammers have ruined the ability to directly send email.

  • Most residential ISPs block port 25 outbound to prevent malware from sending email.
  • Some virtual hosting providers may not block sending mail, but many mail servers will not accept mail from VPS IP ranges.

There are a few ways to get around this problem. I use the email delivery service Sendgrid. They do all the work of staying off the list of spammers, and most email providers trust mail sent via Sendgrid. I won't go into the instructions for configuring postfix to relay outgoing mail via Sendgrid, but their documentation is easy to follow.

There are plenty of services like Sendgrid, and most have a free tier. So unless you are blasting out alerts you probably will not have to pay. If you want to send alerts from nagios via email, I recommend finding an email sending service that works for you.

Push alerts

There are a few options (besides email) for getting alerts on your phone.
aNag

The easiest way to get alerts is probably the aNag Android app. aNag connects to the nagios UI to get status updates. It can be configured to check in periodically and then generate notifications for failed checks.

One downside to aNag is that the phone has to be able to connect to the nagios server. So, if nagios is on a private network, you will need a VPN when you are not on the same network. If you decide to put nagios on a public network, be sure to configure apache to only use HTTPS. certbot makes this really easy.

Pushover

Another option is to use a push notification service that can send notifications that are triggered by API calls. I like to use pushover.net. You pay $5 when you download the pushover app from the app store, and then notifications are sent for free. They offer a 30 day trial if you want to evaluate the service.

To use pushover, we will add a new contact to nagios. The command for the pushover contact is a script that calls the pushover API via curl. Remember from the previous episode, nagios has a conf.d directory and will load any files in that directory. So we will create a new file, /etc/nagios4/conf.d/pushover.cfg, and restart nagios. The contents of the pushover file will be in the show notes.

To use pushover for specific checks, add the contact to that check. See the example in the show notes. Or, if you want to use pushover for everything, modify the definitions for the host and service templates to use pushover as a contact.

The script that calls the Pushover API is at https://github.com/jedda/OSX-Monitoring-Tools/blob/master/notify_by_pushover.sh

Save a copy of the script in the nagios plugins directory.
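To give a feel for what "a script that calls the pushover API via curl" boils down to, here is a minimal sketch. It is not the jedda script linked above. The endpoint and the `user`, `token`, `title` and `message` fields are the ones Pushover documents for its message API; the function name `pushover_send`, its argument order, and the `CURL` override are assumptions made up for this illustration.

```shell
#!/usr/bin/env bash
# Sketch only: send one notification through the Pushover message API.
# PUSHOVER_URL is Pushover's documented message endpoint. The function name,
# argument order, and the CURL override are hypothetical, for illustration.
PUSHOVER_URL="https://api.pushover.net/1/messages.json"

pushover_send() {
    local user_key="$1" app_token="$2" title="$3" message="$4"
    # CURL can be overridden (for example with "echo") to dry-run the call
    # and see the arguments without touching the network.
    "${CURL:-curl}" --silent --show-error \
        --data-urlencode "user=${user_key}" \
        --data-urlencode "token=${app_token}" \
        --data-urlencode "title=${title}" \
        --data-urlencode "message=${message}" \
        "$PUSHOVER_URL"
}
```

Running `CURL=echo pushover_send USERKEY APPKEY Nagios "test message"` prints the arguments instead of sending anything, which is a cheap way to confirm the keys are being passed through before wiring a script like this into a nagios contact.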
pushover.cfg

    # 'notify-host-pushover' command definition
    define command{
        command_name    notify-host-pushover
        command_line    $USER1$/notify_by_pushover.sh -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$NOTIFICATIONTYPE$ Host $HOSTNAME$ $HOSTSTATE$"
    }

    # 'notify-service-pushover' command definition
    define command{
        command_name    notify-service-pushover
        command_line    $USER1$/notify_by_pushover.sh -u $CONTACTADDRESS1$ -a $CONTACTADDRESS2$ -c 'persistent' -w 'siren' -t "Nagios" -m "$HOSTNAME$ $SERVICEDESC$ : $SERVICESTATE$ Additional info: $SERVICEOUTPUT$"
    }

    define contact{
        name                            generic-pushover
        host_notifications_enabled      1
        service_notifications_enabled   1
        host_notification_period        24x7
        service_notification_period     24x7
        service_notification_options    w,c,r
        host_notification_options       d,r
        host_notification_commands      notify-host-pushover
        service_notification_commands   notify-service-pushover
        can_submit_commands             1
        retain_status_information       1
        retain_nonstatus_information    1
        contact_name                    Pushover
        address1                        {{ pushover_user_key }}
        address2                        {{ pushover_app_key }}
    }

Writing custom checks

One of the big advantages of nagios is the ability to write custom checks. In the previous episode, I mentioned that the status of the nagios checks is based on the exit code.

    Exit Code   Status
    0           OK/UP
    1           WARNING
    2           CRITICAL

So, to write a custom check, we need a script that will perform a check, and exit with an exit code based on the results of the check.

Verify recent log entry

I have a server where occasionally the syslog daemon stops running. Instead of trying to figure out why syslog keeps crashing, I wrote a script to check that the log file is being updated. The script looks for the expected log file and tests that it has been modified in the last few minutes.
The script will:

  • exit 0 if the syslog file is less than 1 minute old
  • exit 1 if the syslog file is less than 10 minutes old
  • exit 2 if the syslog file is more than 10 minutes old or does not exist

Since the server with the crashy syslog is not the same server running nagios, I need a way for nagios to execute the script on the remote server. Nagios has a few ways to run check commands on remote servers. I prefer to use ssh, but there are some disadvantages to using ssh. Specifically, the resources required to establish the ssh connection can be heavier than some of the other remote execution methods.

The check_by_ssh plugin can be used to execute check commands on another system. Typically ssh-key authentication is set up so the user that is running the nagios daemon can log in to the remote system without a password.

You can try the command to make sure it is working.

    cd /usr/lib/nagios/plugins
    ./check_by_ssh -H RemoteHost -u RemoteUser -C /path/to/remote/script/check_log_age.sh

The new command can be added to a file in the nagios conf.d directory.

    define command {
        command_name check_syslog_age
        command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -u RemoteUser -C /remote/path/check_log_age.sh
    }

After adding the command definition, check_syslog_age can be added as a service check.

The Log Check script:

    #!/usr/bin/bash
    # Report how recently today's syslog file was modified.
    TODAY=$(date +%Y%m%d)
    LOGPATH="/syslog"
    TODAYSLOG="$TODAY.log"

    # find prints the path only if the file exists and passes the age test
    if [ -n "$(find "$LOGPATH/$TODAYSLOG" -mmin -1 2>/dev/null)" ]
    then
        echo OK         # modified less than 1 minute ago
        exit 0
    elif [ -n "$(find "$LOGPATH/$TODAYSLOG" -mmin -10 2>/dev/null)" ]
    then
        echo WARNING    # modified less than 10 minutes ago
        exit 1
    else
        echo CRITICAL   # older than 10 minutes, or missing
        exit 2
    fi

Using snmp to monitor load average and disk usage

SNMP can get complicated and I have mixed feelings about using it. I am not going to go into the SNMP versions or the different authentication options for SNMP. But I will show a minimal setup that allows some performance data to be checked by nagios.

The SNMP authentication that I am demonstrating is only appropriate for isolated networks.
If you plan to use snmp over a public network, I recommend looking into more secure versions of SNMP or tunnelling the check traffic via ssh or a VPN.

If you want to learn more about SNMP, I recommend "SNMP Mastery" by Michael W Lucas. https://www.tiltedwindmillpress.com/product/snmp-mastery/

SNMP setup

First we need to configure the client to respond to SNMP requests. On Ubuntu:

    apt install snmpd

By default, snmpd listens on localhost. Replace the existing snmpd.conf with this example to set a read only community string and listen on all IP addresses. And don't forget, I do not recommend this for a public network. Restart snmpd and open port 161 if there is a firewall enabled.

    agentAddress udp:161,udp6:[::1]:161
    rocommunity NEW_SECURE_PASSWORD
    disk /

SNMP nagios checks

The nagios plugin package installs several pre-defined snmp checks in /etc/nagios-plugins/config/snmp.cfg. Look through the file to get an idea of the checks that can be performed via SNMP.

Below is an example of a client configuration that uses SNMP. If you look at the command definitions, most of them have an option to accept arguments that modify how the check is done. The argument placeholders are represented by $ARG1$. In most cases, the arguments are optional. This particular SNMP check for disk space requires an argument to complete the disk ID being checked. When the service check is defined, the arguments are separated by !
You can also see in the example how you can:

  • Add additional contacts
  • Change the check attempts (the number of retries before sending an alert)
  • Change the frequency of checks (the default is every 5 minutes)

    define host {
        host_name ServerIP
        use linux-server
    }

    define service {
        use generic-service
        host_name ServerIP
        contacts Pushover
        max_check_attempts 1
        check_interval 1
        service_description DISK
        check_command snmp_disk!NEW_SECURE_PASSWORD!1!1
        # first arg is disk number
        # command in /etc/nagios-plugins/config/snmp.cfg
    }

    define service {
        use generic-service
        host_name ServerIP
        contacts Pushover
        service_description LOAD
        check_command snmp_load!NEW_SECURE_PASSWORD
        # command in /etc/nagios-plugins/config/snmp.cfg
    }

    define service {
        use generic-service
        host_name ServerIP
        service_description Memory
        check_command snmp_mem!NEW_SECURE_PASSWORD
        # command in /etc/nagios-plugins/config/snmp.cfg
    }

    define service {
        use generic-service
        host_name ServerIP
        service_description Swap
        check_command snmp_swap!NEW_SECURE_PASSWORD
        # command in /etc/nagios-plugins/config/snmp.cfg
    }

Check servers for updates

Nagios has plugins that can check if there are system updates required. They report:

  • Number of updates
  • Check will be CRITICAL if any of the updates are security related.
  • Is a reboot required to load the latest kernel.

The check plugin is installed on the remote server. The plugin package is nagios-plugins-contrib for Debian based systems, or nagios-plugins-check-updates for Red Hat based systems.

The command definitions are below. Since the plugins take longer to run, you will probably need to modify the nagios plugin timeout.

    define command {
        command_name check_yum
        command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -t 120 -u root -C "/usr/lib64/nagios/plugins/check_updates -t120"
    }

    define command {
        command_name check_apt
        command_line $USER1$/check_by_ssh -H $HOSTADDRESS$ -t 120 -u nagios-ssh -C "/usr/lib/nagios/plugins/check_apt -t60"
    }

That's probably all the nagios I can handle for now.
Leave a comment if there are nagios topics you would like to hear about. Thanks for listening and I will see you next time.
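As a recap of the custom-check idea from these notes (do a test, print a short status line, exit 0, 1 or 2), here is a minimal sketch of the same file-age check written as reusable functions. It is an illustration under stated assumptions, not the author's script: the function names `age_status` and `check_file_age` and the 60 and 600 second thresholds are hypothetical, and `stat -c %Y` assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a file age in seconds to a Nagios status.
# The return value follows the exit-code convention 0=OK, 1=WARNING, 2=CRITICAL.
age_status() {
    local age="$1" warn_after="$2" crit_after="$3"
    if [ "$age" -lt "$warn_after" ]; then
        echo "OK - file is ${age}s old"
        return 0
    elif [ "$age" -lt "$crit_after" ]; then
        echo "WARNING - file is ${age}s old"
        return 1
    else
        echo "CRITICAL - file is ${age}s old"
        return 2
    fi
}

# Hypothetical wrapper: a missing file counts as CRITICAL, as in the notes.
# The thresholds (60s warning, 600s critical) are illustrative only.
check_file_age() {
    local file="$1" now mtime
    if [ ! -e "$file" ]; then
        echo "CRITICAL - $file does not exist"
        return 2
    fi
    now=$(date +%s)
    mtime=$(stat -c %Y "$file")   # GNU stat; BSD stat would need -f %m
    age_status "$((now - mtime))" 60 600
}
```

A check script built on this would end with something like `check_file_age "/syslog/$(date +%Y%m%d).log"; exit $?` so that nagios sees the function's return value as the exit code. Keeping the threshold logic in its own function makes it easy to try out from a shell before hooking it up to check_by_ssh.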

2.5 Admins
2.5 Admins 27: Learning to say no

2.5 Admins

Play Episode Listen Later Feb 25, 2021 29:34


Where we draw the line when supporting friends and family, monitoring SSL certs, ZFS on single drives, Chromium finally cleans up its act, and more in this feedback-filled episode.   Feedback Chromium finally fixes its DNS issue, a Samba AD server called Zentyal, firewalls, Unraid and SnapRAID, xCloud, and Nagios plugins.   Free consulting We […]

Hacker Public Radio
HPR3264: Intro to Nagios

Hacker Public Radio

Play Episode Listen Later Feb 4, 2021


Nagios Basics

Introduction

I noticed nagios on the requested topics page. I am far from being an expert with nagios and there is a lot I do not know. I have a working knowledge of most of the basic nagios principles. So, hopefully, I can give a useful introduction and review some of the principles of nagios along the way.

Nagios is a network monitoring tool. You define some things for nagios to check, and nagios will alert you if those checks fail.

Nagios has a web UI that is normally used to see the status of the checks. There are some basic administration tasks you can do from the web UI:

  • Enabling/disabling notifications
  • Scheduling downtime
  • Forcing immediate checks

Nagios is primarily configured with text files. You have to edit the nagios config files for things like:

  • Adding servers
  • Customizing commands

Nagios core vs NagiosXI

NagiosXI is the commercial version of nagios. NagiosXI requires a paid license and includes support. NagiosXI has some extra features including wizards for adding hosts and easy cloning of hosts. I have used NagiosXI, and personally don't find the extra features very useful. Probably the biggest reason to use NagiosXI is an enterprise environment that requires commercial support.

The community version of nagios is normally referred to as nagios core. This episode will focus on nagios core.

Nagios Documentation

I don't like the official nagios core documentation. A lot like man pages, it is a good reference, but can be hard to follow. Maybe it is possible for someone to read the documentation and be able to install and configure nagios for the first time. But it took me a lot of trial and error to get a functional nagios server following the nagios documentation.

Outside of the official documentation, most of the nagios installation guides I found online recommend downloading and building nagios from the nagios site. My general policy is to use OS provided packages whenever possible. Normally, sticking to packages eases long-term maintenance.
You may not always get the latest feature release, but installation and updates are usually easier. I know not everyone will agree with me here, and will want to build the latest version. Regardless of the install method, most of the nagios principles I go over will still apply.

I am making the assumption that most listeners will be most familiar with Debian/Ubuntu, so I will go over installing nagios on Ubuntu using the nagios packages from the Ubuntu repository.

Hosts and Services

Before I go over the installation, I'll talk a bit about some of the pieces that make up nagios.

Nagios checks are for either hosts or services.

From the Nagios documentation: A host definition is used to define a physical server, workstation, device, etc. that resides on your network.

Also from the nagios documentation: A service definition is used to identify a "service" that runs on a host. The term "service" is used very loosely. It can mean an actual service that runs on the host (POP, SMTP, HTTP, etc.) or some other type of metric associated with the host.

Normally, hosts are checked using ping. If the host responds to the ping within the specified time frame, the host is considered up. Once a host is defined and determined to be UP, you can optionally check services on that host.

Installation and setup

Install the packages:

    apt install nagios4

One of the dependencies is the monitoring-plugins package. I'll talk more about the monitoring-plugins package when we dig in to the checks.

The primary UI for nagios is a cgi driven web app usually served via apache. Following the nagios4 installation, the web UI isn't functional.
So we need to make a few configuration changes. The Nagios config file for Apache contains a directive that is not enabled by default.

Enable 2 Apache modules:

    a2enmod authz_groupfile
    a2enmod auth_digest
    systemctl restart apache2

Nagios authentication

Enable users in the Nagios UI: in /etc/nagios4/cgi.cfg, change the line 'use_authentication=0' to 'use_authentication=1'.

Modify Apache: in /etc/apache2/conf-enabled/nagios4-cgi.conf, change "Require all granted" to "Require valid-user". If needed, remove the IP restriction by removing the line that starts with "Require ip".

And finally, we need to add a Nagios user for the web UI's digest authentication. I normally use nagiosadmin, but it can be any username:

    htdigest -c /etc/nagios4/htdigest.users Nagios4 nagiosadmin

Restarts

Restart Apache and Nagios, and the Nagios UI will be fully functional.

Check commands

Nagios uses a collection of small standalone executables to perform the checks. Checks are either OK, Warning, or Critical, depending on the exit code of the check.

    Exit Code   Status
    0           OK/UP
    1           WARNING
    2           CRITICAL

The check commands are standalone applications that can be run independently of Nagios. Running the checks from the shell is helpful to better understand how the Nagios checks work.

The location of the check commands can vary depending on how Nagios was packaged. In this case, they are in /usr/lib/nagios/plugins.

Looking at the names of the files can give you an idea of their purpose. For example, it should be obvious what check_http and check_icmp are for.

    $ cd /usr/lib/nagios/plugins
    $ ./check_icmp localhost
    OK - localhost: rta 0.096ms, lost 0%|rta=0.096ms;200.000;500.000;0; pl=0%;40;80;; rtmax=0.218ms;;;; rtmin=0.064ms;;;;
    $ ./check_http localhost
    HTTP OK: HTTP/1.1 200 OK - 10977 bytes in 0.005 second response time |time=0.004558s;;;0.000000;10.000000 size=10977B;;;0

Most checks can be run with -h to print usage help.

The checks can be in any language as long as it is executable by the Nagios server.
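To make the exit-code convention above concrete, here is a minimal sketch of a check written as a shell function. The name check_threshold and its three arguments are hypothetical and not part of the monitoring-plugins package; the status line and the text after the "|" simply mimic the shape of the check_icmp output shown above.

```shell
#!/bin/sh
# Hypothetical helper, not part of monitoring-plugins: compares a numeric
# value against warning and critical thresholds and reports the result
# the way Nagios reads it (one line of text plus an exit code).
check_threshold() {
    value=$1
    warn=$2
    crit=$3

    # Optional performance data goes after the "|", as in the check_icmp output.
    perfdata="value=${value};${warn};${crit}"

    if [ "$value" -ge "$crit" ]; then
        echo "CRITICAL - value is ${value}|${perfdata}"
        return 2
    elif [ "$value" -ge "$warn" ]; then
        echo "WARNING - value is ${value}|${perfdata}"
        return 1
    fi

    echo "OK - value is ${value}|${perfdata}"
    return 0
}

# Example run: 75 is at or above the warning threshold (70) but below the
# critical threshold (90), so the function returns 1.
status=0
check_threshold 75 70 90 || status=$?
echo "exit code: ${status}"
```

A real standalone check would end with exit instead of return, so that the script's own exit status carries the result back to Nagios.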
Many checks are compiled C, but Perl and shell scripts are also common:

    file check_icmp
    check_icmp: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=46badf6e4322515a70d5553c8018a20e1e9b8206, for GNU/Linux 3.2.0, stripped

Nagios config files

The primary Nagios config file is /etc/nagios4/nagios.cfg. nagios.cfg has a directive that will load additional user-generated files:

    cfg_dir=/etc/nagios4/conf.d

I like to put all my additions to Nagios in this directory and use git for both version control and backup.

Nagios commands

Nagios doesn't run the check executable directly. The checks have to be explicitly defined as a command. Some predefined commands are in /etc/nagios4/objects/commands.cfg.

The Debian package monitoring-plugins-basic contains several command definitions that are loaded by nagios.cfg:

    cfg_dir=/etc/nagios-plugins/config

Let's look at ping.cfg in /etc/nagios-plugins/config for an example of how commands are defined:

    # 'check-host-alive' command definition
    define command{
        command_name    check-host-alive
        command_line    /usr/lib/nagios/plugins/check_ping -H '$HOSTADDRESS$' -w 5000,100% -c 5000,100% -p 1
    }

Commands require command_name and command_line. The command line is the path to the executable that will perform the check, plus optional arguments. Most checks require -H for the host address to check. The check-host-alive command also contains arguments to set the warning and critical thresholds with -w and -c.

The check_ping command is similar to the check-host-alive command, except it requires 2 arguments to set the warning and critical thresholds:

    define command{
        command_name    check_ping
        command_line    /usr/lib/nagios/plugins/check_ping -H '$HOSTADDRESS$' -w '$ARG1$' -c '$ARG2$'
    }

Templates

Hosts and services require a lot of reused variables. Object definitions normally use templates to avoid having to repetitively set the same variables on each host.
Nagios normally ships with predefined templates for hosts and services that will work for most cases. In Ubuntu, the templates are defined in /etc/nagios4/objects/templates.cfg. Template definitions are the same as other object definitions, except they contain "register 0", which designates the object as a template. I'll show how the templates are used when I go over the host and service definitions.

Notifications

By default, notifications are sent via email to nagios@localhost. The easiest way to get notifications is to configure the Nagios server to forward emails to a monitored email address. Since many networks block sending email directly via SMTP, email forwarding may be challenging. In a follow-up episode I will cover setting up postfix to relay mail through a mail sending service, and maybe some other methods for sending alerts.

Localhost

By default, Nagios is set to monitor localhost. Having the Nagios server monitor itself can be useful, but you probably want to add some additional servers. Have a look at /etc/nagios4/objects/localhost.cfg if you want to see how the checks for localhost are defined.

Adding a new host to monitor

We will use google.com as an example and create a file named google.cfg and place it in the cfg_dir /etc/nagios4/conf.d. The files can be named anything that ends in .cfg. My preference is one file per host that contains all the checks for that host. The content of google.cfg is included near the end of the show notes.

First, we need to define the host. host_name is the only field required to be set. The remaining requirements are met by using the generic-host template.

We can add a service check to google.com using the same file. The easiest to add is an HTTP check. host_name, service_description, and check_command have to be set; the remaining requirements are met by using the generic-service template.

Restarting Nagios

Nagios has to be reloaded to pick up the configuration changes.
Prior to restarting Nagios, you can verify the Nagios configuration is valid by running:

    nagios4 -v /etc/nagios4/nagios.cfg

This will print a summary of the configuration. Any warnings or errors will be printed at the end. Warnings are not fatal, but should probably be looked at. Errors will keep Nagios from restarting; if there are no errors, it is safe to restart Nagios.

Check the Nagios UI at http://SERVER_IP/nagios4 and you should see 2 hosts, localhost and google.com, as well as the service checks for the hosts.

Next Episode

Since I have already made the mistake of mentioning a follow-up episode, I know I am now committed to making an additional episode. Next time I will try to cover some enhancements to Nagios, including some notification options, the monitoring-plugins packages, writing custom checks, and using SNMP to monitor load average and disk usage.

Leave a comment if there are other aspects of Nagios you would like me to try to cover. No promises, but I will do my best. Thanks for listening and I will see you next time.

Files

Playbook

    ---
    - hosts: nagios
      tasks:
        - name: install nagios
          apt:
            name:
              - nagios4
            update_cache: yes
        - name: Enable the Apache2 modules
          command: a2enmod "{{item}}"
          with_items:
            - authz_groupfile
            - auth_digest
        - name: modify nagios cgi config to require user
          replace:
            path: /etc/nagios4/cgi.cfg
            regexp: 'use_authentication=0'
            replace: 'use_authentication=1'
        - name: nagios require valid user
          replace:
            path: /etc/apache2/conf-enabled/nagios4-cgi.conf
            regexp: "Require all granted"
            replace: "Require valid-user"
        - name: remove IP restriction
          lineinfile:
            regexp: "Require ip"
            path: /etc/apache2/conf-enabled/nagios4-cgi.conf
            state: absent
        - name: move auth requirements out of File restrictions
          lineinfile:
            path: /etc/apache2/conf-enabled/nagios4-cgi.conf
            regexp: '^s*
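The google.cfg file described under "Adding a new host to monitor" is not reproduced in this copy of the notes. What follows is only a rough sketch of what such a file might contain, built from the fields the notes name (host_name, service_description, check_command) and the generic-host and generic-service templates. It is not the author's file: the service description "HTTP" is a placeholder, check_http is assumed to be one of the command definitions loaded from the plugin package, and CONF_DIR is a hypothetical variable that defaults to a scratch directory so the sketch can be tried without touching a live server.

```shell
#!/bin/sh
# Sketch only: write a host and a service definition for google.com.
# CONF_DIR is a made-up variable for this example; it defaults to a scratch
# directory. On the server from these notes it would be /etc/nagios4/conf.d.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"

cat > "$CONF_DIR/google.cfg" <<'EOF'
# Host: only host_name is set here, the generic-host template supplies the rest.
define host {
    use         generic-host
    host_name   google.com
}

# Service: an HTTP check on the same host, using the generic-service template.
# check_http is assumed to be an already defined command.
define service {
    use                   generic-service
    host_name             google.com
    service_description   HTTP
    check_command         check_http
}
EOF

echo "wrote $CONF_DIR/google.cfg"
```

With the file in /etc/nagios4/conf.d, the nagios4 -v check described earlier would be the next step before restarting Nagios.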

7 Minute Security
7MS #435: Homecoming and Home ioT Security - Part 2

7 Minute Security

Play Episode Listen Later Oct 2, 2020 41:10


Hi again! It's sort of fun to release two episodes in one week for a change. If you missed part 1 of our IoT security series, check it out here. Today we dive into some free/cheap monitoring solutions you can use to keep tabs on your IoT network (or any network, really):
  • Nagios - it's old school but gets the job done. This article helped me get it going on an RPi.
  • SolarWinds IP monitor - it was quick and easy to get up and running, but the 40 monitors you're allotted get burned up pretty quickly if you have a decent number of devices to monitor.
  • PRTG - this is the winner in my book. It has a generous number of monitors, quick/easy install, and a native mobile app!

TechSNAP
422: Multipath Musings

TechSNAP

Play Episode Listen Later Feb 7, 2020 23:37


We take a look at a few exciting features coming to Linux kernel 5.6, including the first steps to multipath TCP. Plus the latest Intel speculative execution vulnerability, and Microsoft's troubled history with certificate renewal.

Devchat.tv Master Feed
DevOps 007: Monitoring in a Technical Environment

Devchat.tv Master Feed

Play Episode Listen Later Aug 27, 2019 53:13


Sponsors: CacheFly

Panel: Nell Shamrell-Harrington, Scott Nixon

Episode Summary

In this episode of the Adventures in DevOps podcast, panelists Nell Shamrell-Harrington and Scott Nixon talk about monitoring in the software world. They start the discussion by talking about the difference between monitoring and alerting. They discuss how logging comes into the picture in monitoring, the two main types of logs (structured and unstructured), log management in the DevOps environment, information storage, parsing logs and log aggregation. They list two major kinds of monitoring software, pull and push. Nell explains what they mean and how they work, and Scott gives examples of each, including syslog, healthcheck, etc. They then talk about what is meant by a "working" system, and consequently, when something can be considered non-functional. This leads to answering the important question of what exactly should be monitored. They explain to what extent one should go while monitoring and how to determine the significance level of events in general. They discuss some concepts from Mike Julian's book "Practical Monitoring", including anti-patterns such as tool obsession, what not to do in monitoring, and the fact that businesses need to customize their systems based on what works for them. They talk about the tool Nagios, the benefits of using default monitoring tools provided by native cloud systems, using monitoring as a crutch, and manual configuration. They then discuss some good practices, namely composable monitoring, performance monitoring from the users' perspective, the mantra 'buy not build', and continual improvements. They briefly touch on the topic of security in monitoring, and wrap up the episode with picks.

Links
  • Practical Monitoring - Mike Julian
  • Accelerate: The State of DevOps with Dr. Nicole Forsgren
  • High Output Management - Andrew S. Grove

Picks
  • Nell Shamrell-Harrington: Mauer Museum; Tetris 99
  • Scott Nixon: Automated Reasoning Byron Cook, Amazon | AWS re:Inforce 2019; Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones; The Obstacle Is the Way: The Timeless Art of Turning Trials into Triumph

Adventures in DevOps
DevOps 007: Monitoring in a Technical Environment

Adventures in DevOps

Play Episode Listen Later Aug 27, 2019 53:13


Sponsors: CacheFly

Panel: Nell Shamrell-Harrington, Scott Nixon

Episode Summary

In this episode of the Adventures in DevOps podcast, panelists Nell Shamrell-Harrington and Scott Nixon talk about monitoring in the software world. They start the discussion by talking about the difference between monitoring and alerting. They discuss how logging comes into the picture in monitoring, the two main types of logs (structured and unstructured), log management in the DevOps environment, information storage, parsing logs and log aggregation. They list two major kinds of monitoring software, pull and push. Nell explains what they mean and how they work, and Scott gives examples of each, including syslog, healthcheck, etc. They then talk about what is meant by a "working" system, and consequently, when something can be considered non-functional. This leads to answering the important question of what exactly should be monitored. They explain to what extent one should go while monitoring and how to determine the significance level of events in general. They discuss some concepts from Mike Julian's book "Practical Monitoring", including anti-patterns such as tool obsession, what not to do in monitoring, and the fact that businesses need to customize their systems based on what works for them. They talk about the tool Nagios, the benefits of using default monitoring tools provided by native cloud systems, using monitoring as a crutch, and manual configuration. They then discuss some good practices, namely composable monitoring, performance monitoring from the users' perspective, the mantra 'buy not build', and continual improvements. They briefly touch on the topic of security in monitoring, and wrap up the episode with picks.

Links
  • Practical Monitoring - Mike Julian
  • Accelerate: The State of DevOps with Dr. Nicole Forsgren
  • High Output Management - Andrew S. Grove

Picks
  • Nell Shamrell-Harrington: Mauer Museum; Tetris 99
  • Scott Nixon: Automated Reasoning Byron Cook, Amazon | AWS re:Inforce 2019; Atomic Habits: An Easy & Proven Way to Build Good Habits & Break Bad Ones; The Obstacle Is the Way: The Timeless Art of Turning Trials into Triumph

TechSNAP
407: Old School Outages

TechSNAP

Play Episode Listen Later Jul 11, 2019 42:31


Jim shares his Nagios tips and Wes chimes in with some modern tools as we chat monitoring in the wake of some high-profile outages. Plus we turn our eye to hardware and get excited about the latest Ryzen line from AMD.

TechSNAP Video
Old School Outages | TechSNAP 407

TechSNAP Video

Play Episode Listen Later Jul 10, 2019


Jim shares his Nagios tips and Wes chimes in with some modern tools as we chat monitoring in the wake of some high-profile outages.

All Jupiter Broadcasting Shows
Old School Outages | TechSNAP 407

All Jupiter Broadcasting Shows

Play Episode Listen Later Jul 10, 2019 42:31


Jim shares his Nagios tips and Wes chimes in with some modern tools as we chat monitoring in the wake of some high-profile outages.

Tempo Real Entrevistas Podcast
Zabbix, Nagios, Cacti ou outra ferramenta paga? - Entrevista com Lúcio Chiessi

Tempo Real Entrevistas Podcast

Play Episode Listen Later Apr 17, 2019 27:10


In this interview, our guest Lúcio Chiessi talks about Zabbix. The interview covers the following topics:
  • What Zabbix is
  • Why choose Zabbix
  • Does Zabbix really monitor everything?
  • What precautions need to be taken?
  • Best practices
  • Main advantages and differentiators
  • Useful tips
  • Documentation tip
  • Features considered success cases (for example, monitoring using SNMP)

SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast

Tracking DNS Changes https://isc.sans.edu/forums/diary/Tracking+Unexpected+DNS+Changes/24596/ SystemD/JournalD PoC Exploit https://capsule8.com/blog/exploiting-systemd-journald-part-1/ Windows Defender Boot Issues https://support.microsoft.com/en-us/help/4052623/update-for-windows-defender-antimalware-platform Mac Malware Steals Crytocurrency Exchange Cookies https://unit42.paloaltonetworks.com/mac-malware-steals-cryptocurrency-exchanges-cookies/

しがないラジオ
sp.49a【ゲスト: threetreeslight】SIerを飛び出したRepro VPoEと考える、楽しいスタートアップの歩き方

しがないラジオ

Play Episode Listen Later Jan 5, 2019 101:36


We welcomed Miki-san as our guest and talked about technical college (kosen), SIers, founding a startup, COMP, and more.

[Show Notes]
Repro (リプロ) PC-98 - Wikipedia Startup Weekend Japan Open Network Lab COMP complete nutrition food Girls' Last Tour (少女終末旅行) - Wikipedia BLAME! - Wikipedia Genius | Song Lyrics & Knowledge GhostRec.jp Nagios - Wikipedia Zabbix - Wikipedia Chef - Automate IT Infrastructure | Chef 500 Startups Japan Repro Tech Meetup - connpass Repro Inc. company information - Wantedly

You can check release information via the Twitter ID @shiganaiRadio. Please tweet your feedback with the hashtag #しがないラジオ! If you tweet your impressions, topics you would like us to talk about, or things you would like us to improve, we can make future episodes better, so we look forward to lots of feedback.

[Hosts] gami@jumpei_ikegami zuckey@zuckey_17
[Guest] threetreeslight@threetreeslight
[Equipment] Blue Micro Yeti USB 2.0 microphone 15374

The New Stack Context
The New Stack Context: Sensu Summit 2018

The New Stack Context

Play Episode Listen Later Aug 24, 2018 55:35


Hello, welcome to The New Stack Context, a podcast where we review the week's hottest news in cloud native technologies/at-scale application development and look ahead to topics we expect will gain more attention in coming weeks. This week, we recorded the podcast live at the Sensu Summit, which took place in Portland. Joining us this week are Google Developer Advocate Kelsey Hightower and Sensu CEO Caleb Hailey, to discuss the challenges of microservices monitoring. Many traditional monitoring tools, such as Nagios, or even Amazon Web Services' CloudWatch, fail to capture the full picture of what is going on within a system, the panel agreed. This week, Hailey wrote a post for TNS on how Sensu can automate the process of collecting sensor data from multiple tools and multiple formats. "Hybrid cloud monitoring becomes a lot easier to reason about when everything is a 'workflow.' It's not only easier, but the possibilities for what you can monitor (and how) are virtually endless," he wrote.

TechSNAP
Episode 380: Terminal Fault

TechSNAP

Play Episode Listen Later Aug 16, 2018 32:31


Microsoft’s making radical changes to Windows 10, and a new type of speculative execution attack on Intel’s processors is targeting cloud providers.

Category5 Technology TV (HD Video)
Monitoring Your Digital Assets with NEMS Linux - Category5 Technology TV - Episode 568

Category5 Technology TV (HD Video)

Play Episode Listen Later Aug 8, 2018 78:37


Nobody does network asset monitoring as well as Nagios, and NEMS Linux brings a fully-configured Nagios server to your Raspberry Pi. Robbie and Bekah check out the latest release of the Nagios Enterprise Monitoring Server for Raspberry Pi 3. Read the complete show notes, comment or rate this episode, view pictures and obtain links from this episode at https://category5.tv/shows/technology/episode/568/ Running time: 1 Hour 18 Minutes 37 Seconds

Iron Sysadmin Podcast
Episode 42 – Monitoring

Iron Sysadmin Podcast

Play Episode Listen Later Jul 3, 2018 68:31


http://s3.amazonaws.com/ironsysadmin/episodes/IronSysadmin-EP42.mp3

Welcome to Episode #42 (Yes, really)

News
  • https://techcrunch.com/2018/06/28/disney-imagineering-has-created-autonomous-robot-stunt-doubles/
  • https://tech.slashdot.org/story/18/06/30/028220/is-googles-promotion-of-https-misguided
  • https://thehackernews.com/2018/06/wordpress-hacking.html
  • https://thehackernews.com/2018/06/wpa3-wifi-security-standard.html
  • https://thehackernews.com/2018/06/gentoo-linux-github.html
  • http://www.foxnews.com/science/2018/07/02/time-warp-discovered-outside-las-vegas-paranormal-investigator-claims.html

Announcements
  • DC610 This Thursday! Be there!

Chat
  • https://youtu.be/DGq7l_ZsQq4

Main topic: Monitoring
  • Pinger!
  • WUG (What’s up gold)
  • Nagios (https://www.nagios.org/)
  • Icinga2 (https://www.icinga.com/)
  • PNP4Nagios (https://docs.pnp4nagios.org/)
  • Graphite (https://graphiteapp.org/)
  • Grafana (https://grafana.com/)

Intro and Outro music credit: Tri Tachyon, Digital MK 2 http://freemusicarchive.org/music/Tri-Tachyon/

Screaming in the Cloud
Episode 15: Nagios was the Original Call of Duty

Screaming in the Cloud

Play Episode Listen Later Jun 20, 2018 27:38


Let’s chat about the Cloud and everything in between. The people in this world are pretty comfortable with not running physical servers on their own, but trusting someone else to run them. Yet, people suffer from the psychological barrier of thinking they need to build, design, and run their own monitoring system. Fortunately, more companies are turning to Datadog.

Today, we’re talking to Ilan Rabinovitch, Datadog’s vice president of product and community. He spends his days diving into container monitoring metrics, collaborating with Datadog’s open source community, and evangelizing observability best practices. Previously, Ilan led infrastructure and reliability engineering teams at various organizations, including Ooyala and Edmunds.com. He’s active in the open source and DevOps communities, where he is a co-organizer of events, such as SCALE and Texas Linux Fest.

Some of the highlights of the show include:
  • Datadog is well-known, especially because it is a frequent sponsor
  • More organizations know their core competency is not monitoring or managing servers
  • Monitoring/metrics is a big data problem; Datadog takes monitoring off your plate
  • Alternate ways, other than using Nagios, to monitor instances and regenerate configurations
  • Datadog is first to identify patterns when there is a widespread underlying infrastructure issue
  • Trends of moving from on-premise to Cloud; serverless is on the horizon
  • How trends affect evolution of Datadog; adjusting tools to monitor customers’ environments
  • Datadog’s scope is enormous; the company tries to present relevant information as the scale of what it’s watching continues to grow
  • Datadog’s pricing is straightforward and simple to understand; how much Cloud providers charge to use Datadog is less clear
  • Single Pane of Glass: Too much data to gather in small areas (dashboards)
  • Why didn’t monitoring catch this? Alerts need to be actionable and relevant
  • How to use Datadog’s workflow for setting alerts and work metrics
  • Datadog’s first Dash user conference will be held in July in New York; addresses how to solve real business problems, how to scale/speed up your organization

Links: Ilan Rabinovitch on Twitter Datadog Docker Adoption Survey Results Rubric for Setting Alerts/Work Metrics Dash Conference re:Invent Nagios

Software Defined Talk
Episode 132: Capturing dumpling juice, the Pentagon selects AWS, & Thor

Software Defined Talk

Play Episode Listen Later Apr 30, 2018 65:17


Eating dumplings, it turns out, is more complicated than just sticking them in your dumpling hole, as Coté found out in Bangkok thanks to a Singaporean friend (https://www.linkedin.com/in/dennis-sze-641a5b85/). We’re live-to-tape from DevOpsDays Jakarta this episode, just Coté and Matt Ray. We discuss the Pentagon’s stubbornness of (seemingly) picking just one cloud provider for their major cloud project and then have an oddly lengthy discussion of the Marvel Cinematic Universe (https://en.wikipedia.org/wiki/Marvel_Cinematic_Universe). This episode brought to you by: Datadog! This episode is sponsored by Datadog, a monitoring platform for cloud-scale infrastructure and applications. Built by engineers, for engineers, Datadog provides visibility into more than 200 technologies, including AWS, Chef, and Docker, with built-in metric dashboards and automated alerts. With end-to-end request tracing, Datadog provides visibility into your applications and their underlying infrastructure—all in one place. Sign up for a free trial (https://www.datadoghq.com/ts/tshirt-landingpage/?utm_source=Advertisement&utm_medium=Advertisement&utm_campaign=SoftwareDefinedTalkRead-Tshirt) at www.datadog.com/sdt (http://www.datadog.com/sdt) Datadog wants you to know they monitor all kinds of data about Amazon EC2 instances (https://www.datadoghq.com/blog/ec2-monitoring/). You can try it out by signing up for a trial at www.datadog.com/sdt (http://www.datadog.com/sdt). Week’s wunderkammer SmugMug buys flickr (https://www.theregister.co.uk/2018/04/23/oath_sells_flickr_to_smugmug/) - Coté can’t download all his photos. DevOpsDays Jakarta, 2018 (https://www.devopsdays.org/events/2018-jakarta/). 
Relevant to your interests US military says “get of my lawn, you kids” to AWS mono-usage (https://www.bloomberg.com/news/articles/2018-04-16/pentagon-sticks-with-single-cloud-contract-but-doesn-t-say-why) ‘Rival contractors complain that the winner-take-all approach favors Amazon.com Inc., the biggest supplier of cloud services. But Pentagon officials made clear they have little patience for continuing debate over the issue. In response to a question on the “rationale for a single award for this contract,” the answer posted was blunt: “This rationale is not going to be published at this time.”’ More: “It was a decision the department made based on its needs, so adding context there doesn’t benefit us.” But, actually, it’s just one pick for now: “The contract, known as JEDI -- for the the Joint Enterprise Defense Infrastructure Cloud Program -- won’t prevent the Defense Department from working with other cloud vendors in the future.” IBM doth protest too much? Check out Bloomberg’s layman’s definition of cloud: “Cloud services -- in which computing power and storage are hosted in remote data centers run by a third-party company rather than on-site in locally owned machines -- can make it easier for large organizations to move and integrate data across different platforms, quickly expand the data storage it needs based on usage and make system-wide security upgrades to software.” Not too shabby, and it thankfully doesn’t mention THE CYBER. Conferences, et. al. May 15th to 18th, 2018 - Coté talking EA at Continuous Lifecycle London (https://continuouslifecycle.london/sessions/the-death-of-enterprise-architecture-defeating-the-devops-microservices-and-cloud-native-assassins/). May 16-17, 2018 - Matt presenting at Cloud Expo Hong Kong (https://www.cloudexpoasiahk.com/) May 22-25, 2018 - ChefConf 2018 (https://chefconf.chef.io/), in Chicago. June 1st, 2018 - Coté at Voxxed Days, Singapore (https://voxxeddays.com/singapore/). 
SDT news & hype Check out Software Defined Interviews (http://www.softwaredefinedinterviews.com/), our new podcast. Pretty self-descriptive, plus the #exegesis podcast we’ve been doing, all in one, for free. Keep up with the weekly newsletter (https://us1.campaign-archive.com/home/?u=ce6149b4008d62a08093a4fa6&id=5877922e21). Join us in Slack (http://www.softwaredefinedtalk.com/slack). Buy some t-shirts (https://fsgprints.myshopify.com/collections/software-defined-talk)! DISCOUNT CODE: SDTFSG (20% off) Send your name and address to stickers@softwaredefinedtalk.com (mailto:stickers@softwaredefinedtalk.com) and we will send you a sticker. If you run into Matt he’ll give you one too! Listener Feedback Icinga is the Nagios fork, so says Shaun. Recommendations Coté: Coté doesn’t remember what he recommended. Matt: Aphex Twin’s Cheetah EP (https://www.amazon.com/Cheetah-EP-Aphex-Twin/dp/B01GRA84K4/)

TechSNAP
Episode 364: The Case for Monitoring

TechSNAP

Play Episode Listen Later Apr 18, 2018 37:43


We cover all the bases this week in our TechSNAP introduction to server monitoring. Why you should monitor, what you should monitor, the basics of Nagios, the biggest drawbacks of Nagios, its alternatives, and our lessons learned from the trenches.

Paul's Security Weekly TV
Topic: Security Operations and Endpoint Protection - Enterprise Security Weekly #59

Paul's Security Weekly TV

Play Episode Listen Later Sep 2, 2017 22:19


Matt and Michael discuss JASK, Automox, and more vendors that have stood out to them in the realms of security operations and endpoint protection! Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit http://securityweekly.com/esw for all the latest episodes!

Enterprise Security Weekly (Video)
Topic: Security Operations and Endpoint Protection - Enterprise Security Weekly #59

Enterprise Security Weekly (Video)

Play Episode Listen Later Sep 2, 2017 22:19


Matt and Michael discuss JASK, Automox, and more vendors that have stood out to them in the realms of security operations and endpoint protection! Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit http://securityweekly.com/esw for all the latest episodes!

Paul's Security Weekly TV
Technical Segment: Enterprise Network Monitoring - Enterprise Security Weekly #59

Paul's Security Weekly TV

Play Episode Listen Later Sep 1, 2017 18:39


Paul asked our Twitter followers about their favorite open-source alternatives to Nagios for monitoring system and service availability, and we listened, of course! Hear Paul’s essential enterprise network monitoring tools in this segment. Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit http://securityweekly.com/esw for all the latest episodes!

Enterprise Security Weekly (Video)
Technical Segment: Enterprise Network Monitoring - Enterprise Security Weekly #59

Enterprise Security Weekly (Video)

Play Episode Listen Later Sep 1, 2017 18:39


Paul asked our Twitter followers about their favorite open-source alternatives to Nagios for monitoring system and service availability, and we listened, of course! Hear Paul’s essential enterprise network monitoring tools in this segment. Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit http://securityweekly.com/esw for all the latest episodes!

Paul's Security Weekly
Enterprise Security Weekly #59 - Protect the Data

Paul's Security Weekly

Play Episode Listen Later Aug 31, 2017 67:14


Michael and Matt join Paul to discuss security operations, endpoint protection, enterprise networking monitoring, and the latest enterprise security news on this episode of Enterprise Security Weekly! Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit https://www.securityweekly.com for all the latest episodes!

Enterprise Security Weekly (Audio)
Enterprise Security Weekly #59 - Protect the Data

Enterprise Security Weekly (Audio)

Play Episode Listen Later Aug 31, 2017 67:14


Michael and Matt join Paul to discuss security operations, endpoint protection, enterprise networking monitoring, and the latest enterprise security news on this episode of Enterprise Security Weekly! Full Show Notes: https://wiki.securityweekly.com/ES_Episode59 Visit https://www.securityweekly.com for all the latest episodes!

Enterprise Security Weekly (Video)
Tech Segment: Monitoring Infrastructure with Nagios - Enterprise Security Weekly #54

Enterprise Security Weekly (Video)

Play Episode Listen Later Jul 22, 2017 21:53


Where do the lines blur between monitoring, configuration, and vulnerability management? What is the best way to monitor systems in an enterprise? How do you manage machine-to-machine trust? Answers to all those questions and more in this segment! Full Show Notes: https://wiki.securityweekly.com/ES_Episode54 Visit http://securityweekly.com/esw for all the latest episodes!

Paul's Security Weekly TV
Tech Segment: Monitoring Infrastructure with Nagios - Enterprise Security Weekly #54

Paul's Security Weekly TV

Play Episode Listen Later Jul 22, 2017 21:53


Where do the lines blur between monitoring, configuration, and vulnerability management? What is the best way to monitor systems in an enterprise? How do you manage machine-to-machine trust? Answers to all those questions and more in this segment! Full Show Notes: https://wiki.securityweekly.com/ES_Episode54 Visit http://securityweekly.com/esw for all the latest episodes!

Enterprise Security Weekly (Audio)
Enterprise Security Weekly #54 - Complete Gibberish

Enterprise Security Weekly (Audio)

Play Episode Listen Later Jul 21, 2017 78:32


Thomas Fischer of Digital Guardian joins us to discuss GDPR, Paul talks about monitoring infrastructure with Nagios, and we discuss the latest enterprise security news! Full Show Notes: https://wiki.securityweekly.com/ES_Episode54 Visit https://www.securityweekly.com for all the latest episodes!

Paul's Security Weekly
Enterprise Security Weekly #54 - Complete Gibberish

Paul's Security Weekly

Play Episode Listen Later Jul 21, 2017 78:32


Thomas Fischer of Digital Guardian joins us to discuss GDPR, Paul talks about monitoring infrastructure with Nagios, and we discuss the latest enterprise security news! Full Show Notes: https://wiki.securityweekly.com/ES_Episode54 Visit https://www.securityweekly.com for all the latest episodes!

Software Defined Talk
Episode 97: The novel strategy of making money, and investing to do so - Amazon + Whole Foods

Software Defined Talk

Play Episode Listen Later Jun 29, 2017 63:52


Looks like we’ll be getting cheaper organic food what with Amazon buying Whole Foods. What exactly is the strategy at play here, though? Other than the obvious thing of doing online groceries, how is Amazon advantaged here such that others (like Walmart) can’t simply do this themselves? We go over these questions and how they relate to M&A in general. Plus recommendations and some podcast meta talk. Mid-roll This episode is sponsored by Casper, who’s looking for some good senior SREs (https://boards.greenhouse.io/casper/jobs/649758?gh_jid=649758). If you’re into building out and managing infrastructure that keeps code running and makes sure you can sleep soundly at night, check out the job listing, apply (https://boards.greenhouse.io/casper/jobs/649758?gh_jid=649758), and be sure to mention that you heard about it on Software Defined Talk. According to Glassdoor reviews (https://www.glassdoor.com/Overview/Working-at-Casper-EI_IE990859.11,17.htm), it’s a damn fine place to work. You can also just email jobs@casper.com and browse all their openings at casper.com/jobs (https://casper.com/jobs/). LOOK, MA! I PUT IN DATES! DevOpsDays Minneapolis, July 25 to 26th: get 20% off registration with the code SDT (https://devopsdays-minneapolis-2017.eventbrite.com?discount=SDT) (Thanks, Bridget!). 
SpringDays (https://www.springdays.io/ehome/index.php?eventid=228094&) - Atlanta (July 18th to 19th) (https://www.springdays.io/ehome/spring-days/atlanta) Matt will be at: DevSecOps at RSA Conf APJ (http://www.alldaydevops.com/blog/all-you-need-to-know-about-devops-connect-devsecops-at-rsac-singapore) Sydney Chef Meetup August 1st (https://www.meetup.com/Chef-Sydney/events/240660647/) Auckland AWS User Community August 3rd (https://www.meetup.com/AWS_NZ/events/237833579/) Brisbane Azure User Group October 11 (https://www.meetup.com/Brisbane-Azure-User-Group/events/240477415/) Podcast meta-talk Podcasts.app to be able to track what you listen to (https://cote.io/2017/06/28/apple-makes-major-podcast-updates/). Just paying for podcasts. $220m+ estimated TAM (https://cote.io/2017/06/28/podcast-market-estimated-at-over-220m/). We have a Casper ad! Amazon Buys Whole Foods This was not covered in the Mary Meeker slide-fest. Coté’s notebook on the topic (https://cote.io/2017/06/28/amazon-buying-whole-foods-notebook/). Stratechery on WF Acquisition (https://stratechery.com/2017/amazons-new-customer/) Exponent Podcast (http://exponent.fm/) What exactly are the barriers to entry here for other grocery stores. The business: online, and just the grocery store on it’s own...plus the 460+ physical stores for other goods? Barriers to entry, Amazon buyers (Whole Foods looks good now?), culture clash?, HEB love, private label BONUS LINKS! Not covered in episode. Gartner Magic Quadrant for IAAS is Here! Larry D. 
(http://www.zdnet.com/article/gartner-puts-aws-microsoft-azure-top-of-its-magic-quadrant-for-iaas/) Once again, what a change from way back when: CRN (http://www.crn.com/slide-shows/cloud/300087321/heres-who-made-gartners-2017-magic-quadrant-for-cloud-iaas.htm) The Register (http://www.theregister.co.uk/2017/06/19/gartner_confirms_what_we_all_know_aws_and_microsoft_are_the_cloud_leaders_by_a_fair_way/) Johnny Leadgen can get a copy (https://pages.awscloud.com/mq-download-report.html). On Oracle: “Gartner warns potential customers to be cautious of high-pressure sales tactics.” How Microsoft Is Shifting Focus to Open Source Link (https://thenewstack.io/microsoft-shifting-emphasis-open-source/) “Chef is used to manage thousands of nodes internally across Azure, Office 365 and Bing.” Amazon Eyeing Slack? Link (https://www.bloomberg.com/news/articles/2017-06-15/messaging-startup-slack-said-to-draw-interest-from-amazon-com) “Buying Slack would help Seattle-based Amazon bolster its enterprise services as it seeks to compete with rivals like Microsoft Corp. and Alphabet Inc.’s Google.” Walmart Buys Bonobo I’ve got a Bonobo suit I really like (https://www.geekwire.com/2017/walmart-buy-bonobos-310m-continues-e-commerce-battle-amazon/). They had ModCloth and some others. Their M&A strategy has really shifted of late. Walmart Sez Get Off the AWS Finally a reason for multi-cloud (https://thenewstack.io/wal-mart-kicks-partners-off-amazons-cloud-implications-organizations/) BigCo’s gonna bully that supply-chain. What’s Wrong with Jenkins? Jenkins is the Nagios of CI/CD (https://thenewstack.io/many-problems-jenkins-continuous-delivery/) “No toolchain is perfect, but you can achieve software delivery perfection (or something close to it, at least) when you implement the right culture.” Tools don’t substitute culture. 
Oracle’s Swinging For the Fences (and missing) Link (https://www.theregister.co.uk/2017/06/13/specsavers_says_no_to_oracle_cloud/) “He was also unwilling for Specsavers to become a guinea pig for Oracle's cloud.” Ubuntu Mobile Post Mortem Not much strategy… (http://www.lieberbiber.de/2017/06/20/my-ubuntu-for-mobile-devices-post-mortem/) Serverless and the Death of DevOps Link (http://redmonk.com/jgovernor/2017/06/02/serverless-and-the-the-death-of-devops-can-you-not/) Spoiler: “DevOps is the ultimate reactive, or event-driven, tech use case. It’s not going anywhere” State of DevOps 2017 Report Johnny Leadgen to the rescue (https://puppet.com/resources/whitepaper/state-of-devops-report)! Commercial Open Source Software Companies Link (https://docs.google.com/spreadsheets/u/1/d/17nKMpi_Dh5slCqzLSFBoWMxNvWiwt2R-t4e_l7LPLhU/htmlview#gid=0) A bit of sourcing on the numbers would be valuable Glad Chef’s not on the list, wouldn’t want to comment on the numbers Cloud Foundry Summit A whole mess of videos! 121 of them. (https://www.youtube.com/watch?list=PLhuMOCWn4P9hTlDEWJZV8JbVsW01avHF1&v=em-W0rVbCLc) Heptio Out of Stealth Mode with K8s Management Tool TheNewStack covere (https://thenewstack.io/heptio-comes-stealth-mode-ksonnet/) Official page (http://ksonnet.heptio.com/) File under “It didn’t already do that. I see.” Not sure this qualifies as “coming out of stealth”, everyone knows they work on open source K8s. I’m not seeing a monetization strategy yet beyond support & training. Not that there’s anything wrong with that, but they raised $8.5 for their A-round BMC Software Exploring Merging with CA STOP THE PRESSES! TERRIBLE MEETS TERRIBLE (https://mobile-reuters-com.cdn.ampproject.org/c/mobile.reuters.com/article/amp/idUSKBN19C036)? 
So far, no confirmation, but (https://www.streetinsider.com/Analyst+Comments/CA+Technologies+%28CA%29%3A+Anti+Trust+Should+Not+Prevent+CABMC+Merger+-+Bernstein/13052376.html): “While the two companies were once dominant in the systems management industry, the analyst notes that CA and BMC have 7.5% and 8% share respectively as of FY16 which combined would put them on a near even footing with IBM, the largest vendor, at 15%.” “There are also many other vendors in the market including MSFT (7%) and NOW (5%) so anti trust concerns should not be an issue.” High Level Kubernetes Overview Link (https://jvns.ca/blog/2017/06/04/learning-about-kubernetes/) “Basically Kubernetes is a distributed system that runs programs (well, containers) on computers. You tell it what to run, and it schedules it onto your machines.” More on Service Meshes From James Governor, RedMonk (https://redmonk.com/jgovernor/2017/05/31/so-what-even-is-a-service-mesh-hot-take-on-istio-and-linkerd/) Recommendations Brandon: The Scholar and the Drop Out podcast (https://itunes.apple.com/us/podcast/the-scholar-and-the-drop-out/id1143931540?mt=2); Coté’s add-on: Karl Lagerfella’s day (http://www.harpersbazaar.com/fashion/trends/a865/24-hours-with-karl-lagerfeld-0412/), no exercise and long night-shirts. Matt: Commando: Johnny Ramone’s Autobiography (https://www.amazon.com/dp/B007IV89TS/) Coté: Gulf Shores, Alabama; Hillbilly Elegy (http://amzn.to/2tpO6Gm) and “The Dead Pig Collector.” (http://amzn.to/2tpoU2L)

Software Defined Talk
Episode 87: Snap's cloud billions, Google's social, Monitoring Startups considered hard, DHS wants your passwords

Software Defined Talk

Play Episode Listen Later Feb 11, 2017 59:11


Snap is looking to spend billions on AWS and Google Cloud over the next five years. We talk about what exactly that could be for, then check in with Google's social strategy and thermostat strategies; meanwhile, the America Fuck Yeah crew wants to start gathering passwords at the border. Also, Brandon lays out the case that an open-core monitoring startup is a hard row to hoe. Also, Baltimore is not in Maine. (But Coté is pretty sure it actually is.) Mid-roll Coté: we're a media sponsor for DevOpsDays Baltimore (https://www.devopsdays.org/events/2017-baltimore/welcome/), March 7th to 8th. No discount code yet, but we're getting one. Coté: Come see me talk at the Austin Cloud Meetup, Feb 22nd (https://www.meetup.com/Austin-Cloud-Native-Meetup/events/237172788/) Matt: Microsoft Ignite Australia: Chef will have a booth & a talk (https://events.chef.io/events/microsoft-ignite-australia/) ChefConf ChefConf 2017 Teaser (https://www.youtube.com/watch?v=DhHpt-Xhj84) Coté: check out Pivotal's DIY platform paper (http://softwaredefinedtalk.com/diyplatform). tl;dr: for $7m/year with a two-year on-ramp, you could build your own, or just buy Pivotal Cloud Foundry. Many of our customers have gone down this path and ended up not wanting to support the life of their own platform...which doesn't match the pace of innovation that the Cloud Foundry community can follow. Check out softwaredefinedtalk.com/diyplatform (http://softwaredefinedtalk.com/diyplatform). SnapChat's S-1 The S1 (https://www.sec.gov/Archives/edgar/data/1564408/000119312517029199/d270216ds1.htm) "We had 158 million Daily Active Users on average in the quarter ended December 31, 2016" "We have committed to spend $2 billion with Google Cloud over the next five years." - perhaps 10% of their billing. Also (http://venturebeat.com/2017/02/09/snap-will-spend-1-billion-on-aws-through-2021/): "Snap will spend $1 billion on AWS through 2021." Coté Show interview with former cloud boy, JJ (http://www.cote.show/21). 
The McLaughlin Group covers Google: What's up with them! Robots opening doors (https://www.youtube.com/watch?v=NeFkrwagYfc&index=5&list=RDYEjQMMhDkjU). Google, Nest, and DropCam (https://arstechnica.com/gadgets/2016/06/nests-time-at-alphabet-a-virtually-unlimited-budget-with-no-results/) - despite rocky start, maybe it's just a slow ramp-up, they have 50% y/y growth. People think GCP is the shit. "Purity vs. pragmatism." Corrections "Barra-mundi" (https://twitter.com/owenhollands/status/826326363367837696) Pronunciation tips (https://www.youtube.com/results?search_query=barramundi+pronunciation) Thing to get angry about this week DHS considering asking foreigners for passwords (https://www.theregister.co.uk/2017/02/08/dhs_wants_enhanced_digital_vetting/) I mean, really? A criminal is just gonna let you see their stuff? They'll just delete it, set up fake accounts, etc. It's not like popping the trunk for a thief and finding lock picks and guns in the boot: with digital crime tools and weapons, you can hide and subterfuge. And then the only people getting harmed are innocent people. What the fuck is wrong with these people, and more importantly the shit-for brains who voted for them? (How can we de-shit those brains for 2018?) Tweet about 3D chess of this meaning the government can't hack into your stuff...or can they?!?! CNCF Buys RethinkDB's Code and Donates to the Linux Foundation Not just marketing, but actually "freeing" code (https://www.cncf.io/blog/2017/02/06/cncf-purchases-rethinkdb-source-code-contributes-linux-foundation-apache-license) Switched from AGPLv3 to ASLv2 "Abby," (https://twitter.com/ab415/) head of the Cloud Foundry Foundation. See a recent discussion (https://soundcloud.com/pivotalconversations/filling-the-developer-skills-gap-with-abby-kearns-and-james-governor) with her and RedMonk's James Governor on developer skills in large organizations. $2.5 million VC for Sensu! Nagios replacement!!! 
(https://sensuapp.org/blog/2017/01/30/introducing-sensu-inc.html) Brandon has some advice (https://sensuapp.org/features#compare). BONUS LINKS! Not covered in episode Microsoft does Azure Patent Indemnification "The system is supposed to help ease the transition to the cloud by giving companies extra peace of mind. Right now, lawsuits over intellectual property relating to open source technology in the cloud are rare" Link (http://www.cio.com/article/3167724/cloud-computing/microsoft-launches-new-azure-intellectual-property-protections.html) "those companies operating in a multi-cloud configuration won't be entirely covered" Attempting to Categorize the Cloud Native Landscape Project in GitHub (https://github.com/cncf/landscape) Cloud native Landscape diagram (https://raw.githubusercontent.com/cncf/landscape/master/landscape/CloudNativeLandscape_v0.9.3.jpg) Cloud Displacing Intel's Enterprise Sales "Tectonic shifts in the pattern of Intel's business show the devastating speed at which cloud is displacing traditional enterprise server sales" Link (http://diginomica.com/2017/02/01/tectonic-shifts-at-intel-as-cloud-rips-into-enterprise-server-sales/) Slack Enterprise Grid should make user management easier Link (http://www.itpro.co.uk/collaboration-software/28001/slack-enterprise-grid-should-make-user-management-easier) Uber Steers Away from Trump "More than 200,000 customers had deleted their accounts." (Link (https://www.nytimes.com/2017/02/02/technology/uber-ceo-travis-kalanick-trump-advisory-council.html)) "Many employees were not satisfied with his answer. On Wednesday, Uber staff members followed up by circulating a 25-page Google document titled "Letters to Travis" to tell the chief executive how and why his willingness to engage with the administration had affected them." Puppet adds two vice presidents, hiring from Hewlett-Packard and EMC "Puppet replaced nearly its entire executive team in 2016, including its chief executive and chief financial officers. 
It hired six vice presidents last year." (Link (http://www.oregonlive.com/silicon-forest/index.ssf/2017/02/puppet_adds_two_vice_president.html)) Rackspace lays off 6% "Since being taken private [by Apollo], Rackspace has been working to trim its annual budget by 7%, or $100 million, according to documents filed with the Securities and Exchange Commission." (Link (https://therivardreport.com/rackspace-lays-off-200-locals-in-companywide-cuts/)) More figures from Barb Darrow (http://fortune.com/2017/02/08/rackpace-layoffs/). Brief 451 coverage from Al (https://451research.com/report-short?entityId=91609&type=mis&alertid=299&contactid=0033200001wgKCKAA2&utm_source=sendgrid&utm_medium=email&utm_campaign=market-insight&utm_content=newsletter&utm_term=91609-Rackspace+lays+off+6%25+of+workforce): "After eight years as a public company, Rackspace went private in August 2016 in $4.3bn leveraged buyout with Apollo Global Management (https://451research.com/report-short?entityId=90092)." "Rackspace just announced a layoff of 6% of its 4,600 employees" "The company is expected to exceed $2bn in revenue and top 33% EBITDA margin for 2016." Meanwhile, AWS at ~$10bn for 2016 with something like 20-23% profit margin (OpInc based on 2016Q4 numbers (http://venturebeat.com/2017/02/02/aws-posts-3-53-billion-in-revenue-in-q4-2016-up-47-from-last-year/)), Azure and GCP catching up (http://www.geekwire.com/2017/cloud-report-card-amazon-web-services-12b-juggernaut-microsoft-google-gaining/): MSFT is probably $5-6bn, depending on how you categorize it. GCP probably $3bn at most (they don't break it out)? Other thing to get worked up about: eliminating remote work IBM on that colo shit (https://twitter.com/cote/status/829739491850022912) Brandon is safe! (He lives in Austin.) Coté: I won't deny that working in smelling range is the best. But, the gains never feel like enough to enforce it. Plus, mega-city congestion and resulting classist systems, cf. 
The Wealth of Humans (http://thenewstack.io/review-automation-wake-call-fill-vacuum-tech-ethics/). It's a problem that should be solved, not embraced. Recommendations Matt: Manly Daily newspaper, so much unbridled snark. Link (http://www.dailytelegraph.com.au/newslocal/manly-daily/naked-ice-suspect-swings-into-karate-kid-mode/news-story/03245799b7b0a33001c004e96c9765fd) RTJ on NPR (http://uproxx.com/realtalk/run-the-jewels-npr-tiny-desk-concert-video/). I'm not sure I can pull this off (http://uglyxmasrashie.com.au/). Coté: Ezra Klein interview with Kara Swisher (https://overcast.fm/+F_9GoG-WU). She's inspiring in several ways, not least of which is modeling a way to be politely strident and opinionated: the opposite of impostor syndrome (https://en.wikipedia.org/wiki/Impostor_syndrome). Also, his talk with the Hillbilly Elegy guy (http://www.vox.com/2017/2/2/14404770/jd-vance-trump-hillbilly-elegy-ezra-klein-show). I mean: most of the whole podcast, just skip the ones that look trivial and repetitive, e.g., we get it: Trump is a lunatic (https://twitter.com/TrumpDraws/status/830115871657910272). (See The Weeds (http://www.vox.com/the-weeds) and "The Nate Silver podcast." (https://fivethirtyeight.com/tag/politics-podcast/)) Brandon: Skiing in Solitude, Utah (https://skisolitude.com/). The Daily podcast (https://overcast.fm/itunes1200361736/the-daily), from NYT.

SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast

Malicious JavaScript Bypasses UAC https://isc.sans.edu/forums/diary/UAC+Bypass+in+JScript+Dropper/21813/ Skype Unauthorized API Access Blocked https://www.trustwave.com/Resources/SpiderLabs-Blog/A-Backdoor-in-Skype-for-Mac-OS-X/?page=1&year=0&month=0 Facebook Announces Certificate Transparency Monitoring Tool https://www.facebook.com/notes/protect-the-graph/introducing-our-certificate-transparency-monitoring-tool/1811919779048165 Another Tor Browser (and Firefox) Bug Fixed https://blog.torproject.org/blog/tor-browser-608-released Cheap Android Phones Arrive With Malware Preinstalled https://news.drweb.com/show/?i=10345&lng=en Exploit for Nagios https://legalhackers.com/advisories/Nagios-Exploit-Command-Injection-CVE-2016-9565-2008-4796.html

SANS Internet Stormcenter Daily Network/Cyber Security and Information Security Stormcast

Malicious JavaScript Bypasses UAC https://isc.sans.edu/forums/diary/UAC+Bypass+in+JScript+Dropper/21813/ Skype Unauthorized API Access Blocked https://www.trustwave.com/Resources/SpiderLabs-Blog/A-Backdoor-in-Skype-for-Mac-OS-X/?page=1&year=0&month=0 Facebook Announces Certificate Transparency Monitoring Tool https://www.facebook.com/notes/protect-the-graph/introducing-our-certificate-transparency-monitoring-tool/1811919779048165 Another Tor Browser (and Firefox) Bug Fixed https://blog.torproject.org/blog/tor-browser-608-released Cheap Android Phones Arrive With Malware Preinstalled https://news.drweb.com/show/?i=10345&lng=en Exploit for Nagios https://legalhackers.com/advisories/Nagios-Exploit-Command-Injection-CVE-2016-9565-2008-4796.html

Category5 Technology TV (HD Video)
Nagios Enterprise Monitoring on a Raspberry Pi 3 - Category5 Technology TV - Episode 478

Category5 Technology TV (HD Video)

Play Episode Listen Later Nov 16, 2016 91:33


Nagios lets you keep a close eye on your network assets. Computers, web sites, services, disk space, and more. NEMS brings Nagios to the Raspberry Pi 3 micro computer making it more economical, energy efficient, and super easy to deploy... not to mention much easier to use. Read the complete show notes, comment or rate this episode, view pictures and obtain links from this episode at https://category5.tv/shows/technology/episode/478/ Running time: 1 Hour 31 Minutes 33 Seconds

Front End Happy Hour
Episode 016 - Spilled beer and epic fails

Front End Happy Hour

Play Episode Listen Later Sep 25, 2016 41:46


As engineers, we build things, but we also make mistakes or introduce bugs into the software we write. In this episode, we share horror stories of things that we’ve experienced in our careers and how we’ve learned from them. Hopefully, this episode helps you avoid making similar mistakes. Items mentioned in the episode: Flash, Sass, Python, PHP, Chartbeat, SRE, Reddit Marketplace, Full Stack TO, Wordpress, Nagios, Jenkins Panelists: Ryan Burgess - @burgessdryan Augustus Yuan - @augburto Jem Young - @JemYoung Brian Holt - @holtbt Sarah Federman - @sarah_federman Picks: Ryan Burgess - React plus X: Best Practices for Reusable UI Components - Mars Jullian Ryan Burgess - React Rally 2016 videos Augustus Yuan - I got scammed by a Silicon Valley startup Augustus Yuan - React Game Kit Jem Young - A Very Secret Service Jem Young - APEX: The Story of the Hypercar Brian Holt - Polarheart Brian Holt - Traveling Sarah Federman - GitHub Universe conference Sarah Federman - Oh Shit Git

RunAs Radio
Microsoft Operations Management Suite with Cameron Fuller

RunAs Radio

Play Episode Listen Later Jan 27, 2016 33:45


Have you looked at Microsoft Operations Management Suite (OMS)? Richard chats with Cameron Fuller about OMS and its relationship with System Center Operations Manager - other than caring about operations, they aren't the same thing at all. Cameron discusses the cloud-centric nature of OMS, but not necessarily an Azure-centric view: OMS can help with instrumentation and management of Amazon's cloud as well as other platforms as a whole. While OMS will integrate with Operations Manager, it will also take data from Nagios, which is a great instrumentation system from the open source world. The conversation also digs into different approaches to automation and backup - OMS covers a ton of important functionality!

DevOps Days Podcast
2015 - DevOpsDays Minneapolis - Helping developers monitor their own application

DevOps Days Podcast

Play Episode Listen Later Nov 9, 2015 31:06


I'm a developer. I barely know what Nagios is, let alone how to set it up or configure new alerts. But I do know a lot about the application I'm working on, and I know how to code. By building a framework for easily adding new monitoring rules, the operations team at Swiftype has opened up application-level monitoring for the whole development team. I'll talk about the tools we wrote and explain how they allow developers to easily add new monitoring checks that probe our application (including web services, queues, and database) and alert the team by email, chat, or phone. I'll show how to use the monitoring framework we wrote, but I'll also use this collaboration as a jumping off point to discuss how I think developers and operations can work together to build software faster and keep it reliable, based on our experiences at Swiftype.

Devops Mastery
Devops Mastery - Episode 15 Monitoring DevOps Tools

Devops Mastery

Play Episode Listen Later Jun 11, 2014 22:58


If you think there are a lot of tools to configure your systems, you haven't looked at the tools available to monitor them. The set is so large that it is easy to get overwhelmed. So again in this article I am going to give you the list of questions I use to narrow the field, and then a list of my favorites.

Is it agent-based, agentless, or hybrid? As with most configuration management tools, this question cuts both ways. The best agent-based tools have well-documented deployment paths that use various configuration management tools. For instance, they will have Chef or Puppet packages that cut down your time to deploy them tremendously. The choice here comes down to how much time you have to deploy the tool and how fast a response you need from it. Agent-based tools are faster in most cases. Agentless tools rely on some form of remote execution, such as SSH or remote PowerShell, and an SNMP (Simple Network Management Protocol) agent. Because the server in an agentless system has to do all the polling work, these tools tend to be more complex to scale. They can also carry more risk, because you have to allow more ports through your firewall. Hybrids let you deploy the tool in different ways depending on the security requirements, so for medium and large companies they tend to be a better choice.

How does reporting work? Reporting is what you need the tool to do, so paying attention to it is critical. The tools vary widely in the number and type of standard reports they include, and in how easy it is to build custom reports. More and more often, monitoring tools are paired with a reporting system to handle this. Writing custom reports can be as simple as using a GUI, or as difficult as writing them in a DSL (Domain Specific Language) or a traditional programming language. 
If your business needs reports from your systems, be sure to confirm that the tool can produce the ones it needs. For instance, can you easily get a report that tells you how many people failed to sign up for your mailing list? Can you tell where people are stopping or failing to complete an action?

Does it do alerting, and if so, how can you be alerted? Alerting sounds like a no-brainer, but not all tools do it. Some tools are built only to display a set of stats for people to analyze, which normally makes them easier to deploy and configure. At the other end of the scale are tools that try to predict failures and alert you before the problem happens. This sounds great, and is cool, but you need to know a lot about your environment to set the boundaries between good and bad events. That means it will take a lot of time to tune properly and remove false positives. Also, can an alert trigger an action? If it can, you can automate simple things and free a human to sleep or do something more productive. If the tool can tell you that the disk is filling up, can it also run the steps your team would take to free up space? This will help a lot with work-life balance.

What dashboards are available out of the box? Can you customize them simply? Most of the tools come with a set of what are called canned dashboards. When you are starting out with monitoring tools, it's best to choose a tool with more of them rather than fewer, as long as they tell you something. If the tool has a great dashboard for monitoring Java applications but your company writes its apps in Ruby, what good is it to you? All of them will let you customize these dashboards. You can roll up the stats so that one chart shows a stat across an entire environment (Development, Test, Production). Be careful, though: as techs we love our data, and chart customization can get out of hand. 
Over time you will want to add custom dashboards to make it easier to troubleshoot your devices.

How resource intensive is it on the server, the network, and the clients being monitored? This is another one of those "it depends" discussions. If you are only monitoring a small number of things (computers, network equipment, etc.), this is less of a concern. You still always need to keep it in mind, because your first instinct is going to be to monitor everything, both in terms of devices and in terms of data points. I have seen, and caused, situations where we monitored ourselves to death. For instance, at one company where we had limited bandwidth to our remote sites, we overwhelmed the network with monitoring traffic alone. There was no bandwidth left for little things like file transfers and pulling up the company intranet. The problem is that, until you do a proof of concept with the tool, you often cannot answer this question. As a general rule, agent-based tools can help a lot because they only need to send changes and not everything. In all cases, though, you need to be sure that you are going to get something from the data you collect. Collecting too much also makes it harder to filter when the time comes to create dashboards and reports.

The best way to handle most of these issues is to define a set of things you know you need to monitor, things you think you want to monitor, and things you know you don't need to monitor. Then apply that list to the questions above. It's a simple flow: can I get the data, can I report on it, can I make a dashboard for it, and finally, will we have enough resources for all of it?

The problem I have with this set of tools is that they all have a high level of complexity during the implementation phase. Even the simplest of them can take several person-weeks to get set up correctly and start returning on the investment. Once you do have it set up, you will be amazed at how much it helps you become more efficient. 
You should also be open to deploying multiple tools in this class. Monitoring a multi-tiered application completely may seem easy at first, but it is difficult to do accurately. Keep in mind that monitoring anything is a complex process. It is not uncommon for companies to deploy two or three tools to meet all of their monitoring needs at a complexity level that makes sense for them. You may have one tool for monitoring basic information like disk space and CPU utilization, another to monitor application health, and a third to monitor user behavior.

OK, so what are my recommendations? Here are a few of my favorites.

Nagios (http://www.nagios.org/) - This is where a lot of the following tools started. It is a great tool, but as the number of your systems increases, its configuration can become difficult to manage effectively. That is the problem the next two set out to solve.

Zenoss (http://www.zenoss.com/documents/datasheet_core_commercial_compare.pdf) - This is a freemium application with a community version that does the basics well. The commercial version adds more analysis and optimization information. Check the linked PDF for a more detailed explanation.

Groundwork (http://www.gwos.com/pricing/enterprise/) - This tool takes a lot of the hard edges off of Nagios. They continue to add features and let you monitor your first 50 hosts for free with their enterprise tool.

We will be adding a link to the site in the coming months that lists as many tools as we can find, along with our opinion of each. These are a great starting point, but before you make a decision, look at others and make sure they don't better serve your monitoring needs. Our sister site has reviews of Zenoss and Groundwork that are worth a look even if they are somewhat old at this point.
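To make the alerting question more concrete, here is a minimal sketch of a disk-space check written in the style of a Nagios plugin. It assumes the widely used plugin convention of exit codes 0 (OK), 1 (WARNING), 2 (CRITICAL), and 3 (UNKNOWN); the function names and the 80/90 percent thresholds are illustrative choices, not values from the article.

```python
import shutil

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_usage(percent_used, warn=80.0, crit=90.0):
    """Map a disk-usage percentage to a (code, label) pair.

    The default thresholds are placeholders; tune them per environment.
    """
    if percent_used >= crit:
        return CRITICAL, "CRITICAL"
    if percent_used >= warn:
        return WARNING, "WARNING"
    return OK, "OK"


def check_disk(path="/", warn=80.0, crit=90.0):
    """Return (exit_code, status_line) for the filesystem holding `path`."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # Path missing or unreadable: report UNKNOWN rather than guessing.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, f"DISK UNKNOWN - {path} reports zero capacity"
    percent_used = 100.0 * usage.used / usage.total
    code, label = classify_usage(percent_used, warn, crit)
    return code, f"DISK {label} - {path} is {percent_used:.1f}% full"


def main(argv):
    """Print the status line and return the exit code for the given path.

    A real plugin script would end with: sys.exit(main(sys.argv))
    """
    path = argv[1] if len(argv) > 1 else "/"
    code, message = check_disk(path)
    print(message)
    return code
```

Keeping the threshold logic in `classify_usage` separate from the filesystem call makes it easy to test without touching a real disk. An "alert triggers an action" setup would hang a cleanup step off the WARNING or CRITICAL result, which is left out here.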

WebObjects Podcasts
Nagios for WO systems

WebObjects Podcasts

Play Episode Listen Later Apr 14, 2014 25:03


DevOps Дефлопе подкаст
007 - DevOps Meetup at Parallels

DevOps Дефлопе подкаст

Play Episode Listen Later Mar 26, 2014 51:13


News: Vagrant 1.5 and Vagrant Cloud brought us a lot of good stuff! LXC 1.0.1 was released on March 6, and a series of 10 articles was published to mark the occasion. Chef metal. Thoughts on NixOS. Chef Sugars. Parkinson's Law and DevOps. Vagrant Parallels. The "other" Go became free. Getting Started with Docker. "Why I won't stop using Nagios", from an engineer at Etsy. Continuous integration with Jenkins and Docker. Discussion: Interview with Oleg from Cologne. DevOps meetup at Parallels: Meetup website. Video of the talk on using Vagrant at Express 42. Video of the talk on Vagrant for Parallels Desktop. Video from the workshop. Meetup website (materials will be posted later).

Paul's Security Weekly
Paul's Security Weekly - Episode 174 Part 1 - November 5, 2009

Paul's Security Weekly

Play Episode Listen Later Nov 8, 2009 57:35


The Security Weekly Crew interviews Ethan Galstad, the founder of the Nagios open source project! Full Show Notes Hosts: Larry "HaxorTheMatrix" Pesce, Paul Asadoorian, John Strand, Mick Douglas, Carlos "Dark0perator" Perez

Practical Operations Podcast Episode Feed
Episode 39 - Is Nagios Viable?

Practical Operations Podcast Episode Feed

Play Episode Listen Later Dec 31, 1969 25:55


Where we discuss the Nagios monitoring project, and how relevant it is in this day and age. (Hint: we think the answer is YES) Comments for the episode are welcome - at the bottom of the show notes for the episode there is a Disqus setup, or you can email us at feedback@operations.fm Links for Episode 39: Nagios Official Website Nagios Wikipedia Thruk pnp4nagios Livestatus Ninja Icinga2 Distributed Monitoring