Software at Scale


Software at Scale is where we discuss the technical stories behind large software applications. www.softwareatscale.dev

Utsav Shah


    • Latest episode: Aug 5, 2024
    • New episodes: infrequent
    • Average duration: 57m
    • Episodes: 60



    Latest episodes from Software at Scale

    Software at Scale 60 - Data Platforms with Aravind Suresh

    Aug 5, 2024 · 34:51


    Aravind was a Staff Software Engineer at Uber, and currently works at OpenAI.

    Apple Podcasts | Spotify | Google Podcasts

    Edited Transcript

    Can you tell us about the scale of data Uber was dealing with when you joined in 2018, and how it evolved?

    When I joined Uber in mid-2018, we were handling a few petabytes of data. The company was going through a significant scaling journey, both in terms of launching in new cities and the corresponding increase in data volume. By the time I left, our data had grown to over an exabyte. To put it in perspective, the amount of data grew by a factor of about 20 in just a three to four-year period.

    Currently, Uber ingests roughly a petabyte of data daily. This includes some replication, but it's still an enormous amount. About 60-70% of this is raw data, coming directly from online systems or message buses. The rest is derived data sets and model data sets built on top of the raw data.

    That's an incredible amount of data. What kinds of insights and decisions does this enable for Uber?

    This scale of data enables a wide range of complex analytics and data-driven decisions. For instance, we can analyze how many concurrent trips we're handling throughout the year globally. This is crucial for determining how many workers and CPUs we need running at any given time to serve trips worldwide.

    We can also identify trends like the fastest-growing cities or seasonal patterns in traffic. The vast amount of historical data allows us to make more accurate predictions and spot long-term trends that might not be visible in shorter time frames.

    Another key use is identifying anomalous user patterns. For example, we can detect potentially fraudulent activities, like a single user account logging in from multiple locations across the globe. We can also analyze user behavior patterns, such as which cities have higher rates of trip cancellations compared to completed trips.

    These insights don't just inform day-to-day operations; they can lead to key product decisions. For instance, by plotting heat maps of trip coordinates over a year, we could see overlapping patterns that eventually led to the concept of Uber Pool.
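
    To make the heat-map idea concrete, here is a minimal sketch of the underlying aggregation: bucket trip coordinates into grid cells and count trips per cell. The grid size, data shape, and names are invented for illustration; Uber's actual pipeline would run as a distributed batch job over historical trip data.

```python
# Hypothetical illustration; not Uber's code.
from collections import Counter

GRID_DEGREES = 0.01  # roughly 1 km cells near the equator

def heat_map(trips):
    """trips: iterable of (latitude, longitude) pairs."""
    cells = Counter()
    for lat, lng in trips:
        # Snap each coordinate to its containing grid cell.
        cells[(round(lat / GRID_DEGREES), round(lng / GRID_DEGREES))] += 1
    return cells

# Heavily overlapping cells across many trips hint at shareable routes,
# the kind of signal that motivated Uber Pool.
demo = heat_map([(37.7749, -122.4194), (37.7751, -122.4189)])
print(demo.most_common(5))
```
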
    How does Uber manage real-time versus batch data processing, and what are the trade-offs?

    We use both offline (batch) and online (real-time) data processing systems, each optimized for different use cases. For real-time analytics, we use tools like Apache Pinot. These systems are optimized for low latency and quick response times, which is crucial for certain applications.

    For example, our restaurant manager system uses Pinot to provide near-real-time insights. Data flows from the serving stack to Kafka, then to Pinot, where it can be queried quickly. This allows for rapid decision-making based on very recent data.

    On the other hand, our offline flow uses the Hadoop stack for batch processing. This is where we store and process the bulk of our historical data. It's optimized for throughput: processing large amounts of data over time.

    The trade-off is that real-time systems are generally 10 to 100 times more expensive than batch systems. They require careful tuning of indexes and partitioning to work efficiently. However, they enable us to answer queries in milliseconds or seconds, whereas batch jobs might take minutes or hours.

    The choice between batch and real-time depends on the specific use case. We always ask ourselves: does this really need to be real-time, or can it be done in batch? The answer to this question goes a long way in deciding which approach to use and in building maintainable systems.

    What challenges come with maintaining such large-scale data systems, especially as they mature?

    As data systems mature, we face a range of challenges beyond just handling the growing volume of data. One major challenge is the need for additional tools and systems to manage the complexity.

    For instance, we needed to build tools for data discovery. When you have thousands of tables and hundreds of users, you need a way for people to find the right data for their needs. We built a tool called Data Book at Uber to solve this problem.

    Governance and compliance are also huge challenges. When you're dealing with sensitive customer data, you need robust systems to enforce data retention policies and handle data deletion requests. This is particularly challenging in a distributed system where data might be replicated across multiple tables and derived data sets.

    We built an in-house lineage system to track which workloads derive from what data. This is crucial for tasks like deleting specific data across the entire system. It's not just about deleting from one table; you need to track down and update all derived data sets as well.

    Data deletion itself is a complex process. Because most files in the batch world are kept immutable for efficiency, deleting data often means rewriting entire files. We have to batch these operations and perform them carefully to maintain system performance.

    Cost optimization is an ongoing challenge. We're constantly looking for ways to make our systems more efficient, whether that's by optimizing our storage formats, improving our query performance, or finding better ways to manage our compute resources.

    How do you see the future of data infrastructure evolving, especially with recent AI advancements?

    The rise of AI, and particularly generative AI, is opening up new dimensions in data infrastructure. One area we're seeing a lot of activity in is vector databases and semantic search capabilities. Traditional keyword-based search is being supplemented or replaced by embedding-based semantic search, which requires new types of databases and indexing strategies.

    We're also seeing increased demand for real-time processing. As AI models become more integrated into production systems, there's a need to handle more GPUs in the serving flow, which presents its own set of challenges.

    Another interesting trend is the convergence of traditional data analytics with AI workloads. We're starting to see use cases where people want to perform complex queries that involve both structured data analytics and AI model inference.

    Overall, I think we're moving towards more integrated, real-time, and AI-aware data infrastructure. The challenge will be balancing the need for advanced capabilities with concerns around cost, efficiency, and maintainability.
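
    Looping back to the deletion discussion above (immutable batch files mean a delete rewrites whole files), here is a minimal, hypothetical sketch of the idea over a JSON-lines file. Real systems operate on columnar formats at far larger scale, batch rewrites across many deletion requests, and use lineage to find every derived copy.

```python
# Hypothetical illustration of deletion-by-rewrite; file format invented.
import json
import os

def delete_user_rows(path, user_id):
    tmp = path + ".rewrite"
    with open(path) as src, open(tmp, "w") as dst:
        for line in src:
            if json.loads(line).get("user_id") != user_id:
                dst.write(line)  # keep every row that isn't being deleted
    os.replace(tmp, path)  # atomically swap in the rewritten file
```
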

    Software at Scale 59 - Incident Management with Nora Jones

    Jul 5, 2023 · 44:06


    Nora is the CEO and co-founder of Jeli, an incident management platform.

    Apple Podcasts | Spotify | Google Podcasts

    Nora provides an in-depth look into incident management within the software industry and discusses the incident management platform Jeli.

    Nora's fascination with risk and its influence on human behavior stems from her early career in hardware and her involvement with a home security company. These experiences revealed the high stakes associated with software failures, uncovering the importance of learning from incidents and fostering a blame-aware culture that prioritizes continuous improvement. In contrast to the traditional blameless approach, which seeks to eliminate blame entirely, a blame-aware culture acknowledges that mistakes happen and focuses on learning from them instead of assigning blame. This approach encourages open discussions about incidents, creating a sense of safety and driving superior long-term outcomes.

    We also discuss chaos engineering: the practice of deliberately creating turbulent conditions in production to simulate real-world scenarios. This approach allows teams to experiment and acquire the skills needed to respond effectively to incidents.

    Nora then introduces Jeli, an incident management platform that places a high priority on the human aspects of incidents. Unlike platforms that concentrate solely on technology, Jeli aims to bridge the gap between technology and people. By emphasizing coordination, communication, and learning, Jeli helps organizations reduce incident costs and cultivate a healthier incident management culture.

    We discuss how customer expectations in the software industry have evolved over time, with users becoming increasingly intolerant of low reliability, particularly in critical services (Dan Luu has an incredible blog on the incidence of bugs in day-to-day software). This shift has compelled organizations to place greater importance on reliability and invest in incident management practices. We conclude by discussing how incident management will further evolve and how leaders can set their organizations up for success.

    Software at Scale 58 - Measuring Developer Productivity with Abi Noda

    Jun 13, 2023 · 49:29


    Abi Noda is the CEO and co-founder of DX, a developer productivity platform.

    Apple Podcasts | Spotify | Google Podcasts

    My view on developer experience and productivity measurement aligns extremely closely with DX's view. The productivity of a group of engineers cannot be measured by tools alone; there are too many qualitative factors, like cross-functional stakeholder bureaucracy or inefficiency, and inherent domain/codebase complexity, that tools cannot capture. At the same time, some metrics, like whether an engineer has committed any code changes in their first week or month, serve as useful guardrails for engineering leadership. A combination of tools and metrics may provide a holistic view of, and insights into, the engineering organization's throughput.

    In this episode, we discuss the DX platform and Abi's recently published research paper on developer experience. We talk about how organizations can use tools and surveys to iterate and improve upon developer experience, and ultimately, engineering throughput.

    GPT-4 generated summary

    In this episode, Abi Noda and I explore the landscape of engineering metrics and a quantifiable approach towards developer experience. Our discussion ranges from the value of developer surveys and system-based metrics to the tangible ways in which DX is innovating the field.

    We begin with a comparison of developer surveys and system-based metrics. Abi explains that while developer surveys offer a qualitative perspective on tool efficacy and user sentiment, system-based metrics present a quantitative analysis of productivity and code quality.

    The discussion then moves to the real-world applications of these metrics, with Pfizer and eBay as case studies. Pfizer employs metrics for a detailed understanding of developer needs, subsequently driving strategic decision-making; they have used these metrics to identify bottlenecks in their development cycle and strategically address those pain points. eBay, on the other hand, uses insights from developer sentiment surveys to design tools that directly enhance developer satisfaction and productivity.

    Next, our dialogue around survey development centers on the dilemma between standardization and customization. While standardization offers cost efficiency and benchmarking opportunities, customization acknowledges the unique nature of every organization. Abi proposes a blend of both to cater to different aspects of developer sentiment and productivity metrics.

    The highlight of the conversation is the introduction of DX's data platform. The platform consolidates data across internal and third-party tools in a ready-to-analyze format, giving users the freedom to build their own queries, reports, and metrics. The ability to combine survey and system data allows the unearthing of unique insights, marking a distinctive advantage of DX's approach.

    Abi shares enlightening perspectives on engineering metrics and the role they play in shaping the developer experience, and we delve into how DX's approach to data aggregation can lead organizations toward more data-driven and effective decision-making.
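
    As a concrete illustration of the guardrail metric mentioned at the top (did an engineer commit any code change in their first week?), here is a minimal sketch; the data shape and the seven-day window are hypothetical.

```python
# Hypothetical illustration of an onboarding guardrail metric.
from datetime import datetime, timedelta

def committed_in_first_week(start_date, commit_dates):
    """True if any commit landed within seven days of the start date."""
    window_end = start_date + timedelta(days=7)
    return any(start_date <= c < window_end for c in commit_dates)

start = datetime(2023, 6, 1)
commits = [datetime(2023, 6, 5), datetime(2023, 6, 20)]
print(committed_in_first_week(start, commits))  # True
```
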

    Software at Scale 57 - Scalable Frontends with Robert Cooke

    May 16, 2023 · 55:42


    Robert Cooke is the CTO and co-founder of 3Forge, a real-time data visualization platform.

    Apple Podcasts | Spotify | Google Podcasts

    In this episode, we delve into Wall Street's high-frequency trading evolution and the importance of high-volume trading data observability. We examine traditional software observability tools, such as Datadog, and contrast them with 3Forge's financial observability platform, AMI.

    GPT-4 generated summary

    In this episode of the Software at Scale podcast, Robert Cooke, CTO and co-founder of 3Forge, a comprehensive internal tools platform, shares his journey and insights. He outlines his career trajectory, which includes prominent positions such as Infrastructure Lead at Bear Stearns and Head of Infrastructure at Liquidnet, and his work on high-frequency trading systems that employ software and hardware to perform rapid, automated trading decisions based on market data.

    Cooke explains how 3Forge empowers subject matter experts to automate trading decisions by encoding business logic. He underscores the criticality of robust monitoring around these automated trading systems, drawing an analogy with nuclear reactors due to the potentially catastrophic repercussions of any malfunction.

    The dialogue then shifts to the impact of significant events, like the COVID-19 pandemic, on high-frequency trading systems. Cooke suggests that these systems can falter under such conditions, as they are designed to follow developer-encoded instructions and lack the flexibility to adjust to unforeseen macro events. He points to past incidents like the Facebook IPO and Knight Capital's downfall, where automated trading systems were unable to handle atypical market conditions, highlighting the necessity for human intervention in such scenarios.

    Cooke then delves into how 3Forge designs software for mission-critical scenarios, drawing an analogy with military strategy. Using the OODA loop concept (Observe, Orient, Decide, and Act), operators can swiftly respond to situations like outages. He argues that traditional observability tools only address the first step, whereas 3Forge's solution facilitates quick orientation and decision-making, substantially reducing reaction time.

    He cites a scenario involving a sudden surge in Facebook orders, where the tool allows operators to detect the problem in real time, comprehend the context, decide on the response, and promptly act on it. He extends this example to situations like government incidents or emergencies, where an expedited response is paramount.

    Additionally, Cooke emphasizes the significance of low-latency UI updates in their tool. He explains that the software uses an online programming approach, reacting to changes in real time and updating only the altered components. As data sizes increase and reaction time becomes more critical, this becomes increasingly important.

    Cooke concludes this segment by discussing the evolution of clients' use cases, from initially needing static data overviews to progressively demanding real-time information and interactive workflows. He gives the example of a user commenting on a chart and that comment being immediately visible to others, akin to the real-time collaboration features in tools like Google Docs.
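
    The fine-grained update model described above can be sketched as a simple subscription store: each component subscribes to the data keys it renders, and a change notifies only those components rather than re-rendering everything. All names here are hypothetical; this illustrates the general pattern, not 3Forge's implementation.

```python
# Hypothetical illustration of fine-grained, subscription-based updates.
class Store:
    def __init__(self):
        self.data, self.subs = {}, {}

    def subscribe(self, key, component):
        self.subs.setdefault(key, []).append(component)

    def set(self, key, value):
        if self.data.get(key) != value:     # skip no-op writes entirely
            self.data[key] = value
            for c in self.subs.get(key, []):
                c.render(value)             # only altered components update

class Label:
    def __init__(self, name):
        self.name = name

    def render(self, value):
        print(f"{self.name} -> {value}")

store = Store()
store.subscribe("orders/FB", Label("fb_orders_chart"))
store.set("orders/FB", 10_000)  # updates just the subscribed chart
```
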
    In the subsequent segment, Cooke shares his perspective on choosing the right technology to drive business decisions. He stresses the importance of understanding the history and trends of technology, having experienced several shifts in the industry since his early software-writing days in the 1980s. He projects that while computer speeds might plateau, parallel computing will proliferate, leading to CPUs with more cores. He also predicts continued growth in memory, both RAM and disk space.

    He further explains his preference for web-based applications due to their security and absence of installation requirements. He underscores the necessity of minimizing the data held in the web browser and shares how 3Forge built every component from scratch to achieve this. Their components are designed to handle as much data as possible, constantly pulling in data based on user interaction.

    He also emphasizes the importance of a high-performing component library that integrates seamlessly and provides a consistent user experience. Developers often face confusion when required to combine disparate components, since those components tend to behave differently. He envisions a future where software development involves no JavaScript or HTML, a concept he acknowledges may be unsettling to some developers.

    Using the example of a dropdown menu, Cooke explains how a component initially designed for a small amount of data might eventually need to handle much larger data sets. He emphasizes the need to design components to handle the maximum possible data from the outset to avoid such issues.

    The conversation then pivots to the concept of over-engineering. Cooke argues that building a robust and universal solution from the start is not over-engineering but an efficient approach. He notes the significant overlap in applications' use cases, which makes it advantageous to create components that cater to a wide variety of needs.

    In response to the host's query about selling software to Wall Street, Cooke advocates targeting the most demanding customers first. He believes that if a product can satisfy such customers, it's easier to sell to everyone else. It's challenging to start with a simple product and scale it up for complex use cases, but feasible to start with a complex product and tailor it for simpler ones.

    Cooke further describes their process of creating a software product. Their strategy was to focus on core components, making them as efficient and effective as possible. This involved spending years on foundational elements like string libraries and data marshalling. With a robust foundation established, they could layer on additional features and enhancements, eventually producing a mature and capable product.

    He also underscores the inevitability of users pushing software to its limits, regardless of its optimization, and thus argues for creating software that is as fast as possible right from the start. He refers to an interview with Steve Jobs, who argued that the best developers can create software that's substantially faster than others'. Cooke's team continually seeks ways to refine and improve the efficiency of their platform.

    Next, the discussion shifts to team composition and the attributes needed in software engineers. Cooke emphasizes the importance of a strong work ethic and a passion for crafting good software.

    He explains how his ambition, from a young age, to become the best software developer has shaped his company's culture, fostering a virtuous cycle of hard work and dedication among his team.

    The host then emphasizes the importance of engineers working on high-quality products, suggesting that problems and bugs can sap energy and demotivate a team. Cooke concurs, comparing the experience of working on high-quality software to working on an F1 race car: the pursuit of refinement and optimization is a dream for engineers.

    The conversation then turns to the importance of having a team with diverse thought processes and skill sets. Cooke recounts how the introduction of different disciplines and perspectives in 2019 profoundly transformed his company.

    The dialogue then transitions to the state of software solutions before 3Forge's arrival, touching on the compartmentalized nature of systems in large corporations and the problems that arise from it. Cooke explains how their solution offers a more comprehensive and holistic overview that cuts across different risk categories.

    Finally, in response to the host's question about open-source systems, Cooke expresses reservations about the use of open-source software in a corporate setting. He acknowledges the extensive overlap and redundancy among the many new systems being developed; although he does not identify any specific groundbreaking technology, he believes the rapid proliferation of similar technologies might lead to considerable technical debt in the future.

    Host Utsav wraps up the conversation by asking Cooke about his expectations and concerns for the future of technology and the industry. Cooke voices his concern about the continually growing number of different systems and technologies that companies adopt, which makes integrating and orchestrating all these components a challenge. He advises companies to exercise caution when adopting multiple technologies simultaneously.

    However, Cooke also expresses enthusiasm about the future of 3Forge, a platform he has devoted a decade of his life to developing. He expresses confidence in the unique approach and discipline employed in building the platform, and he is optimistic about the company's growth, its marketing efforts, and its focus on fostering a developer community. He believes the platform will thrive as developers share their experiences and the product gains momentum.

    Utsav acknowledges the excitement and potential challenges that lie ahead, especially in managing community-driven systems. They conclude by inviting Cooke to return for a future discussion to review how the topic has progressed and evolved. Both express their appreciation for the fruitful discussion before ending the podcast.

    Software at Scale 56 - SaaS cost with Roi Rav-Hon

    Apr 17, 2023 · 28:29


    Roi Rav-Hon is the co-founder and CEO of Finout, a SaaS cost management platform.

    Apple Podcasts | Spotify | Google Podcasts

    In this episode, we review the challenge of maintaining reasonable SaaS costs for tech companies. Usage-based pricing models for infrastructure lead to a gradual ramp-up of costs, and keeping them in check has repeatedly, and sneakily, become a priority in my career as an infrastructure/platform engineer. So I'm particularly interested in how engineering teams can better understand, track, and "shift left" infrastructure cost tracking and prevent regressions.

    We specifically go over Kubernetes cost management, and why costs need to be attributable to the most specific teams possible in order for cost management to be self-governing in an organization.
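
    As a sketch of that attribution idea, rolling raw usage records up to owning teams might look like the following. The record shape and label scheme are hypothetical; real tooling would derive them from billing data and Kubernetes labels.

```python
# Hypothetical illustration of per-team cost attribution.
from collections import defaultdict

records = [
    {"pod": "checkout-7d9f", "team": "payments", "cost_cents": 1240},
    {"pod": "search-1a2b", "team": "discovery", "cost_cents": 3310},
    {"pod": "checkout-x2a4", "team": "payments", "cost_cents": 975},
]

by_team = defaultdict(int)
for r in records:
    by_team[r["team"]] += r["cost_cents"]

# Attributing spend at team granularity lets each team watch for its own
# regressions rather than relying on a central bill.
print(dict(by_team))  # {'payments': 2215, 'discovery': 3310}
```
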

    Software at Scale 55 - Troubleshooting and Operating K8s with Ben Ofiri

    Mar 15, 2023 · 44:11


    Ben Ofiri is the CEO and co-founder of Komodor, a Kubernetes troubleshooting platform.

    Apple Podcasts | Spotify | Google Podcasts

    We had an episode with the other founder of Komodor, Itiel, in 2021, and I thought it would be fun to revisit the topic.

    Highlights (ChatGPT generated)

    [0:00] Introduction to the Software at Scale podcast and the guest, Ben Ofiri, CEO and co-founder of Komodor.
    - Why Ben decided to work on a Kubernetes platform, and the potential impact of Kubernetes becoming the standard for managing microservices.
    - Reasons why companies are interested in adopting Kubernetes, including the ability to scale quickly and cost-effectively, and the enterprise-ready features it offers.
    - The different ways companies migrate to Kubernetes: either starting from a small team and gradually increasing usage, or via a strategic decision from the top down.
    - Kubernetes' flexibility is its strength, but it also comes with complexity that can lead to increased time spent on alerts and managing incidents.
    - The learning curve for developers to efficiently troubleshoot and operate Kubernetes can be steep, and is a concern for many organizations.

    [8:17] Tools for managing Kubernetes.
    - The challenges that arise when trying to operate and manage Kubernetes.
    - DevOps and SRE teams become the bottleneck due to their expertise in managing Kubernetes, leading to frustration for other teams.
    - A report by a cloud-native observability organization found that one out of five developers felt frustrated enough to want to quit their job due to friction between teams.
    - Ben's idea for Komodor was to take the knowledge and expertise of the DevOps and SRE teams and democratize it to the entire organization.
    - The platform simplifies the operation, management, and troubleshooting of Kubernetes for every engineer in the company, from junior developers to the head of engineering.
    - One of the most frustrating issues for customers is identifying which teams should care about which issues in Kubernetes. Komodor helps solve this with automated checks and reports that indicate whether a problem is an infrastructure or an application issue, among other things.
    - Komodor provides suggestions for actions to take, but leaves the decision-making, and the responsibility for taking the action, to the users.
    - The platform allows users to track how many times they take an action and how useful it is, allowing for optimization over time.

    [12:03] The challenge of balancing standardization and flexibility.
    - Kubernetes provides a lot of flexibility, but this can lead to fragmented infrastructure and inconsistent usage patterns.
    - Komodor aims to strike a balance between standardization and flexibility, allowing best practices and guidelines to be established while still allowing for customization and unique needs.

    [16:14] Using data to improve Kubernetes management.
    - The platform tracks user actions and their effectiveness to make suggestions and fine-tune recommendations over time.
    - The goal is to build a machine that knows what actions to take for almost all scenarios in Kubernetes, providing maximum benefit to customers.

    [20:40] Why Kubernetes doesn't include all management functionality.
    - Kubernetes is an open-source project with many different directions it can go in terms of adding functionality.
    - Reliability, observability, and operational functionality are typically provided by vendors or cloud providers rather than organically by the Kubernetes community.
    - Different players in the ecosystem contribute different pieces to create a comprehensive experience for the end user.

    [25:05] Keeping up with Kubernetes development and adoption.
    - How Komodor keeps up with Kubernetes development and adoption.
    - The team is data-driven and closely tracks user feedback and needs, as well as new developments and changes in the ecosystem.
    - The use and adoption of custom resources is a constantly and rapidly evolving area, requiring quick research and translation into product specs.
    - The company hires deeply technical people, including those with DevOps and SRE backgrounds, to ensure a deep understanding of the complex problem they are trying to solve.

    [32:12] The effects of the economy on Komodor.
    - Companies must be more cost-efficient, leading to increased interest in Kubernetes and tools like Komodor.
    - The pandemic has also highlighted the need for remote work and cloud-based infrastructure, further fueling demand.
    - Komodor has seen growth as a result of these factors and believes it is well positioned for continued success.

    [36:17] The future of Kubernetes and Komodor.
    - Kubernetes will continue to evolve and be adopted more widely by organizations of all sizes and industries.
    - The team is excited about the potential of rule engines and other tools to improve management and automation within Kubernetes.

    Software at Scale 54 - Community Trust with Vikas Agarwal

    Feb 1, 2023 · 40:48


    Vikas Agarwal is an engineering leader with over twenty years of experience leading engineering teams. We focused this episode on his experience as the Head of Community Trust at Amazon and dealing with the various challenges of fake reviews on Amazon products.

    Apple Podcasts | Spotify | Google Podcasts

    Highlights (GPT-3 generated)

    [0:00:17] Vikas Agarwal's origin story.
    [0:00:52] How Vikas learned to code.
    [0:03:24] Vikas's first job out of college.
    [0:04:30] Vikas's experience with the review business and community trust.
    [0:06:10] Mission of the community trust team.
    [0:07:14] How to start off with a problem.
    [0:09:30] Different flavors of review abuse.
    [0:10:15] The program for gift cards and fake reviews.
    [0:12:10] Google search and FinTech.
    [0:14:00] Fraud and ML models.
    [0:15:51] Other things to consider when it comes to trust.
    [0:17:42] Ryan Reynolds' funny review on his product.
    [0:18:10] Reddit-like problems.
    [0:21:03] Activism filters.
    [0:23:03] Elon Musk's changing policy.
    [0:23:59] False positives and the appeals process.
    [0:28:29] Stress levels and question-mark emails from Jeff Bezos.
    [0:30:32] Jeff Bezos' mathematical skills.
    [0:31:45] Amazon's closed-loop auditing process.
    [0:32:24] Amazon's success and leadership principles.
    [0:33:35] Operationalizing appeals at scale.
    [0:35:45] Data science, metrics, and hackathons.
    [0:37:14] Developer experience and iterating changes.
    [0:37:52] Advice for tackling a problem of this scale.
    [0:39:19] Striving for trust and external validation.
    [0:40:01] Amazon's efforts to combat abuse.
    [0:40:32] Conclusion.

    Software at Scale 53 - Testing Culture with Mike Bland

    Dec 28, 2022 · 66:52


    Mike Bland is a software instigator: he helped drive adoption of automated testing at Google and the Quality Culture Initiative at Apple.

    Apple Podcasts | Spotify | Google Podcasts

    Mike's blog was instrumental in my decision to pick a job in developer productivity/platform engineering. We talk about the Rainbow of Death, the idea of driving cultural change in large engineering organizations, which is one of the key challenges of platform engineering teams. And we deep dive into the value of, and common pushbacks against, automated testing.

    Highlights (GPT-3 generated)

    [0:00 - 0:29] Welcome
    [0:29 - 0:38] Explanation of the Rainbow of Death
    [0:38 - 0:52] Story of the Testing Grouplet at Google
    [0:52 - 5:52] Benefits of writing blogs and engineering culture change
    [5:52 - 6:48] Impact of Mike's blog
    [6:48 - 7:45] Automated testing at scale
    [7:45 - 8:10] "I'm a Snowflake" mentality
    [8:10 - 8:59] Instigator Theory and the Crossing the Chasm model
    [8:59 - 9:55] Discussion of dependency injection and functional decomposition
    [9:55 - 16:19] Discussion of testing and testable code
    [16:19 - 24:30] Impact of organizational and cultural change on writing tests
    [24:30 - 26:04] Instigator Theory
    [26:04 - 32:47] Strategies for leaders to foster and support testing
    [32:47 - 38:50] Role of leadership in promoting testing
    [38:50 - 43:29] Philosophical implications of testing practices

    Software at Scale 52 - Building Build Systems with Benjy Weinberger

    Nov 17, 2022 · 62:57


    Benjy Weinberger is the co-founder of Toolchain, a build tool platform. He is one of the creators of the original Pants, an in-house Twitter build system focused on Scala, and was the VP of Infrastructure at Foursquare. Toolchain now focuses on Pants 2, a revamped build system.

    Apple Podcasts | Spotify | Google Podcasts

    In this episode, we go back to basics and discuss the technical details of scalable build systems like Pants, Bazel, and Buck. A common challenge with these build systems is that they are extremely hard to migrate to and to make interoperate with open source tools that are built differently. Benjy's team redesigned Pants with an initial hyper-focus on Python to fix these shortcomings, in an attempt to create a third generation of build tools: one that easily interoperates with differently built packages while staying fast and scalable.

    Machine-generated Transcript

    [0:00] Hey, welcome to another episode of the Software at Scale podcast. Joining me here today is Benjy Weinberger, previously a software engineer at Google and Twitter, VP of Infrastructure at Foursquare, and now the founder and CEO of Toolchain. Thank you for joining us.

    Thanks for having me. It's great to be here.

    Yes. Right from the beginning, I saw that you worked at Google in 2002, which is forever ago, like 20 years ago at this point. What was that experience like? What kind of change did you see as you worked there for a few years?

    [0:37] As you can imagine, it was absolutely fascinating. And I should mention that I was at Google from 2002, but that was not my first job. I have been a software engineer for over 25 years, and so there were five years before that where I worked at a couple of companies. I was living in Israel at the time, so my first job out of college was at Check Point, which was a big, successful network security company. And then I worked for a small startup. And then I moved to California and started working at Google.

    And so I had the experience that I think many people had in those days, and many people still do: the work you're doing is fascinating, but the tools you're given to do it with as a software engineer are not great. I'd had five years of experience of struggling with builds being slow, builds being flaky, with everything requiring a lot of effort. There was almost a hazing-ritual quality to it. Like, what makes you a great software engineer is struggling through the mud and through the quicksand with this awful, substandard tooling. We are not users, we are not people for whom products are meant, right? We make products for other people.

    Then I got to Google. [2:03] And Google, when I joined, was actually struggling with a very massive, very slow makefile that took forever to parse, let alone run. But the difference, which I had not seen anywhere else, was that Google paid a lot of attention to this problem and devoted a lot of resources to solving it. Google was the first place I'd worked, and I think in many ways is still the gold standard, where developers are first-class participants in the business who deserve the best products and the best tools, and if there's nothing out there for them to use, we will build it in house and put a lot of energy into that.

    And so for me, specifically as an engineer, [2:53] a big part of watching that growth through the early to late 2000s was the growth of engineering process and best practices and the tools to enforce them. The thing I personally am passionate about is build and CI, but I'm also talking about code review tools, all the tooling around source code management and revision control, and everything to do with engineering process. It really was an object lesson, very, very fascinating, and it really inspired a big chunk of the rest of my career.

    I've heard all sorts of things, like Python scripts that had to generate makefiles, and finally they moved the Python into your first version of Blaze. So it's a fascinating history. [3:48] Maybe can you tell us one example of something that was paradigm-changing that you saw, something that created an order-of-magnitude difference in your experience there, and maybe your first aha moment on how good developer tools can be?

    [4:09] Sure. I had been used to using make basically up till that point. And Google, again, as you mentioned, was using make and really squeezing everything it was possible to squeeze out of that lemon and then some. [4:25] But with the very early versions of what became Blaze, which was that big internal build system that inspired Bazel, the open source variant of it today, one thing that really struck me was the integration with the revision control system, which was, and I think still is, Perforce.

    I imagine many listeners are very familiar with Git. Perforce is very different. I can only partly remember all of its intricacies, because it's been so long since I've used it. But one interesting aspect was that you could do partial checkouts. It really was designed for giant code bases. There was this concept of partial checkouts where you could check out just the bits of the code that you needed. But of course, then the question is, how do you know what those bits are? And the build system knows, because the build system knows about dependencies. And so there was this integration, this back and forth between the [5:32] Perforce client and the build system, that was very creative and very effective. It allowed you to have locally on your machine only the code that you actually needed to work on the piece of the codebase you were working on: basically the files you cared about and all of their transitive dependencies. That to me was a very creative solution to a problem, one that involved some lateral thinking about how seemingly completely unrelated parts of the toolchain could interact. And that made me realize, oh, there's a lot of creative thought at work here, and I love it.

    [6:17] Yeah, no, I think that makes sense. I interned there way back in 2016, and I was just fascinated by, I remember by mistake, I ran a grep across the code base and it just took forever. And that's when I realized, you know, none of this stuff is local. First of all, half the source code is not even checked out to my machine, and my poor grep command is trying to check that out. But also how seamlessly it would work most of the time behind the scenes. Did you have any experience or did you start working on developer tools then? Or is that just what inspired you towards thinking about developer tools?

    I did not work on the developer tools at Google. I worked on ads and search and sort of Google products, but I was a big user of the developer tools, with one exception, which was that I made some contributions to the [7:21] protocol buffer compiler, which I think many people may be familiar with, and which is a very deep part of the toolchain there, very integrated into everything. And so that gave me some experience with what it's like to hack on a tool that every engineer is using and that is a very deep part of their workflow.

    But it wasn't until after Google, when I went to Twitter, [7:56] that I noticed that in my time at Google the rest of the industry had not caught up. Suddenly I was sort of thrust ten years into the past and was back to using very slow, very clunky, flaky tools that were not designed for the tasks we were trying to use them for. And so that made me realize, wait a minute, I spent eight years using these great tools. They don't exist outside of these giant companies. I mean, I sort of assumed that maybe Microsoft and Amazon and some other giants probably have similar internal tools, but there's nothing out there for everyone else. And so that's when I started hacking on that problem more directly, at Twitter together with John, who is now my co-founder at Toolchain, and who was actually ahead of me and ahead of the game at Twitter and had already begun working on some solutions. I joined him in that.

    Could you maybe describe some of the problems you ran into? Were the builds just taking forever, or was there something else?

    [9:09] So there were... [9:13] A big part of the problem was that the codebase John and I were interested in at Twitter at the time was using Scala. Scala is a fascinating, very rich language. [9:30] Its compiler is very slow. And we were in a situation where you'd make some small change to a file and then builds would take 10 minutes, 20 minutes, 40 minutes. The iteration time on your desktop was incredibly slow. And then CI times, where there was CI in place, were also incredibly slow because of this huge amount of repetitive or near-repetitive work. And this is because the build tools, etc., were pretty naive about understanding what work actually needs to be done given a set of changes.

    There's been a ton of work specifically on sbt since then. [10:22] It has incremental compilation and things like that, but nonetheless, that still doesn't really scale well to large corporate codebases, the kind people often refer to as monorepos. If you don't want to fragment your codebase, with all of the immense problems that that brings, you end up needing tooling that can handle that situation. Some of the biggest challenges are: how do I do less than recompile the entire codebase every time? How can tooling help me be smart about the correct minimal amount of work to do [11:05] to make compiling and testing as fast as they can be?

    [11:12] And I should mention that I dabbled in this problem at Twitter with John. It was when I went to Foursquare that I really got into it, because Foursquare similarly had this big Scala codebase with a very similar problem of incredibly slow builds. [11:29] The interim solution there was to just upgrade everybody's laptops with more RAM and try to brute-force the problem.
It was very obvious to everyone there, tons of,force-creation pattern still has lots of very, very smart engineers.And it was very obvious to them that this was not a permanent solution and we were casting around for...[11:54] You know what can be smart about scala builds and i remember this thing that i had hacked on twitter and. I reached out to twitter and ask them to open source it so we could use it and collaborate on it wasn't obviously some secret sauce and that is how the very first version of the pants open source build system came to be.I was very much designed around scarlet did eventually.Support other languages. And we hacked on it a lot at Foursquare to get it to...[12:32] To get the codebase into a state where we could build it sensibly. So the one big challenge is build speed, build performance.The other big one is managing dependencies, keeping your codebase sane as it scales.Everything to do with How can I audit internal dependencies?How do I make sure that it is very, very easy to accidentally create all sorts of dependency tangles and cycles and create a code base whose dependency structure is unintelligible, really,hard to work with and actually impacts performance negatively, right?If you have a big tangle of dependencies, you're more likely to invalidate a large chunk of your code base with a small change.And so tooling that allows you to reason about the dependencies in your code base and.[13:24] Make it more tractable was the other big problem that we were trying to solve. Mm-hmm. No, I think that makes sense.I'm guessing you already have a good understanding of other build systems like Bazel and Buck.Maybe could you walk us through what are the difference for PANs, Veevan? What is the major design differences? And even maybe before that, like, how was Pants designed?And is it something similar to like creating a dependency graph? You need to explicitly include your dependencies.Is there something else that's going on?[14:07] Maybe just a primer. Yeah. Absolutely. So I should mention, I was careful to mention, you mentioned Pants V1.The version of Pants that we use today and base our entire technology stack around is what we very unimaginatively call Pants V2, which we launched two years ago almost to the day.That is radically different from Pants V1, from Buck, from Bazel. It is quite a departure in ways that we can talk about later.One thing that I would say Panacea V1 and Buck and Bazel have in common is that they were designed around the use cases of a single organization. is a.[14:56] Open source variant or inspired by blaze its design was very much inspired by. Here's how google does engineering and a buck similarly for facebook and pansy one frankly very similar for.[15:11] Twitter and we sort of because Foursquare also contributed a lot to it, we sort of nudged it in that direction quite a bit. 

    But it's still very much: if you did engineering in this one company's specific image, then this might be a good tool for you. You had to be very much in that lane. What these systems all look like, and the way they are different from much earlier systems, is [15:46] they're designed to work in large, scalable code bases that have many moving parts, share a lot of code, and build a lot of different deployables: different binaries or Docker images or AWS Lambdas or cloud functions or whatever it is you're deploying, Python distributions, JAR files, whatever it is you're building. Typically you have many of them in one code base. It could be lots of microservices, or just lots of different things that you're deploying. And they live in the same repo because you want that unity. You want to be able to share code easily; you don't want to introduce dependency hell problems in your own code. It's bad enough that we have dependency hell problems with third-party code.

    [16:34] And so these systems are all, if you squint at them from thirty thousand feet, very similar in that they make the problem of managing and building and testing and packaging in a code base like that much more tractable, and the way they do this is by applying information about the dependencies in your code base. The important ingredient is that these systems understand the relatively fine-grained dependencies in your code base, and they can use that information to reason about the work that needs to happen.

    So take "run all the tests in the repo, or in this part of the repo." A naive build system would literally just do that, and first it would compile all the code. [17:23] But a scalable build system like these would say: well, you've asked me to run these tests, but some of them have already been cached, and these others haven't, so I need to look at the ones I actually need to run. Then, what needs to be done before I can run them? These source files need to be compiled, but some of those are already in cache, and these other ones I need to compile. And I can apply concurrency, because there are multiple cores on this machine, and I know through dependency analysis which compile jobs can run concurrently and which cannot. Then, when it actually comes time to run the tests, I can apply that same concurrency logic.

    [18:03] So what these systems have in common is that they use dependency information to make your building, testing, and packaging more tractable in a large code base. They allow you to not have to do the thing that, unfortunately, many organizations find themselves doing, which is fragmenting the code base into lots of different bits and saying: every little team or sub-team works in its own code base, and they consume each other's code through third-party dependencies, in which case you are introducing a dependency versioning hell problem.
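
    A toy sketch of that dependency-aware behavior, with a hypothetical task graph and cache (real systems add file fingerprinting, concurrency, and remote caching on top of this idea):

```python
# Hypothetical illustration of dependency-aware, cache-skipping builds.
def build(target, deps, cache, run):
    """deps: target -> list of prerequisite targets."""
    if target in cache:              # already built: do no work at all
        return cache[target]
    for d in deps.get(target, []):
        build(d, deps, cache, run)   # build prerequisites first
    cache[target] = run(target)
    return cache[target]

deps = {"test_app": ["lib_a", "lib_b"], "lib_b": ["lib_a"]}
cache = {"lib_a": "ok"}              # lib_a unchanged since the last run
build("test_app", deps, cache, run=lambda t: f"built {t}")
# Only lib_b and test_app are rebuilt; lib_a comes from cache.
```
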
And I'm hoping that,it sounds like newer tools like Go, at least, they force you to not have circular dependencies and they force you to keep your code base clean so that it's easy to migrate to like a scalable build system.[19:33] Yes exactly so it's funny that is the exact observation that let us to pans to see to so they said pans to be one like base like buck was very much inspired by and developed for the needs of a single company and other companies were using it a little bit.But it also suffered from any of the problems you just mentioned with pans to for the first time by this time i left for square and i started to chain with the exact mission of every company every team of any size should have this kind of tooling should have this ability this revolutionary ability to make the code base is fast and tractable at any scale.And that made me realize.We have to design for that we have to design for not for. What a single company's code base looks like but we have to design.To support thousands of code bases of all sorts of different challenges and sizes and shapes and languages and frameworks so.We actually had to sit down and figure out what does it mean to make a tool.Like this assistant like this adoptable over and over again thousands of times you mentioned.[20:48] Correctly, that it is very hard to adopt one of those earlier tools because you have to first make your codebase conform to whatever it is that tool expects, and then you have to write huge amounts of manual metadata to describe all of the dependencies in your,the structure and dependencies of your codebase in these so-called build files.If anyone ever sees this written down, it's usually build with all capital letters, like it's yelling at you and that those files typically are huge and contain a huge amount of information your.[21:27] I'm describing your code base to the tool with pans be to eat very different approaches first of all we said this needs to handle code bases as they are so if you have circular dependencies it should handle them if you have. I'm going to handle them gracefully and automatically and if you have multiple conflicting external dependencies in different parts of your code base this is pretty common right like you need this version of whatever.Hadoop or NumPy or whatever it is in this part of the code base, and you have a different conflicting version in this other part of the code base, it should be able to handle that.If you have all sorts of dependency tangles and criss-crossing and all sorts of things that are unpleasant, and better not to have, but you have them, the tool should handle that.It should help you remove them if you want to, but it should not let those get in the way of adopting it.It needs to handle real-world code bases. The second thing is it should not require you to write all this crazy amount of metadata.And so with Panzer V2, we leaned in very hard on dependency inference, which means you don't write these crazy build files.You write like very tiny ones that just sort of say, you know, here is some code in this language for the build tool to pay attention to.[22:44] But you don't have to edit the added dependencies to them and edit them every time you change dependencies.Instead, the system infers dependencies by static analysis. 
So it looks at your, and it does this at runtime.So you, you know, almost all your dependencies, 99% of the time, the dependencies are obvious from import statements.[23:05] And there are occasional and you can obviously customize this because sometimes there are runtime dependencies that have to be inferred from like a string. So from a json file or whatever is so there are various ways to customize this and of course you can always override it manually.If you have to be generally speaking ninety.Seven percent of the boilerplate that used to going to build files in those old systems including pans v1 no. You know not claiming we did not make the same choice but we goes away with pans v2 for exactly the reason that you mentioned these tools,because they were designed to be adopted once by a captive audience that has no choice in the matter.And it was designed for how that code base that adopting code base already is. is these tools are very hard to adopt.They are massive, sometimes multi-year projects outside of that organization. And we wanted to build something that you could adopt in days to weeks and would be very easy,to customize to your code base and would not require these massive wholesale changes or huge amounts of metadata.And I think we've achieved that. Yeah, I've always wondered like, why couldn't constructing the build file be a part of the build. In many ways, I know it's expensive to do that every time. So just like.[24:28] Parts of the build that are expensive, you cache it and then you redo it when things change.And it sounds like you've done exactly that with BANs V2.[24:37] We have done exactly that. The results are cached on a profile basis. So the very first time you run something, then dependency inference can take some time. And we are looking at ways to to speed that up.I mean, like no software system has ever done, right? Like it's extremely rare to declare something finished. So we are obviously always looking at ways to speed things up.But yeah, we have done exactly what you mentioned. We don't, I should mention, we don't generate the dependencies into build for, we don't edit build files and then you check them in.We do that a little bit. So I mentioned you do still with PANSTL V2, you need these little tiny build files that just say, here is some code.They typically can literally be one line sometimes, almost like a marker file just to say, here is some code for you to pay attention to.We're even working on getting rid of those.We do have a little script that generates those one time just to help you onboard.But...[25:41] The dependencies really are just generated a runtime as on demand as needed and used a runtime so we don't have this problem of. Trying to automatically add or edit a otherwise human authored file that is then checked in like this generating and checking in files is.Problematic in many ways, especially when those files also have to take human written edits.So we just do away with all of that and the dependency inference is at runtime, on demand, as needed, sort of lazily done, and the information is cached. So both cached in memory in the surpassed V2 has this daemon that runs and caches a huge amount of state in memory.And the results of running dependency inference are also cached on disk. So they survive a daemon restart, etc.I think that makes sense to me. My next question is going to be around why would I want to use panthv2 for a smaller code base, right? 
    I think that makes sense to me. My next question is going to be around why I would want to use Pants V2 for a smaller code base. Usually with a smaller codebase, I'm not running into a ton of problems around the build. [26:55] I guess, do you notice these inflection points that people run into, like "okay, my current build setup is not enough"? What's the smallest codebase that you've seen that you think could benefit? Or is it any codebase in the world, and I should start with a better build system rather than just Python setup.py or whatever?

    I think the dividing line is: will this code base ever be used for more than one thing? [27:24] So let's take the Python example. If literally all this code base will ever do is build this one distribution, and a top-level setup.py is all I need (you sometimes see this with open source projects), and the code base is going to remain relatively small, say it's only ever going to be a few thousand lines, and even if I run the tests from scratch every single time it takes under five minutes, then you're probably fine.

    But I think two things to look at are, first: am I going to be building multiple things in this code base in the future, or certainly, am I doing it now? That is much more common with corporate code bases. You have to ask yourself: okay, my team is growing, more and more people are cooperating on this code base. I want to be able to deploy multiple microservices, multiple cloud functions, multiple distributions or third-party artifacts, [28:41] multiple data science jobs, whatever it is you're building. If you ever think you might have more than one, now's the time to think about how to structure the code base and what tooling allows you to do this effectively.

    The other thing to look at is build times. If you're using compiled languages, then obviously compilation, and in all cases testing. If you can already see that tests are taking five minutes, 10 minutes, 15 minutes, 20 minutes, then surely you want some technology that allows you to speed that up through caching, through concurrency, through fine-grained invalidation (namely, don't even attempt work that isn't necessary for the result that was asked for). Then it's probably time to start thinking about tools like this, because the earlier you adopt them, the easier they are to adopt. Don't wait until you've got a tangle of multiple setup.pys in the repo and it's unclear how you manage them and keep their dependencies synchronized so there aren't version conflicts across these different projects. Specifically with Python, this is an interesting problem. With other languages, because of the compilation step in JVM languages or Go, you encounter the need for a build system of some kind much, much earlier, and then you ask yourself what kind. With Python, you can get by for a while just running flake8 and pytest directly, with everything all together in a single virtualenv.

    But the Python tooling, as mighty as it is, mostly is not designed for larger code bases that deploy multiple things and have multiple different sets of [30:52] internal and external dependencies. The tooling generally implicitly assumes one top-level setup.py or pyproject.toml configuring things. So if you're using Python, let's say for Django or Flask apps or for data science, and your code base is growing and you've hired a bunch of data scientists and there's more and more code going in there, then with Python you need to start thinking about what tooling allows you to scale this code base.

    No, I think I mostly resonate with that. The first question that comes to my mind is, let's talk specifically about the deployment problem. If you're deploying to multiple AWS Lambdas or cloud functions or whatever, the first thought that would come to my mind is that I can use separate Docker images that let me easily produce a container image I can ship independently. Would you say that's not enough? I totally get that for the build time problem a Docker image is not going to solve anything. But how about the deployment step?

    [32:02] So again, with deployments, I think there are two ways where a tool like this can really speed things up. One is: only build the things that actually need to be redeployed. Because the tool understands dependencies and can do change analysis, it can figure that out. So one of the things that Pants V2 does is it integrates with Git, and it natively understands how to figure out Git diffs. You can say something like: show me all the, whatever, Lambdas, let's say, that are affected by changes between these two branches. [32:46] And it can say: well, these files changed, and I understand the transitive dependencies of those files, so I can see what actually needs to be deployed. In many cases, many things will not need to be redeployed because they haven't changed.
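
    A toy sketch of that change analysis, with a hypothetical reverse-dependency graph (in practice the changed-file list would come from something like `git diff --name-only` between two branches):

```python
# Hypothetical illustration of computing affected targets from changed files.
def affected(changed_files, rdeps):
    """rdeps: file or target -> list of targets that depend on it."""
    seen, stack = set(), list(changed_files)
    while stack:
        node = stack.pop()
        for dependent in rdeps.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen

rdeps = {"util/geo.py": ["lambda_eta", "lambda_pricing"],
         "lambda_eta": ["deploy_eta"]}
print(affected(["util/geo.py"], rdeps))
# {'lambda_eta', 'lambda_pricing', 'deploy_eta'}: only these need redeploying
```
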
internal and external dependencies. The tooling generally implicitly assumes one top-level setup.py, one top-level pyproject.toml - that's how you're configuring things. So especially if you're using Python, say for Django or Flask apps or for data science, and your codebase is growing, and you've hired a bunch of data scientists, and there's more and more code going in there - with Python, you need to start thinking about what tooling allows you to scale that codebase.
No, I think I mostly resonate with that. The first question that comes to my mind - let's talk specifically about the deployment problem. If you're deploying to multiple AWS Lambdas or cloud functions or whatever, the first thought that comes to my mind is that I can use separate Docker images, which let me easily produce a container image I can ship independently. Would you say that's not enough? I totally get that for the build-time problem a Docker image is not going to solve anything. But how about the deployment step?
[32:02] So again, with deployments, I think there are two ways a tool like this can really speed things up. One is to only build the things that actually need to be redeployed - and because the tool understands dependencies and can do change analysis, it can figure that out. So one of the things Pants v2 does is integrate with Git, and it natively understands how to work with Git diffs. You can say something like: show me all the, whatever, lambdas, let's say, that are affected by changes between these two branches. [32:46] And it understands: these files changed, and I understand the transitive dependencies of those files, so I can see what actually needs to be redeployed. In many cases, many things will not need to be redeployed, because they haven't changed. The other thing is a lot of performance improvements and process improvements around building those images. For example, for Python specifically, we have an executable format called PEX, which stands for Python EXecutable: a single file that embeds all of the Python code needed for your deployable, and all of its transitive external requirements, bundled up into a single, sort of self-executing file. This allows you to do things like: if you have to deploy 50 of these, you can basically have a single Docker image, [33:52] and then on top of that you add one layer for each of the 50, where the only difference in that layer is the presence of that PEX file. Whereas without all this, typically you would have 50 Docker images, in each of which you have to build a virtualenv, which means running [34:15] pip as part of building the image - and that gets slow and repetitive, and you have to do it 50 times. So even if you are deploying 50 different Docker images, we have ways of speeding that up quite dramatically, because of things like dependency analysis, the PEX format, and the ability to build incrementally.
Yeah, I remember that at Dropbox we came up with our own "par" format to bundle up a Python binary - I think par stood for Python archive, I'm not entirely sure. But it did something remarkably similar, to solve exactly this problem. It just takes so long otherwise, especially if you have a large Python codebase. I think that makes sense to me.
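A rough sketch of how those two ideas look in practice. `pex_binary` is a real Pants v2 target type, but the names and entry points below are invented:

    # BUILD - one self-contained PEX per deployable; dependencies are inferred
    pex_binary(name="billing_lambda", entry_point="billing.handler")
    pex_binary(name="reports_lambda", entry_point="reports.handler")

The change analysis described above is exposed on the command line - along the lines of `pants --changed-since=origin/main list` (optionally widened to transitive dependents), which prints only the targets affected by your diff. Exact flag spellings vary a little across Pants versions.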
The other thing one might ask is: with Python, you'd guess you don't really have too long of a build time, because there's nothing to build. Maybe MyPy takes some time to do some static analysis, and of course your tests can take forever, and you don't want to rerun them. But there isn't that much of a build time that you have to think about. Would you say you agree with that, or are there issues that end up happening on real-world codebases?
[35:37] Well, that's a good question. The word "build" means different things to different people, and we've recently taken to using the term CI more, because I think it's clearer to people what that means. But when I say build, or CI, I mean it in the extended sense: everything you do to go from human-written source code to a verified, tested, deployable artifact. And so it's true that for Python there's no compilation step - although, arguably, running MyPy is really important, and now that I'm really in the habit of using MyPy, I will probably never not use it on Python code ever again. So there are [36:28] build-ish steps for Python, such as type checking, or running code generators like Thrift or Protobuf. And obviously a big, big one is resolving third-party dependencies, such as running pip or Poetry or whatever it is you're using. So those are all build steps. But with Python, really the big, big, big thing is testing and packaging, and primarily testing. With Python, you have to be even more rigorous about unit testing than you do with other languages, because you don't have a compiler that is catching whole classes of bugs - and again, MyPy and type checking do really help with that. So "build" to me, in the large sense, includes running tests, includes packaging, and includes all the quality control you typically run in CI or on your desktop in order to say: well, I've made some edits, and here's the proof that these edits are good and I can merge or deploy them.
[37:35] I think that makes sense to me. And I certainly saw it: with the limited amount of type checking you can do with Python - MyPy is definitely improving on this - you just need to unit test a lot to get the same amount of confidence in your own code, and unit tests are not cheap. The biggest question that comes to my mind is: is Pants v2 focused on Python? Because I have a TypeScript codebase at my workplace, and I would love to replace the TypeScript compiler with something slightly smarter that could tell me: you know what, you don't need to run every unit test on every change.
[38:16] Great question. So when we launched Pants v2, which was two years ago, we focused on Python; that was the initial language we launched with, because you had to start somewhere. And in the nearly ten years between the very Scala-centric work we were doing on Pants v1 and the launch of Pants v2, something really major happened in the industry, which was that Python skyrocketed in popularity. Python went from being mostly the little scripting language around the edges of your quote-unquote real code - using Python like fancy Bash - to people building massive, multi-billion-dollar businesses entirely on Python codebases. And a few things drove this. The biggest one, I would say, was probably that Python became the language of choice for data science, and we have strong support for those use cases.
Another was that Django and Flask became very popular for writing web apps. And there were more and more intricate DevOps use cases - Python is very popular for DevOps, for various good reasons. So [39:28] Python became super popular, and that was the first thing we supported in Pants v2. But we've since added support for Go, Java, Scala, Kotlin, and Shell. What we definitely don't have yet is JavaScript and TypeScript. We are looking at that very closely right now, because it is the very obvious next thing we want to add. Actually, if any listeners have strong opinions about what that should look like, we would love to hear from them - or from you - on our Slack channels, or on our GitHub discussions, where we are having some lively discussions about exactly this. Because the JavaScript [40:09] and TypeScript ecosystem is already very rich with tools, and we want to provide only added value, right? We don't want to say: here's another paradigm you have to adopt; you've just finished replacing, you know, npm with Yarn, and now you have to do this other thing. We don't want to be another flavor of the month. We only want to do work that uses those tools and leverages the existing ecosystem, but adds value. This is what we did with Python, and it's one of the reasons our Python support is very, very strong - much stronger than any comparable tool out there: [40:49] a lot of leaning in on the existing Python tool ecosystem, but orchestrating those tools in a way that brings rigor and speed to your builds.
And I've been using the word "we" a lot, and I just want to clarify who "we" is here. There is Toolchain, the company, where we're working on SaaS and commercial solutions around Pants, which we can talk about in a bit. But there is also a very robust open-source community around Pants that is not tightly held by Toolchain, the company, in the way that some other companies' open-source projects are. We have a lot of contributors and maintainers on Pants v2 who are not working at Toolchain, but are using Pants in their own companies and their own organizations. And so we have a very wide range of use cases and opinions that are brought to bear. And this is very important because, as I mentioned earlier, we are not trying to design a system for one use case, for one company's or team's use case. We are working on a system we want [42:05] adopted over and over and over again at a wide variety of companies. And so it's very important for us to have the contributions and the input from a wide variety of teams and companies and people. And it's very fortunate that we now do.
I mean, on that note, the thing that comes to my mind is another benefit of a scalable build system like Pants or Bazel or Buck: you don't have to learn various different commands when you are spelunking through a codebase, whether it's a Go codebase or a Java codebase or a TypeScript codebase. You just run pants build X, Y, Z, and it can construct the appropriate artifacts for you. At least that was my experience with Bazel. Is that something you're aiming for - does Pants v2 kind of act as this meta layer for various other build systems, or is it much more specific and knowledgeable about the languages itself?
[43:09] I think your intuition is correct.
The idea is that we want you to be able to do something like pants test - give it a path to a directory - and it understands what that means: oh, this directory contains Python code, therefore I should run pytest in this way; and oh, it also contains some JavaScript code, so I should run the JavaScript tests in this way. It basically provides a conceptual layer above all the individual tools that gives you this uniformity across frameworks, across languages. One way to think about this: [43:52] the tools are all very imperative. You have to run each one with a whole set of flags and inputs, and you have to know how to use each one separately. So it's like having just the blades of a Swiss Army knife, with no actual Swiss Army knife. A tool like Pants says: okay, we will encapsulate all of that complexity behind a much simpler command-line interface. So you can run, like I said, pants test, or pants lint, or pants fmt, and it understands: oh, you asked me to format your code; I see that you have Black and isort configured as formatters, so I will run them. And I happen to know that because formatting can change the source files, I have to run the formatters sequentially. But when you ask for lint, nothing is changing the source files, so I know I can run multiple linters concurrently. That sort of logic. And different tools have different ways of being configured, and of telling you what they want to do, but we [44:58] are able to encapsulate all of that away from you, so you get this uniform, simple command-line interface that abstracts away a lot of the specifics of these tools and lets you run simple commands. And the reason this is important is that this extra layer of indirection is partly what allows Pants to apply things like caching [45:25] and invalidation and concurrency. Because the way to think about it is not "I am telling Pants to run tests"; it is "I am telling Pants that I want the results of the tests" - which is a subtle difference. Pants then has the ability to say: well, I don't actually need to run pytest on all these tests, because I have results for some of them cached already, so I will return those from the cache. So that layer of indirection not only simplifies the UI, but provides the point where you can apply things like caching and concurrency.
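To make the shape of that interface concrete, a sketch of the invocations being described (these are real Pants v2 goal names; the path is made up, and `::` is Pants's "everything under here" wildcard):

    pants test src/python/myapp::   # runs the right test runner per language
    pants lint ::                   # independent linters can run concurrently
    pants fmt ::                    # formatters run sequentially; they edit sources

In CI, the same invocations - e.g. `pants lint test ::` - can stand in for hand-written pipeline steps, a point picked up again later in the conversation.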
Yeah, I think every programmer wants to work with declarative tools. I think SQL is one of those things where you don't have to know how the database works - if SQL were somewhat easier, that dream would be fulfilled, but I think we're all getting there. I guess the next question that I have is: what benefit do I get by using the Toolchain SaaS product versus Pants v2 on its own? When I think about build systems, I think about local development, and I think about CI. [46:29] Why would I want to use the SaaS product?
That's a great question. So Pants does a huge amount of heavy lifting, but in the end it is restricted to the resources of the machine on which it's running. So when I talk about the cache, I'm talking about the local cache on that machine; when I talk about concurrency, I'm talking about using the cores on your machine. So maybe your CI machine has four cores and your laptop has eight cores. So that's the amount of concurrency you get - which is not nothing at all, which is great. [47:04] But as I mentioned, I worked at Google for many years, and then at other companies, and distributed systems are my thing - I come from a distributed-systems background. And when you have the problem of a piece of work taking a long time because of single-machine resource constraints, the obvious answer is: distribute the work, use a distributed system. And so that's what Toolchain offers, essentially. [47:30] You configure Pants to point to the Toolchain system, which is currently SaaS - and we will have some news soon about some on-prem solutions. And now the cache that I mentioned is not just "did this test run with these exact inputs before, on my machine, by me, while I was iterating?" but "has anyone in my organization, or any CI run, run this test before with these exact inputs?" So imagine a very common situation: you come in in the morning and pull all the changes that have happened since you last pulled. Those changes presumably passed CI, right? And the CI populated the cache. So now, when I run tests, I can get cache hits from the CI machine.
[48:29] Pretty much, yeah. And then with concurrency, again: let's say that, post-cache, there are still 200 tests that need to be run. I could run them eight at a time on my machine, or the CI machine could run them, say, four at a time on four cores - or I could run 50 or 100 at a time on a cluster of machines. That's where, again, as your codebase gets bigger and bigger, some massive, massive speedups come in. I should mention that the remote execution I just described is something we're about to launch - it is not available today; the remote caching is. The other aspects are things like observability. So when you run builds on your laptop or in CI, they're ephemeral: the output gets lost in the scrollback, and it's just a wall of text that gets lost. [49:39] With Toolchain, all of that information is captured and stored in structured form, so you have the ability to see past builds and build behavior over time, to search builds and drill down into individual builds, and to ask: how often does this test fail? When did this get slow? All that kind of information. So you get this more enterprise-level observability into a very core piece of developer productivity, which is iteration time. The time it takes to run tests, build deployables, and pass all the quality-control checks so that you can merge and deploy code directly relates to time-to-release. It directly relates to some of the core metrics of developer productivity: how long is it going to take to get this thing out the door? And so having the ability both to speed that up dramatically, by distributing the work, and to have observability into what work is going on - that is what Toolchain provides, [51:01] on top of the already, if I may say, pretty robust open-source offering. [51:07] Pants on its own gives you a lot of advantages, but it runs standalone. Plugging it into a larger distributed system really unleashes the full power of Pants, as a client to that system.
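As a sketch of how the remote cache is wired in: Pants v2 exposes global options along these lines in pants.toml (the option names are real, but the endpoint is a placeholder, and details vary by Pants version and cache provider):

    # pants.toml
    [GLOBAL]
    remote_cache_read = true
    remote_cache_write = true
    remote_store_address = "grpcs://cache.example.com:443"  # placeholder endpoint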
No, I think what I'm seeing is this interesting convergence - there are several companies trying to do this for Bazel, like BuildBuddy and EngFlow. And it really sounds like the build system of the future. [51:36] Ten years from now, no one will really be developing on their local machines anymore. There's GitHub Codespaces on one side, where you're doing all your development remotely. [51:46] I've always found it somewhat odd that development that happens locally, and whatever scripts you need to run to provision your CI machine to run the same set of tests, are so different that sometimes you can never tell why something's passing locally and failing in CI, or vice versa. And there really should just be this one execution layer that can say: you know what, I'm going to build at a certain commit, or run at a certain commit - and that's shared between the local user and the CI user. And your CI script becomes something as simple as pants build //..., and it builds the whole codebase for you. So yeah, I certainly feel like the industry is moving in that direction. I'm curious whether you think the same - do you have an even stronger vision of how folks will be developing ten years from now? What do you think it's going to look like?
Oh no, I think you're absolutely right. I think, if anything, you're underselling it. I think this is how all development should be, and will be, in the future, for multiple reasons. One is performance. [52:51] Two is the problem of different platforms. So today, a big thorny problem is: I'm developing on my MacBook, so when I run tests locally - when I run anything locally - it's running on my MacBook. But that's not our deploy target, right? Typically your deploy platform is some flavor of Linux. [53:17] With the distributed-system approach, you can run the work in containers that exactly match your production environments. You don't even have to care about "will my tests pass on macOS?", or need CI that runs on macOS just to make sure developers can pass tests on macOS, as if that were somehow correlated with success in the production environment. You can cut away a whole suite of those problems. Today, frankly - as I mentioned earlier, you can get cache hits on your desktop from CI populating the cache - that is hampered by differences in platform, and by other differences in local setup that we are working to mitigate.
But imagine a world in which build logic is not actually running on your MacBook - or if it is, it's running in a container that exactly matches the container you're targeting. It cuts away a whole suite of problems around platform differences, and allows you to focus on just the platform you're actually going to deploy to. [54:42] And then there's just the speed and performance of being able to work and deploy that way, and the visibility it gives you into the productivity and the operational work of your development team. I really think this absolutely is the future. There is something very strange about how, in the last 15 years or so, so many business functions have had the distributed-systems treatment applied to them. There are these massive, valuable companies providing systems that support sales, systems that support marketing, systems that support HR, operations, product management - every business function - and there need to be more of these that support engineering as a business function. [55:48] And so I absolutely think that the idea that I need a really powerful laptop, so that running my tests can take thirty minutes instead of forty, when in reality it should take three minutes - that's not the future, right? The future, as it has been for so many other systems, is the web: the laptop that I can take anywhere - particularly in these work-from-home times, these work-from-anywhere times - is just a portal into the system that is doing the actual work.
[56:27] Yeah. And there are all these improvements across the stack, right? When I see companies like Vercel, they're saying: what if you use Next.js - we provide the best developer platform for that, and we want to provide caching. Then there are the lower-level systems - the build systems, of course, like Pants and Bazel and all that. And at each layer, we're kind of trying to abstract the problem out. So to me, it still feels like there is a lot of innovation to be done. And I'm also going to be really curious to know whether there are going to be a few winners in this space, or whether it's going to be pretty broken up, with everyone using different tools. It's going to be fascinating either way.
Yeah, that's really hard to know. One thing you mentioned that I think is really important: you said your CI should be as simple as just pants build :: - or in our syntax it would be, sort of, pants test lint, or whatever. I think that's really important. [57:30] Today, one of the big problems with CI - which is still a growing market, as more and more teams realize the value and importance of very aggressive automated quality control - is that configuring CI is really, really complicated.
Every CI provider has its own configuration language, and you have to reason about caching, and manually construct cache keys, to the extent that caching is even possible or useful. There's just a lot of figuring out how to configure and set up CI - and even then, it's just doing the naive thing. [58:18] So there are a couple of interesting companies - Dagger and Earthly - and interesting technologies around simplifying that. But you are still manually creating a lot of configuration. They provide, I would say, a better and more uniform config language that allows you to, for example, run build steps in containers - and that's not nothing at all. [58:43] But you're still manually creating a lot of configuration to run these very coarse-grained, large-scale, long-running build steps. I think the future is something like: my entire CI config, post cloning the repo, is basically pants build ::, because the system does the configuration for you. [59:09] It figures out what that means in a very fast, very fine-grained way, and it does not require you to manually decide on workflows and steps and jobs and how they all fit together - or, if I want to speed this thing up, to manually partition the work somehow and write extra config to implement that partitioning. That is the future, I think. Rather than having the CI layer - say, the CI provider's proprietary config, or Dagger's - and underneath that the build tool, which would be Bazel or Pants v2 or whatever it is you're using (it could still be Make for many companies today, or Maven, or Gradle), I really think the future is the integration of those two layers. In the same way that, as I referenced much, much earlier in our conversation, one thing that stood out to me at Google was that they had the insight to integrate the version-control layer and the build tool to provide really effective functionality there, I think the build tool - the thing that knows about your dependencies - [1:00:29] can take over many of the jobs of the CI configuration layer, in a really smart, really fast way. The future is one where essentially more and more of "how do I set up and configure and run CI" is delegated to the thing that knows about your dependencies, knows about caching, knows about concurrency, and is able to make smarter decisions than you can in a YAML config file.
[1:01:02] Yeah, I'm excited for the time when I, as a platform engineer, get to spend less than 5% of my time thinking about CI and CD, and can focus on other things, like improving our data models, rather than mucking with YAML and Terraform configs.
Yeah. Today we're still a little bit in that state, because we are engineers, and because the tools that we use are themselves made out of software, there's a strong impulse to tinker - a strong impulse to say: well, I want to solve this problem myself, or I want to hack on it, or I should be able to hack on it. And you should be able to hack on it, for sure. But we do deserve more tooling that requires less hacking, and more things and paradigms that have been tested and have survived a lot of tire-kicking. [1:02:00] Will we always need to hack on them a little bit? Yes, absolutely, because of the nature of what we do. I think there are a lot of interesting things still to happen in this space.
Yeah, I think we should end on that happy note, as we go back to our day jobs mucking with YAML. Well, thanks so much for being a guest.
I think this was a great conversation, and I hope to have you on the show again sometime.
Would love that. Thanks for having me - it was fascinating. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    Software at Scale 51 - Usage based Pricing with Puneet Gupta

    Play Episode Listen Later Oct 13, 2022 65:05


Puneet Gupta is the co-founder and CEO of Amberflo, a cloud metering and usage-based pricing platform.Apple Podcasts | Spotify | Google PodcastsIn this episode, we discuss Puneet's fascinating background as an early GM at AWS, and his early experience at Oracle Cloud. We initially discuss why AWS shipped S3 as its first product, before any other services. After that, we go over the cultural differences between AWS and Oracle, and how usage-based pricing and sales tied into each organization's culture and efficiency.Our episode covers all the different ways organizations align themselves better when pricing is directly tied to the usage metrics of customers. We discuss how SaaS subscription models are simply a reworking of traditional software licenses, how vendors can dispel fears around overages due to dynamic pricing models, and even why Netflix should be a usage-based-priced service :-)We don't have show notes, but I thought it would be interesting to link the initial press release for S3's launch, to reflect on how our industry has completely changed over the last few years. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    Software at Scale 50 - Redefining Labor with Akshay Buddiga

    Play Episode Listen Later Sep 8, 2022 75:46


Akshay Buddiga is the co-founder and CTO of Traba, a labor management platform.Apple Podcasts | Spotify | Google PodcastsSorry for the long hiatus in episodes! Today's episode covers a myriad of interesting topics - from being the star of one of the internet's first viral videos, to experiencing the hyper-growth of the somewhat controversial Zenefits, scaling out the technology platform at Fanatics, starting a company, picking an accelerator, only permitting in-person work, facilitating career growth for gig workers, and more!Highlights[0:00] - The infamous Spelling Bee incident.[06:30] - Why pivot to Computer Science after an undergraduate focus in biomedical engineering?[09:30] - Going to Stanford for Management Science and getting an education in Computer Science.[13:00] - Zenefits during hyper-growth. Learning from Parker Conrad.[18:30] - Building an e-commerce platform with reasonably high scale (powering all NFL gear) as a first software engineering gig. Dealing with lots of constraints from the beginning - like multi-currency support - and delivering a complete solution over several years.The interesting seasonality - like Game 7 of the NBA Finals - and the implications for the software engineers maintaining e-commerce systems. Watching all the Super Bowls with coworkers.[26:00] - A large outage, obviously due to DNS routing.[31:00] - Why start a company?[37:30] - Why join OnDeck?[41:00] - Contrary to the current trend, Traba only allows in-person work. Why is that?We go on to talk about the implications of remote work and other decisions on an early startup's product velocity.[57:00] - On being competitive.[58:30] - Velocity is really about not working on the incorrect stuff.[68:00] - What's next for Traba? What's the vision?[72:30] - Building two-sided marketplaces, and the career path for gig workers. This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    Software at Scale 49 - State Management with James Cowling

    Play Episode Listen Later Jun 23, 2022 53:18


James Cowling is the co-founder of Convex, a state management platform for web developers.Apple Podcasts | Spotify | Google PodcastsWe discuss the state of web development in the industry today, and the various approaches to making it easier, contrasting the Hasura and Convex approaches as a good way to illustrate some of the ideas. Hasura lets you skip the web-app, and run queries against the database through GraphQL queries. Convex, on the other hand, helps you stop worrying about databases: no setup or scaling concerns. It's interesting to see how various systems are evolving to help developers reduce the busywork around more and more layers of the stack, and just focus on delivering business value instead.Convex also excels at the developer experience portion - they provide a deep integration with React, use hooks (just like Apollo GraphQL), and seem to have a fully typed (and therefore auto-completable) SDK. I expect more companies will move "up the stack" to provide deeper integrations with popular tools like React.Episode Reading ListThe co-founders of this company led Dropbox's Magic Pocket project.Convex → NetlifyConvex vs. FirebasePrisma This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    Software at Scale 48 - API Gateway Management with Josh Twist

    Play Episode Listen Later Jun 9, 2022 49:36


Josh Twist is the co-founder and CEO of Zuplo, a programmable, developer-friendly API Gateway Management Platform.Apple Podcasts | Spotify | Google PodcastsWe discuss a new category of developer-tools startups - API Gateway Management Platforms. We go over what an API gateway is, why companies use gateways, common pain points in gateway management, and building reliable systems that serve billions of requests at scale. But most importantly, we dive into the story of Josh's UK Developer of the Year 2009 award.Recently, I've been working on the Vanta API and was surprised at how poor the performance and developer experience around Amazon's API Gateway is. It has poor support for rate limiting, and has very high edge latency. So I'm excited for a new crop of companies to provide good solutions in this space.Episode Reading ListAmazon's API GatewayStripe's API - The first ten yearsEnvoyThe Award This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev

    Software at Scale 47 - OpenTelemetry with Ted Young

    Play Episode Listen Later May 26, 2022 93:41


    Ted Young is the Director of Developer Education at Lightstep and a co-founder of the OpenTelemetry project.Apple Podcasts | Spotify | Google PodcastsThis episode dives deep into the history of OpenTelemetry, why we need a new telemetry standard, all the work that goes into building generic telemetry processing infrastructure, and the vision for unified logging, metrics and traces.Episode Reading ListInstead of highlights, I’ve attached links to some of our discussion points.HTTP Trace Context - new headers to support a standard way to preserve state across HTTP requests.OpenTelemetry Data CollectionZipkinOpenCensus and OpenTracing - the precursor projects to OpenTelemetry This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
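For readers new to the project, a minimal sketch of what instrumenting code with OpenTelemetry's Python API looks like (the tracer name, span name, and attribute are invented for illustration):

    from opentelemetry import trace

    # With no SDK configured, this is a no-op tracer, so the snippet runs anywhere;
    # wiring up an SDK and exporter is what turns these spans into real telemetry.
    tracer = trace.get_tracer("example-instrumentation")
    with tracer.start_as_current_span("handle-request") as span:
        span.set_attribute("http.route", "/episodes")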

    Software at Scale 46 - Authorization with Or Weis

    Play Episode Listen Later May 10, 2022 49:05


Or Weis is the CEO and founder of Permit.io, a Permission as a Service platform. Previously, he founded Rookout, a cloud-debugging tool.Apple Podcasts | Spotify | Google PodcastsMany of us have struggled (or are struggling) with permission management in the various applications we've built. The complexity of these systems always tends to increase through business requirements - for example, some content should only be accessed by paid users, or by users in a certain geography. Certain architectures, like filesystems, have hierarchical permissions that enable efficient evaluation, and there's technical complexity that's often unique to the specific application.We talk about all the complexity around permission management, and techniques to solve it, in this episode. We also explore how Permit tries to solve this as a product and abstract the problem away for everyone.Highlights[0:00] - Why work on access control?[02:00] - Sources of complexity in permission management[08:00] - Which cloud system manages permissions well?[11:00] - Product-izing a solution to this problem[17:00] - What kind of companies approach you for solutions to this problem?[22:00] - Why are there research papers written about permission management?[38:00] - Permission management across the technology stack (inter-service communication)[42:00] - What are you excited about building next? This is a public episode. If you would like to discuss this with other subscribers or get access to bonus episodes, visit www.softwareatscale.dev
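As a hedged illustration of how that complexity creeps in one business rule at a time (a toy sketch; all names and fields are invented, and this is not Permit.io's API):

    from dataclasses import dataclass, field

    @dataclass
    class User:
        id: int
        role: str
        plan: str = "free"
        regions: set = field(default_factory=set)

    @dataclass
    class Doc:
        owner_id: int
        region: str = "us"
        paid_only: bool = False

    def can_view(user: User, doc: Doc) -> bool:
        # What starts as a one-line role check...
        if user.role == "admin":
            return True
        # ...accretes business rules: paid content, ownership, geography.
        if doc.paid_only and user.plan != "paid":
            return False
        return doc.owner_id == user.id or doc.region in user.regions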

    Software at Scale 45 - Q/A with Jon Skeet

    Play Episode Listen Later Apr 20, 2022 50:17


    Jon Skeet is a Staff Developer Platform Engineer at Google, working on Google Cloud Platform client libraries for .NET. He's best known for contributions to Stack Overflow as well as his book, C# in Depth. Additionally he is the primary maintainer of the Noda Time date/time library for .NET. You may also be interested in Jon Skeet Facts.Apple Podcasts | Spotify | Google PodcastsWe discuss the intricacies of timezones, how to attempt to store time correctly, how storing UTC is not a silver bullet, asynchronous help on the internet, the implications of new tools like GitHub Copilot, remote work, Jon’s upcoming book on software diagnostics, and more.Highlights[01:00] - What exactly is a Developer Platform Engineer? [05:00] - Why is date and time management so tricky?[13:00] - How should I store my timestamps? We discuss reservation systems, leap seconds, timezone changes, and more.[21:00] - StackOverflow, software development, and more.[27:00] - Software diagnostics[32:00] - The evolution of StackOverflow[34:00] - Remote work for software developers[41:00] - Github Copilot and the future of software development tools[44:00] - What’s your most controversial programming opinion? Subscribe at www.softwareatscale.dev
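A compact illustration of the "storing UTC is not a silver bullet" point, as a sketch in Python (the scenario is invented): for events in the future, store the local wall-clock time plus the zone ID, because the zone's UTC offset can change between now and then.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # A meeting at 09:00 Warsaw time in 2030. If Poland's DST rules change
    # before then, a stored UTC instant would silently shift the local time;
    # keeping wall-clock time + zone preserves the user's intent.
    meeting = datetime(2030, 3, 20, 9, 0, tzinfo=ZoneInfo("Europe/Warsaw"))
    print(meeting.astimezone(ZoneInfo("UTC")))  # derive UTC on demand, don't store it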

    Software at Scale 44 - Building GraphQL with Lee Byron

    Play Episode Listen Later Mar 22, 2022 64:33


Lee Byron is the co-creator of GraphQL, a senior engineering manager at Robinhood, and the executive director of the GraphQL foundation.Apple Podcasts | Spotify | Google PodcastsWe discuss the GraphQL origin story, early technical decisions at Facebook, the experience of deploying GraphQL today, and the future of the project.Highlights(some tidbits)[01:00] - The origin story of GraphQL.Initially, the Facebook application was an HTML web-view wrapper. It seemed like the right choice at the time, with the iPhone releasing without an app store, Steve Jobs calling it an "internet device", and Android phones coming out soon after, with Chrome, a brand-new browser. But the application had horrendous performance, high crash rates, used up a lot of RAM on devices, and animations would lock the phone up. Zuckerberg called the bet Facebook's biggest mistake. The idea was to rebuild the app from scratch using native technologies. A team built a prototype for the news feed, but they quickly realized that there weren't any clean APIs to retrieve data in a palatable format for phones - the relevant APIs all returned HTML. But Facebook had a nice ORM-like library in PHP to access data quickly, and there was a parallel effort to speed up the application by using this library. There was another project to declaratively specify data requirements for this ORM, for increased performance and a better developer experience.Another factor was that mobile data networks were pretty slow, and a chatty REST API for the newsfeed would lead to extremely slow round-trip times and tens of seconds to load the newsfeed. So GraphQL started off as a little library that could make declarative calls to the PHP ORM library from external sources, and was originally called SuperGraph. Finally, the last piece was to make this language strongly typed, drawing on the lessons of other RPC frameworks like gRPC and Thrift.[16:00] So there weren't any data-loaders or any such pieces at the time.GraphQL has generally been agnostic to how the data actually gets loaded, and there are plugins to manage things like quick data loading, authorization, etc. Also, Facebook didn't need data-loading, since its internal ORM managed de-duplication, so it didn't need to be built until there was sufficient external feedback.[28:00] - GraphQL for public APIs - what to keep in mind. Query costing, and other differences from REST.[42:00] - GraphQL as an open-source project[58:00] - The evolution of the language, and the new features Lee is most excited about, like client-side nullability.Client-side nullability is an interesting proposal - clients can explicitly state how important retrieving a certain field is, and on the flip side, allow partial failures for fields that aren't critical. Subscribe at www.softwareatscale.dev

    Software at Scale 43 - Growth at Loom with Harshyt Goel

    Play Episode Listen Later Mar 1, 2022 43:58


    Harshyt Goel is a founding engineer and engineering manager of Platform and Integrations at Loom, a video-messaging tool for workplaces. He’s also an angel investor, so if you’re looking for startup advice, investments, hiring advice, or a software engineering job, please reach out to him on Twitter.Apple Podcasts | Spotify | Google PodcastsWe discuss Loom’s story, from when it had six people and a completely different product, to the unicorn it is today. We focus on driving growth, complicated product launches, and successfully launching the Loom SDK.Highlights[00:30] - How it all began[03:00] - Who is a founding engineer? Coming from Facebook to a 5 person startup[06:00] - Company inflection points.[10:30] - Pricing & packaging iterations.[14:30] - Running growth for a freemium product, and the evolution of growth efforts at Loom[30:00] - Summing up the opportunities unlocked by a growth team[33:00] - Sometimes, reducing user friction isn’t what you want.[34:30] - The Loom SDK, from idea to launch. Subscribe at www.softwareatscale.dev

    Software at Scale 42 - Daniel Stenberg, founder of curl

    Play Episode Listen Later Feb 10, 2022 46:40


Daniel Stenberg is the founder and lead developer of curl and libcurl.Apple Podcasts | Spotify | Google PodcastsThis episode, along with others like it, reminds me of the XKCD about all of modern digital infrastructure resting on a project some random person has been thanklessly maintaining for years.We dive into all the complexity of transferring data across the internet.Highlights[00:30] - The complexity behind HTTP. What goes on behind the scenes when I make a web request?[11:30] - The organizational work behind internet-wide RFCs, like HTTP/3.[20:00] - Rust in curl. The developer experience, and the overall experience of integrating Hyper.[30:00] - WebSocket support in curl[34:00] - Fostering an open-source community.[38:00] - People around the world think Daniel has hacked their system, because of the curl license often included in malicious tools.[41:00] - Does curl have a next big thing? Subscribe at www.softwareatscale.dev

    Software at Scale 41 - Minimal Entrepreneurship with Sahil Lavingia

    Play Episode Listen Later Jan 25, 2022 59:05


    Sahil Lavingia is the founder of Gumroad, an e-commerce platform that helps you sell digital services. He also runs SHL Capital, a rolling fund for early-stage startups.Apple Podcasts | Spotify | Google PodcastsSahil’s recent book, Minimal Entrepreneurship, explores a framework for building profitable, sustainable companies. I’ve often explored the trade-off between software engineering and trying to build and launch my own company, so this conversation takes up that theme and explores what it means to be a minimal entrepreneur for a software engineer.Highlights(edited)Utsav: Let’s talk about VCs (referencing your popular blog post “Reflecting on My Failure to Build a Billion-Dollar Company”). Are startups pushed to grow faster and faster due to VC dynamics, or is there something else going on behind the scenes?It’s a combination of things. People who get caught up in this anti-VC mentality are missing larger forces at play because I don't really think it's just VCs who are making all of these things happen. Firstly, there’s definitely a status game being played. When I first moved to the Bay Area, as soon as you mention you’re working on your own, the first question people ask you is how far along your company is, who you raised money with, how many employees you have, and comparing you with other people they know. You can’t really get too upset at that, since that’s the nature of the people coming to a boomtown like San Francisco.The way I think about it, there’s a high failure rate in being able to build a billion-dollar company, so you want to find out reasonably quickly whether you will succeed or not. Secondly, we’re in a very unique industry, where equity is basically the primary source of compensation. 90% of Americans don’t have some sort of equity component in the businesses they work for, but giving equity has a ton of benefits. It’s great to have that alignment, and folks who take an early risk for your company should get rewarded.  The downside of equity is that it creates this very strong desire and incentive to make your company as valuable as possible, as quickly as possible. In order to get your equity to be considered valuable to investors, you need to grow quickly, so that investors use these models that project your growth rate to form your valuation.Many people took my blog to say - it’s the VC’s fault, but that’s not true. The VCs let me do what I wanted, they don’t really have that much power. The issue was that in order for employees to see a large outcome, you need the company to have a large exit. As a founder, you’d do pretty well if the company sold for $50 million dollars, but that’s not true for employees, they really need this thing to work, otherwise, the best ones can just go work for the next Stripe. So you have this winner-take-all behavior for employees, and it’s ultimately why I ended up shrinking the company to just me for a while.Utsav: So do you give employees equity in the minimalist entrepreneurship framework?Firstly: avoid hiring anyone else for as long as possible, until you know you have some kind of product-market fit. I think It depends on your liquidity strategy. How are you as a founder about to make money from this business? The way you incentivize your employees should align with that. If you want to sell your company for a hundred million dollars, consider sharing that and giving equity. 
If you plan to create a cash-cow business, consider profit sharing.Utsav: What, if any, is the difference between indie-hacking and minimalist entrepreneurship?They're pretty similar. Indie hacker seems like a personality, perhaps similar to a digital nomad, where the lifestyle takes precedence. I went to MicroConf in Las Vegas, and the attendees' goals were fairly consistent - to buy a nice house and spend more time with their family. In that case, your goal should be to build the most boring but profitable business possible, for a community you don't particularly care about, because your goals have nothing to do with serving that community - which is totally fine. No value judgments from me. With indie-hacking, it seems more geared around independence. I tried living the digital-nomad life - work solo, travel the world, no schedule - but I didn't actually enjoy it. It wasn't really satisfying. I like working on a project with many people, where things improve, and I get to learn from others and they learn from me; I like talking to my customers, who I can talk to frequently, and their lives are getting better because of my work. I enjoy that. So I wanted a middle ground between the "live on a beach" mentality and the blitzscaling, build-the-next-Facebook mentality. I like to think that with things like crowdfunding, this will get more and more feasible.Even though my article went viral and the ideas often resonated, there's this aspirational aspect to many humans - they want to build something amazing and big. It's kind of the Steve Jobs "make a dent in the universe" idea, even though he might not have actually said that. To account for that, I think incorporating some of the indie-hacker principles into the startup path might actually be the most applicable and accessible solution for people.Utsav: One of the key ideas in the book that stands out to me, as someone who's a software engineer, is that you can keep trying projects on the side. And eventually, if you're doing things right, if you're talking to customers, you will hit something that people want to buy or to use, right? You're not going to get it right the first time, probably. But I think that's a really important idea. Could you elaborate on that?There are two kinds of people. One kind builds a lot of stuff but doesn't know who it's for: another to-do-list app, a meditation app, you name it. So you build it, but then you can't figure out who'll use it. The other kind is stuck in analysis paralysis, and can't really hone in on an idea they want to commit to. The solution to both these personas is to forget about business and immerse yourself in the communities you care about, and try to help them. Focus on contributing to these communities. These could be Slack/Discord communities. For me, it was Hacker News, Dribbble, and IndieHackers. There's a subreddit for everything.Start being a part of these communities, first by listening, and eventually by contributing. I can guarantee that if you become a useful part of the community and you share ideas, people will come up to you and talk about problems that they're facing. For example, they're getting paid by YouTube to produce fitness videos, but have to wait for the end of the month, and they'd really like to get paid instantly. Once a community trusts you, and you solve a problem for a specific set of people, you can instantly validate good ideas and deliver value.
And iterating over ideas with this community can give you a good chance of success.Listen to the audio for the full interview! Subscribe at www.softwareatscale.dev

    Software at Scale 40 - Talent Management with Nikita Gupta

    Play Episode Listen Later Jan 7, 2022 35:35


    Nikita Gupta is a Co-Founder & CTO at Symba, a platform that helps manage talent development programs like internships.Internships are one of the most effective ways for hiring at a software company, but there’s a lot of work that goes into managing successful interns. With hiring getting harder across the industry due to increased competition and funding, I thought it would be interesting to dive into understanding how to manage successful internship programs.Highlights0:30 - What is Symba?1:30 - Starting with the hot-takes. So, are college degrees overrated now?5:30 - Why do I need a software platform to manage internships?8:50 - Why do companies generally need to manage 8 - 10 platforms for internships? What have you seen?10:30 - As a software engineer or manager, how do I make my intern successful?13:30 - Cadence of check-ins16:30 - With remote interns, how do you build a successful community?18:50 - How do I measure the success/efficacy of my internship program?21:00 - How do I know that my intern mentors/hosts are doing a good job?25:00 - What are some concrete steps that I can take to increase my intern pool’s diversity? What should I track?27:30 - What are some trends in the intern hiring space?32:00 - Government investments in internship programs33:00 - What’s your advice to the first-time intern mentor/host? Subscribe at www.softwareatscale.dev

    Software at Scale 39 - Infrastructure Security with Guy Eisenkot

    Play Episode Listen Later Dec 16, 2021 45:25


    Guy Eisenkot is a Senior Director of Product Management at BridgeCrew by Prisma Cloud and was the co-founder of BridgeCrew, an infrastructure security platform.We deep dive into infrastructure security, Checkov, and BridgeCrew in this episode. I’ve personally been writing Terraform for the last few weeks, and it often feels like I’m flying blind from a reliability/security perspective. For example, it’s all too easy to create an unencrypted S3 bucket in Terraform which you’ll only find out about when it hits production (via security tools). So I see the need for tools that lint my infrastructure as code more meaningfully, and we spend some time talking about that need.We also investigate “how did we get here”, unravel some infrastructure as code history and the story behind Checkov’s quick popularity. We talk about how ShiftLeft is often a painfully overused term, the security process in modern companies, and the future of security, in a world with ever-more infrastructure complexity.Highlights00:00 - Why is infrastructure security important to me as a developer?05:00 - The story of Checkov09:00 - What need did Checkov fulfil when it was released?10:30 - Why don’t tools like Terraform enforce good security by default?15:30 - Why ShiftLeft is a tired, not wired concept.20:00 - When should I make my first security hire?24:00 - Productizing what a security hire would do.27:00 - Amazon CodeGuru but for security fixes - Smart Fixes.33:00 - Is it possible to write infrastructure as code checks in frameworks like Pulumi?37:00 - Not being an early adopter when it comes to infrastructure tools.40:00 - The Log4J vulnerability, and the security world moving forward. Subscribe at www.softwareatscale.dev
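For context on how lightweight adopting such a scanner can be, Checkov's basic usage is a single command pointed at a directory of infrastructure code (the path here is illustrative):

    # scan Terraform/CloudFormation/etc. in a directory for misconfigurations,
    # e.g. the unencrypted S3 bucket mentioned above
    checkov -d ./terraform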

    Software at Scale 38 - Hasura with Tanmai Gopal

    Play Episode Listen Later Dec 2, 2021 69:09


    Tanmai Gopal is the founder of Hasura, an API as a service platform. Hasura lets you skip writing API layers and exposes automatic GraphQL APIs that talk to your database, trigger external actions, and much more.We talk about the implementation of a “compiler as a service”, the implications of primarily having a Haskell production codebase, their experience with GraphQL, hiring product managers technical enough to build useful features, some new and upcoming Hasura features, and riff about the current state of front-end development and `npm` hell.Highlights00:20 - What does the name Hasura mean?02:00 - What does Hasura do?04:00 - Why build this layer of the stack? 08:00 - How to deal with authentication if APIs are exposed directly via the database.26:00 - Does Hasura make production applications faster?33:00 - JSON Aggregation in modern databases38:00 - Why Haskell?44:00 - How do you write quality Haskell? How does hiring for Haskell positions work out in practice?55:00 - Application servers do much more than just talk to databases. How does Hasura provide escape hatches so that non-database interactions (for eg: talking to Stripe) work out? Subscribe at www.softwareatscale.dev

    Software at Scale 37 - Building Zerodha with Kailash Nadh

    Play Episode Listen Later Nov 16, 2021 48:59


Kailash Nadh is the CTO of Zerodha, India's largest retail stockbroker. Zerodha powers a large volume of stock trades - roughly 15-20% of India's daily volume, which is significantly more daily transactions than Robinhood.Apple Podcasts | Spotify | Google PodcastsThe focus of this episode is the technology and mindset behind Zerodha - the key technology choices, challenges faced, and lessons learned while building the platform over several years. As described on the company's tech blog, Zerodha has an unconventional approach to building software - open-source centric, relatively few deadlines, an incessant focus on resolving technical debt, and extreme autonomy for the small but efficient technology team. We dig into these and learn about the inner workings of one of India's premier fintech companies.Highlights[00:43]: Can you describe the Zerodha product? Could you also share any metrics that demonstrate the scale, like the number of transactions or number of users?Zerodha is an online stockbroker. You can download one of the apps and sign up to buy and sell shares in the stock market, and invest. We have over 7 million customers, on any given day we have over 2 million concurrent users, and this week we broke our record for the number of trades handled in a day - 14 million, which represented over 20% of all Indian stock-trading activity.[03:00] When a user opens the app at 9:15 in the morning to see trade activity and place a trade, what happens behind the scenes? Life of a Query, Zerodha Edition[05:00] What exactly is the risk management system doing? Can you give an example of where it will block a trade?What is the risk management system doing?The most critical check is a margin check - whether you have enough purchasing power (margin) in your account. With equities, it's a simple linear check of whether you have enough, but for derivatives it's more involved: if you already have some futures and options in your account, the risk is variable based on that pre-existing position.What does the reconciliation process look like with the exchange?We have a joke in our engineering team that we're just CSV engineers, since reconciliation in our industry happens via several CSV files that are distributed at the end of the trading day.[08:40] Are you still using PostgreSQL for storing data?We still use (abuse) PostgreSQL, with hundreds of billions of rows of data, sharded several ways.[09:40] In general, how has Zerodha evolved over time, from the v0 of the tech product to today?From 2010 to 2013, there was no tech team, and Zerodha's prime value-add was a discount pricing model. We had vendor products that let users log in and trade, and the competition was on pricing. But they worked at 1/10,000th the scale that we operate at today, for a tiny fraction of the userbase. To give a sense of their maturity, they only worked on Internet Explorer 6.So in late 2014, we built a reporting platform that replaced this vendor-based system. We kept on replacing systems and dependencies, and the last piece left is the OMS - the Order Management System. We've had a project to replace the OMS ongoing for 2.5 years and are currently running an internal beta; once this is complete, we will have no external dependencies.The first version of Kite, written in Python, came out in 2015. Then, we rewrote some of the services in Go.
We now have a ton of services that do all sorts of things like document verification, KYC, payments, banking, integrations, trading, PNL, number crunching and analytics, visualizations, mutual funds - absolutely everything you can imagine.[13:55] Why is it so tricky to rebuild an Order Management System?There's no spec out there for building an Order or a Risk Management System. A margin check is based on mathematical models that take a lot of different parameters into account.We're doing complex checks based on mathematical models that we've reverse-engineered after years of experience with the system, as well as by developing deep domain knowledge in the area.And once we build out the new system, we cannot simply migrate away from the old one, due to the high consequences of potential errors. So we need to test and migrate piecemeal from the old system.[13:55] One thing you notice when using Zerodha is how fast it feels compared to standard web applications. This needs focus on both backend and frontend systems. To start with, how do you optimize your backends for speed?When an application is slow (data takes more than a second to load), it's perceptible, and can be annoying for users. So we're very particular about making everything as fast as possible, and we've set high benchmarks for ourselves. We set an upper limit on mean latency for users of no more than 40 milliseconds, which seems to work well for us, given all the randomness from the internet. Then, all the code we write has to meet this benchmark.In order to make this work, there's no black magic, just common-sense principles. For the core flow of the product, everything is retrieved from in-memory databases, and nothing touches disk in the hot path of a request.Serialization is expensive. If you have a bunch of orders and you need to send those back, serializing and deserializing takes time. So when events take place, like a new order being placed, we serialize once and store the result in an in-memory database. And then when an HTTP request comes in from a user, instead of a database lookup and various transforms, the application reads directly from in-memory databases and writes it to the browser.Then, we have a few heuristics. For fetching really old reports that

    Software at Scale 36 - Decomposing Monoliths with Ganesh Datta

    Play Episode Listen Later Nov 2, 2021 43:28


Ganesh Datta is the CTO and co-founder of Cortex, a microservice management platform. Apple Podcasts | Spotify | Google Podcasts. We continue the age-old monolith/microservice debate and dig into why companies seem to like services so much (I’m generally cautious about such migrations). Ganesh has a ton of insights into developer productivity and tooling to make engineering teams successful that we dive into. Highlights 00:00 - Why solve the service management problem? 06:00 - When to drive a monolith → service migration? What inflection points should one think about to make that decision? 08:30 - What would Ganesh do differently in his next service migration? 10:30 - What tools are useful when migrating to services? 12:00 - Standardizing infrastructure to facilitate migrations. How much should one standardize (à la Google) versus letting teams make their own decisions (à la Amazon)? 17:30 - How does a tool like Cortex help with these problems? 21:30 - How opinionated should such tools be? How much user education is part of building such tools? 27:00 - What are the key cultural components of successful engineering teams? 31:00 - Tactically, what does good service management look like today? 37:00 - What’s the cost/benefit ratio of shipping an on-prem product vs. a SaaS tool? 41:30 - What would your advice be for the next software engineer embarking on their monolith → microservice migration? Subscribe at www.softwareatscale.dev

    Software at Scale 35 - Maintaining Git with Johannes Schindelin

    Play Episode Listen Later Oct 20, 2021 55:40


Johannes Schindelin is the maintainer (BDFL) of Git for Windows. Apple Podcasts | Spotify | Google Podcasts. Git is a fundamental piece of the software community, and we get to learn the history and inner workings of the project in this episode. Maintaining a widely-used open source project involves a ton of expected complexity around handling bug reports, deprecations, and inclusive culture, but also requires management of inter-personal relationships, ease of contribution, and other aspects that are fascinating to learn about. Highlights 00:06 - How did Johannes end up as the maintainer of Git for Windows? 06:30 - The Git community in the early days. Fun fact: Git used to be called `dircache`. 08:30 - How many downloads does Git for Windows get today? 10:15 - Why is Git for Windows a separate project? Why not make improvements to Git itself? 24:00 - How do you deprecate functionality when there are millions of users of your product and you have no telemetry? 30:00 - What does being the BDFL of a project mean? What does Johannes’ day-to-day look like? 33:00 - What is GitGitGadget? How does it make contributions easier? 41:00 - How do you foster an inclusive community around an open-source project? 50:00 - What’s next for Git? Get on the email list at www.softwareatscale.dev

    Software at Scale 34 - Faster Python with Guido van Rossum

    Play Episode Listen Later Oct 5, 2021 31:11


Guido van Rossum is the creator of the Python programming language and a Distinguished Engineer at Microsoft. Apple Podcasts | Spotify | Google Podcasts. We discuss Guido’s new work on making CPython faster (PEP 659), tiers of Python interpreter execution, and high-impact, low-hanging-fruit performance improvements. Highlights (an edited summary) [00:21] What got you interested in working on Python performance? Guido: In some sense, it was probably a topic that was fairly comfortable to me because it means working with a core of Python, where I still feel I know my way around. When I started at Microsoft, I briefly looked at Azure but realized I never enjoyed that kind of work at Google or Dropbox. Then I looked at machine learning, but it would take a lot of time to do something interesting with the non-Python, and even Python-related, bits. [02:31] What was different about the set of Mark Shannon’s ideas on Python performance that convinced you to go after them? Guido: I liked how he was thinking about the problem. Most of the other approaches to Python performance, like PyPy and Cinder, are not suitable for all use cases since they aren’t backward compatible with extension modules. Mark has the perspective and experience of a CPython developer, as well as a viable approach that would maintain backward compatibility, which is the hardest problem to solve. The Python bytecode interpreter is modified often across minor releases (e.g., 3.8 → 3.9) for various reasons, like new opcodes, so modifying that is a relatively safe approach. Utsav: [09:45] Could you walk us through the idea of the tiers of execution of the Python interpreter? Guido: When you execute a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in dynamic languages like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing code as quickly as you can. So that if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code. So, what we try to do there is keep the bytecode compiler simple so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, under some definition of “hot”. For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, “Well, it's only hot if it's been called 1000 times.” The specializing adaptive compiler (PEP 659) then tries to replace certain bytecodes with bytecodes that are faster, but only work if the types of the arguments are specific types. A simple hypothetical example is the plus operator in Python. It can add lots of things like integers, strings, lists, or even tuples. On the other hand, you can't add an integer to a string.
So, the optimization step - often called quickening, but usually in our context, we call it specializing - is to have a separate “binary add integer” bytecode, a second-tier bytecode hidden from the user. This opcode assumes that both of its arguments are actual Python integer objects, reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack. The binary add integer operation still has to make a type check on the arguments. So, it's not completely free, but a type check can be implemented much faster than a completely generic object-oriented dispatch, like what normally happens for most generic add operations. Finally, it's always possible that a function is called millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. At that point, the interpreter will simply execute the original bytecode. That's an important part so that you still have the full Python semantics. Utsav: [18:20] Generally you hear of these techniques in the context of a JIT, a Just-In-Time compiler, but that’s not being implemented right now. Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid. In our case, it’s unclear what and when we’re exactly compiling. At some point ahead of program execution, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which part would you call Just-In-Time? Also, it’s often assumed that Just-In-Time compilation automatically makes all your code better. Unfortunately, you often can't actually predict what the performance of your code is going to be. And we have enough of that with modern CPUs and their fantastic branch prediction. For example, we write code in a way that we think will clearly reduce the number of memory accesses. When we benchmark it, we find that it runs just as fast as the old unoptimized code because the CPU figured out the access patterns without any of our help. I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching because that is absolute magic. Full Transcript Utsav: [00:14] Thank you, Guido, for joining me on another episode of the Software at Scale podcast. It's great to have you here. Guido: [00:20] Great to be here on the show. Utsav: [00:21] Yeah. And it's just fun to talk to you again. So, the last time we spoke was at Dropbox many, many years ago. And then you retired, and then you decided that you wanted to do something new. And you work on performance now at Microsoft, and that's amazing. So, to start off with, I just want to ask you - you could pick any project that you wanted to, based on some slides that I've seen. So, what got you interested in working on Python performance? Guido: [00:47] In some sense, it was probably a topic that was fairly comfortable to me because it means working with a core of Python, where I still feel I know my way around. Some other things I considered briefly in my first month at Microsoft - I looked into, “Well, what can I do with Azure?”, and I almost immediately remembered that I was not cut out to be a cloud engineer. That was never the fun part of my job at Dropbox. It wasn't the fun part of my job before that at Google either. And it wouldn't be any fun to do that at Microsoft. So, I gave up on that quickly.
I looked into machine learning, which I knew absolutely nothing about when I joined Microsoft. I still know nothing, but I've at least sat through a brief course and talked to a bunch of people who know a lot about it. And my conclusion was actually that it's a huge field. It is mostly mathematics and statistics, and there is very little Python content in the field. And it would take me years to do anything interesting with the non-Python part, and probably even with the Python part, given that people just write very simple functions and classes, at best, in their machine learning code. But at least I know a bit more about the terminology that people use. And when people say kernel, I now know what they mean. Or at least I'm not confused anymore as I was before. Utsav: [02:31] That makes sense. And that is very similar to my experience with machine learning. Okay, so then you decided that you want to work on Python performance, right? And then you are probably familiar with Mark Shannon's ideas? Guido: [02:43] Very much so. Yeah. Utsav: [02:44] Yeah. So, was there anything different about the set of ideas that made you decide that this makes sense and I should work on a project to implement these ideas? Guido: [02:55] Mark Shannon's ideas are not unique, perhaps, but I know he's been working on them for a long time. I remember many years ago, I went to one of the earlier Python UK conferences, where he gave a talk about his PhD work, which was also about making Python faster. And over the years, he's never stopped thinking about it. And he sort of has a holistic attitude about it. Obviously, the results remain to be seen, but I liked what he was saying about how he was thinking about it. And if you take PyPy, it has always sounded like PyPy is sort of a magical solution that only a few people in the world understand how it works. And those people built that and then decided to do other things. And then they left it to a team of engineers to solve the real problems with PyPy, which are all in the realm of compatibility with extension modules. And they never really solved that. [04:09] So you may remember that there was some usage of PyPy at Dropbox because there was one tiny process where someone had discovered that PyPy was actually so much faster that it was worth it. But it had to run in its own little process and there was no maintenance. And it was a pain, of course, to make sure that there was a version of PyPy available on every machine. Because for the main Dropbox application, we could never switch to PyPy because that depended on 100 different extension modules. And just testing all that code would take forever. [04:49] I think since we're talking about Dropbox, Pyston was also an interesting example. They've come back, actually; you've probably heard that. The Pyston people were much more pragmatic, and they've learned from PyPy’s failures. [05:04] But they have always taken this attitude of, again, “we're going to start with CPython,” which is good because that way they are sort of guaranteed compatibility with extension modules. But still, they made these huge sets of changes, at least in Pyston 1, and they had to roll back a whole bunch of things because, again, of compatibility issues. They had a bunch of very interesting improvements to the garbage collection - I think they got rid of the reference counting - and because of that, the behavior of many real-world Python programs was completely changed.
[05:53] So why do I think that Mark's work, or Mark's ideas, will be different? Well, for one, because Mark has been a Python core developer for a long time. And so, he knows what we're up against. He knows how careful we have to be with backwards compatibility. And he knows that we cannot just get rid of reference counting or change the object layout. Like, there was a project that was recently released by Facebook that basically was born dead, or at least it was revealed to the world in its dead form - Cinder - which was a significantly faster Python implementation, but many of its optimizations came from changes in object layout that just aren't compatible with extension modules. And Mark has carved out these ideas that work on the bytecode interpreter itself. [06:58] Now, the bytecode is something where we know that changing it is not going to affect third-party extension modules too much, because the bytecode changes in every Python release. And the internals of the bytecode interpreter change in every Python release. And yes, we still run into the occasional issue. Every release, there is some esoteric hack that someone is using that breaks. And they file an issue in the bug tracker because they don't want to research, or they haven't yet researched, what exactly is the root cause of the problem, because all they know is their users say, “My program worked in Python 3.7, and it broke in Python 3.8. So clearly, Python 3.8 broke something.” And since it only breaks when they're using Library X, it must be maybe Library X's fault. But Library X's maintainers don't know exactly what's going on, because the user just says it doesn't work or gives them a thousand-line traceback. And they bounce it back to core Python, and they say, “Python 3.8 broke our library for all our users, or 10% of our users,” or whatever. [08:16] And it takes a long time to find out, “Oh, yeah, they're just poking inside one of the standard objects, using maybe information they gleaned from internal headers, or they're calling a C API that starts with an underscore.” And you're not supposed to do that. Well, you can do that, but then you pay the price, which is that you have to fix your code at every next Python release. And in between, for bug fix releases - like if you go from 3.8.0 to 3.8.1, all the way up to 3.8.9 - we guarantee a lot more: the bytecode stays stable. But 3.9 may break all your hacks because it changes the bytecode. One thing we did, I think in 3.10, was that all the jumps in the bytecode are now counted in instructions rather than bytes, and instructions are two bytes. Otherwise, the instruction format is the same, but all the jumps jump a different distance if you don't update your bytecode. And of course, the Python bytecode compiler knows about this. But people who generate their own bytecode, as the ultimate Python hack, would suffer. Utsav: [09:30] So the biggest challenge by far is backwards compatibility. Guido: [09:34] It always is. Yeah, everybody wants their Python to be faster until they find out that making it faster also breaks some corner case in their code. Utsav: [09:45] So maybe you can walk us through the idea of the tiers of execution, or tiers of the Python interpreter, that have been described in some of those slides. Guido: [09:54] Yeah, so that is a fairly arbitrary set of goals that you can use for most interpreted languages. [10:02] And it's actually a useful way to think about it.
And it's something that we sort of plan to implement; it's not that there are actually currently tiers like that. At best, we have two tiers, and they don't map perfectly to what you saw in that document. But the basic idea is - I think this is also implemented in .NET Core, but again, I don't know if it's documented or if it's just how their optimizer works. So, when you just start executing a program, you don't know if it's going to crash after running a fraction of a millisecond, or whether it's going to be a three-week-long computation. Because it could be the same code, just in the first case, it has a bug. And so, if it takes three weeks to run the program, maybe it would make sense to spend half an hour ahead of time optimizing all the code that's going to be run. But obviously, especially in a dynamic language like Python, where we do as much as we can without asking the user to tell us exactly how they need it done, you just want to start executing the code as quickly as you can. So that if it's a small script, or a large program that happens to fail early, or just exits early for a good reason, you don't spend any time being distracted by optimizing all that code. [11:38] And so if this were a statically compiled language, the user would have to specify that when they run the compiler - they say, “Well, optimize for speed, O2, O3, or maybe optimize for debugging, O0.” In Python, we try not to bother the user with those decisions. So, you have to generate bytecode before you can execute even the first line of code. So, what we try to do there is keep the bytecode compiler simple, keep the bytecode interpreter simple, so that we get to execute the beginning of the code as soon as possible. If we see that certain functions are being executed many times over, then we call that a hot function, and you can sort of define what's hot. For some purposes, maybe it's a hot function if it gets called more than once, or more than twice, or more than 10 times. For other purposes, you want to be more conservative, and you can say, “Well, it's only hot if it's been called 1000 times.” [12:48] But anyway, for a hot function, you want to do more work. And so, the specializing adaptive compiler, at that point, tries to replace certain bytecodes with bytecodes that are faster, but that work only if the types of the arguments are specific types. A simple but pretty hypothetical example is the plus operator, which in Python can add lots of things. It can add integers, it can add floats, it can add strings, it can add lists or tuples. On the other hand, you can't add an integer to a string, for example. So, what we do there in the optimization step - it's also called quickening, but usually in our context, we call it specializing - is we have a separate binary add integer bytecode. And it's sort of a second-tier bytecode that is hidden from the user. If the user asks for the disassembly of their function, they will never see binary add integer; they will always see just binary add. But what the interpreter sees, once the function has been quickened, may be binary add integer. And binary add integer just assumes that both of its arguments - both the numbers on the stack - are actual Python integer objects. It just reaches directly into those objects to find the values, adds those values together in machine registers, and pushes the result back on the stack.
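To make the specialization idea concrete, here is a toy Python sketch of a single ADD call site that records argument types, quickens itself into an int-only fast path, and guards against type changes - an illustration of the approach, not CPython's actual implementation:

```python
GENERIC_ADD, ADD_INT = "ADD", "ADD_INT"
HOT_THRESHOLD = 8  # one possible definition of "hot"

class AddSite:
    """One ADD bytecode position: records types, then specializes itself."""
    def __init__(self):
        self.op = GENERIC_ADD
        self.calls = 0
        self.only_ints = True

    def execute(self, a, b):
        if self.op == ADD_INT:
            if type(a) is int and type(b) is int:  # the guard: a cheap type check
                return a + b  # in C, this would add machine words directly
            return a + b      # guard failed: fall back to full generic semantics
        # Recording phase: observe argument types while running the slow path.
        self.calls += 1
        self.only_ints = self.only_ints and type(a) is int and type(b) is int
        if self.calls >= HOT_THRESHOLD and self.only_ints:
            self.op = ADD_INT  # quicken / specialize this call site
        return a + b

site = AddSite()
for _ in range(10):
    site.execute(2, 3)
print(site.op)                 # ADD_INT: the site has been specialized
print(site.execute("a", "b"))  # "ab": the guard fails, semantics are preserved
```

The essential property, as described in the episode, is that the specialized path is only ever a performance shortcut: a failed guard falls back to the original behavior rather than raising an error.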
[14:35] Now, there are all sorts of things that make that difficult to do. For example, if the value doesn't fit in a register - the result, or either of the input values - or maybe, even though you expected it was going to be adding two integers, this particular time it's adding an integer and a floating-point number, or maybe even two strings. [15:00] So the first stage of specialization is actually… I'm blanking out on the term, but there is an intermediate step where we record the types of arguments. And during that intermediate step, the bytecode actually executes slightly slower than the default bytecode. But that only happens for a few executions of a function, because then it knows: this place is always called with integers on the stack, this place is always called with strings on the stack, and maybe this place we still don't know, or it's a mixed bag. And so then, for the place where every time it was called during this recording phase it was two integers, we replace it with that binary add integer operation. The binary add integer operation, then, before it reaches into the object, still has to make a type check on the arguments. So, it's not completely free, but a type check can be implemented much faster than a completely generic object-oriented dispatch, like what normally happens for the most generic binary add operations. [16:14] So once we've recorded the types, we specialize based on the types, and the interpreter then puts in guards. So, the interpreter code for the specialized instruction has guards that check whether all the conditions that will make the specialized instruction work are actually met. If one of the conditions is not met, it's not going to fail; it's just going to execute the original bytecode. So, it's going to fall back to the slow path rather than failing. That's an important part, so that you still have the full Python semantics. And it's always possible that a function is called hundreds or millions of times with integer arguments, and then suddenly a piece of data calls it with a floating-point argument, or something worse. And the semantics still say, “Well, then it has to do it the floating-point way.” Utsav: [17:12] It has to deoptimize, in a sense. Guido: [17:14] Yeah. And there are various counters in all the mechanisms where, if you encounter something that fails the guard once, that doesn't deoptimize the whole instruction. But if you keep encountering mismatches of the guards, then eventually the specialized instruction is just deoptimized and we go back to, “Oh, yeah, we'll just do it the slow way, because the slow way is apparently the fastest we can do.” Utsav: [17:45] It's kind of like branch prediction. Guido: [17:47] I wish I knew what went on in modern CPUs when it comes to branch prediction and inline caching, because that is absolute magic. And it's actually one of the things we're up against with this project, because we write code in a way that we think will clearly reduce the number of memory accesses, for example. And when we benchmark it, we find that it runs just as fast as the old unoptimized code, because the CPU figured it out without any of our help. Utsav: [18:20] Yeah. I mean, these techniques - generally you hear about them in the context of a JIT, a Just-In-Time compiler, but y’all are not implementing that right now. Guido: [18:30] JIT - yeah, in our case, it would be a misnomer. What we do expect to eventually be doing is, in addition to specialization, we may be generating machine code.
That's probably going to be well past 3.11, maybe past 3.12. So, the release that we're working on until October next year is going to be 3.11, and that's where the specializing interpreter is going to make its first entry. I don't think that we're going to do anything with machine code unless we get extremely lucky with our results halfway through the year. But eventually, that will be another tier. But I don't know, Just-In-Time compilation has a whole bunch of emotional baggage with it at this point that we're trying to avoid. Utsav: [19:25] Is it baggage from other projects trying it? Guido: [19:29] People assume that Just-In-Time compilation automatically makes all your code better. It turns out that it's not that simple. In our case, compilation is like, “What exactly is it that we compile?” At some point ahead of time, we compile your source code into bytecode. Then we translate the bytecode into specialized bytecode. I mean, everything happens at some point during runtime, so which thing would you call Just-In-Time? Guido: [20:04] So I'm not a big fan of using that term. And it usually makes people think of feats of magical optimization that have been touted by the Java community for a long time. And unfortunately, the magic is often such that you can't actually predict what the performance of your code is going to be. And we have enough of that, for example, with modern CPUs and their fantastic branch prediction. Utsav: [20:35] Speaking of that, I saw that there's also a bunch of small wins y'all spoke about that y’all can use to just improve performance, things like fixing the place of __dict__ in objects and changing the way integers are represented. What is just maybe one interesting story that came out of that? Guido: [20:53] Well, I would say calling Python functions is something that we actually are currently working on. And I have to say that this is not just the Microsoft team, but also other people in the core dev team, who are very excited about this and helping us in many ways. So, the idea is that in the Python interpreter, up to and including version 3.10, which is going to be released next week, actually, whenever you call a Python function, the first thing you do is create a frame object. And a frame object contains a bunch of state that is specific to that call that you're making. So, it points to the code object that represents the function that's being called, it points to the globals, it has space for the local variables of the call, it has space for the arguments, it has space for the anonymous values on the evaluation stack. But the key thing is that it’s still a Python object. And there are some use cases where people actually inspect the Python frame objects, for example, if they want to do weird stuff with local variables. [22:18] Now, if you're a debugger, it makes total sense that you want to actually look at: what are all the local variables in this frame? What are their names? What are their values and types? A debugger may even want to modify a local variable while the code is stopped at a breakpoint. That's all great. But for the execution of most code, most of the time, certainly when you're not using a debugger, there's no reason that that frame needs to be a Python object. Because a Python object has a header, it has a reference count, it has a type, it is allocated as its own small segment of memory on the heap. It's all fairly inefficient.
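For reference, this is the kind of frame introspection that forces frames to be materializable as real Python objects; `sys._getframe` and the `f_code` / `f_locals` attributes are standard CPython APIs:

```python
import sys

def peek_caller():
    # sys._getframe(1) returns the live frame object of our caller;
    # debuggers and logging helpers rely on exactly this kind of access.
    frame = sys._getframe(1)
    return frame.f_code.co_name, dict(frame.f_locals)

def example():
    x, y = 1, "two"
    name, seen = peek_caller()
    print(name, seen)  # example {'x': 1, 'y': 'two'}

example()
```

As long as calls like this can observe a frame, the interpreter has to be able to produce a full frame object on demand, which is what the materialize-on-access scheme described next preserves.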
Also, if you call a function, then you create a few frames; then from that function, you call another function, and all those frame objects end up scattered throughout the entire heap of the program. [23:17] What we have implemented in our version of 3.11, which is currently just the main branch of the CPython repo, is an allocation scheme where, when we call a function, we still create something that holds the frame, but we allocate it in an array of frame structures. So, I can't call them frame objects because they don't have an object header; they don't have a reference count or type; it's just an array of structures. This means that unless that array runs out of space, calls can be slightly faster because you don't jump around on the heap. And allocation is simple: to allocate the next frame, you compare two pointers, and then you bump one counter, and now you have a new frame structure. And so creation, and also deallocation, of frames is faster. Frames are smaller because you don't have the object header. You also don't have the malloc overhead or the garbage collection overhead. And of course, it's backwards incompatible. So, what do we do now? Fortunately, there aren't that many ways that people access frames. And what we do is, when people call an API that returns a frame object, we say, “Okay, well, sure. Here's the frame in our array. Now we're going to allocate an object and we're going to copy some values to the frame object,” and we give that to the Python code. So, you can still introspect it and you can look at the locals as if nothing has changed. [25:04] But most of the time, people don't look at frames. And this is actually an old optimization. I remember that the same idea existed in IronPython. And they did it differently. I think for them, it was a compile-time choice: when the bytecode equivalent in IronPython was generated for a function, it would dynamically make a choice whether to allocate a frame object or just a frame structure for that call. And their big bugaboo was, well, there is a function you can call, sys._getframe, and it just gives you the frame object. So, in the compiler, they were looking at whether you were using the exact name sys._getframe, and then they would say, “Oh, that's _getframe; now we're going to compile you slightly slower so you use a frame object.” We have the advantage that we can just always allocate the frame object on the fly. But we get similar benefits. And oh, yeah, I mentioned that the frame objects are allocated in an array - what happens if that array runs out? Well, it's actually a linked list of arrays. So, we can still create a new array of frames - we have space for 100 or so, which, in many programs, is plenty. And if your call stack is more than 100 deep, we'll just have one discontinuity, but the semantics are still the same and we still have most of the benefits. Utsav: [26:39] Yeah, and maybe as a wrap-up question, there are a bunch of other improvements happening in the Python community for performance as well, right? There's Mypyc, which we're familiar with, which is using Mypy types to compile code to basically speed it up. Are there any other improvements like that that you're excited about, or that you're interested in following? Guido: [27:01] Well, Mypyc is very interesting. It gives a much better performance boost, but only when you fully annotate your code and only when you actually follow the annotations precisely at runtime.
In Mypyc, if you say, “This function takes two integers,” and it returns an integer, then if you call it with something else, it's going to immediately blow up. It'll give you a traceback. But the standard Python semantics are that type annotations are optional, and sometimes they're white lies. And so, the types that you see at runtime may not actually be compatible with the types that were specified in the annotations. And it doesn't affect how your program executes. Unless you start introspecting the annotations, your program runs exactly the same with or without annotations. [28:05] I mean, there are a couple of big holes in the type system, like Any. And the type checker will say, “Oh, if you put Any, everything is going to be fine.” And so, using that, it's very easy to have something be passed an object of an invalid type, and the type checker will never complain about it. And our promise is that the runtime will not complain about it either, unless it really is a runtime error. Obviously, if you're somehow adding an integer to a string at runtime, it's still going to be a problem. But if you have a function that, say, computes the greatest common divisor of two numbers, which is this really cute little loop - if you define the percent operator in just the right way, you can pass in anything. I think there are examples where you can actually pass it two strings, and it will return a string without ever failing. [29:07] And so basically, Mypyc does things like: the instance attributes are always represented in a compact way where there is no dunder __dict__. The best that we can do - and we are working on designing how we're actually going to do this - is make it so that if you don't look at the dunder __dict__ attribute, we don't necessarily have to store the instance attributes in a dictionary, as long as we preserve the exact semantics. But if you use the dunder __dict__ at some later point - again, just like the frame objects - we have to materialize a dictionary. And Mypyc doesn't do that. It's super-fast if you don't use dunder __dict__. If you do use dunder __dict__, it just says, “dunder __dict__ not supported in this case.” [29:59] Mypyc really only compiles a small subset of the Python language. And that's great if that's the subset you're interested in. But I'm sure you can imagine how complex that is in practice for a large program. Utsav: [30:17] It reminds me of JavaScript performance, when everything is working fast and then you use this one function, which you're not supposed to use, to introspect an object or something, and then performance just breaks down. Guido: [30:29] Yeah, that will happen. Utsav: [30:31] But it's still super exciting. And I'm also super thankful that Python fails loudly when you try to add a number to a string, not like JavaScript. Guido: [30:41] Or PHP, or Perl. Utsav: [30:44] But yeah, thank you so much for being a guest. I think this was a lot of fun. And I think it walked through the performance improvements y’all are trying to make in an accessible way. So, I think it’s going to be useful for a lot of people. Yeah, thank you for being a guest. Guido: [30:58] My pleasure. It’s been a fun chat. Get on the email list at www.softwareatscale.dev
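Guido's gcd example is easy to reproduce: the annotated loop below happily accepts any type that implements the percent operator, which is exactly why annotations can be white lies at runtime (the Weird class here is a made-up illustration, not from the episode):

```python
def gcd(a: int, b: int) -> int:
    # The annotations are not enforced at runtime - standard Python
    # semantics let anything with % and truthiness pass through.
    while b:
        a, b = b, a % b
    return a

class Weird:
    """Any type defining % and __bool__ can flow through gcd unharmed."""
    def __init__(self, v): self.v = v
    def __mod__(self, other): return Weird(self.v % other.v)
    def __bool__(self): return bool(self.v)

print(gcd(12, 18))                  # 6
print(gcd(Weird(12), Weird(18)).v)  # also 6 - no runtime complaint at all
```

A type checker would flag the second call, but the interpreter runs it identically with or without the annotations, which is the compatibility constraint Mypyc trades away for speed.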

    Software at Scale 33 - Drone Engineering with Abhay Venkatesh

    Play Episode Listen Later Sep 28, 2021 41:06


Abhay Venkatesh is a Software Engineer at Anduril Industries where he focuses on infrastructure and platform engineering. Apple Podcasts | Spotify | Google Podcasts. We focus this episode on drone engineering - exploring the theme of “If I wanted to start my own technology project/company that manages drones, what technology bits would I need to know?” We discuss the commoditization of drone hardware, the perception stack, testing and release cycles, simulation software, software invariants, defensive software architecture, and wrap up with discussing the business models behind hardware companies. Highlights 1:56 - Are we getting robot cleaners (other than Roomba) anytime soon? 5:00 - What should I do if I want to build a technology project/company that leverages drones? Where should I be innovating? 7:30 - What does the perception stack for a drone look like? 13:30 - Are drones/robots still programmed in C++? How is Rust looked at in their world? 18:30 - What does software development look like for a company that deploys software on drones? What are the testing/release processes like? 20:30 - How are simulations used? Can game engines be used for simulations to test drones? Interestingly - since neural networks perceive objects and images very differently from how brains do it, adapting drone perception to work on a game engine is actually really hard. 26:30 - Drone programming can be similar to client-side app development. But you have to write your own app store/auto-update infrastructure. Testing new releases manually is the largest bottleneck in releases. 30:00 - Defensive programming for drones - how do you ensure safety? What is the base safety layer that needs to be built for a drone? “Return to Base” logic - often separated out into a different CPU. 33:00 - How do hardware businesses look different from traditional SaaS businesses? 38:00 - What are some interesting trends in hardware that Abhay is excited about? Get on the email list at www.softwareatscale.dev
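As a toy illustration of the base safety layer discussed around 30:00 - a heartbeat watchdog that triggers return-to-base when the mission computer goes silent (all names are hypothetical; real systems run this on a separate CPU):

```python
import time

HEARTBEAT_TIMEOUT_S = 2.0  # hypothetical threshold

class SafetyWatchdog:
    """Runs in the safety layer, independent of the mission software."""
    def __init__(self, return_to_base):
        self.return_to_base = return_to_base
        self.last_beat = time.monotonic()

    def heartbeat(self):
        # Called by the mission computer while it is healthy.
        self.last_beat = time.monotonic()

    def tick(self):
        # Called on a fixed timer by the safety layer.
        if time.monotonic() - self.last_beat > HEARTBEAT_TIMEOUT_S:
            self.return_to_base()

watchdog = SafetyWatchdog(return_to_base=lambda: print("RTB engaged"))
watchdog.tick()   # recent heartbeat: nothing happens
time.sleep(2.1)
watchdog.tick()   # heartbeats stopped: RTB engaged
```

The point of separating this onto its own processor, as mentioned in the episode, is that the safety invariant survives even a total crash of the main flight software.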

    Software at Scale 32 - Derrick Stolee: Principal Software Engineer, GitHub

    Play Episode Listen Later Sep 15, 2021 66:41


Derrick Stolee is a Principal Software Engineer at GitHub, where he focuses on the client experience of large Git repositories. Apple Podcasts | Spotify | Google Podcasts. Subscribers might be aware that I’ve done some work on client-side Git in the past, so I was pretty excited for this episode. We discuss the Microsoft Windows and Office repositories’ migrations to Git, recent performance improvements to Git for large monorepos, and more. Highlights (lightly edited) [06:00] Utsav: How and why did you transition from academia to software engineering? Derrick Stolee: I was teaching and doing research at a high level and working with really great people. And I found myself not finding the time to do the work I was doing as a graduate student. I wasn't finding time to do the programming and do these really deep projects. I found that the only time I could find to do that was in the evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then, I had a child, and suddenly my evenings and weekends weren't available for that anymore. And so the individual things I was doing just for myself, that were more programming oriented, fell by the wayside. I found myself a lot less happy with that career. And so I decided, you know what, there are two approaches I could take here. One is I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours. Or I could find another job, and I was going to set out. And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live. This is where Azure DevOps was being built, and they needed someone to help solve some graph problems. So it was really nice that it happened to work out that way. I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry. [21:00] Utsav: What drove the decision to migrate Windows to Git? The Windows repository's move to Git was a big project driven by Brian Harry, who was the CVP of Azure DevOps at the time. Previously, Windows used a source control system called Source Depot, which was a fork of Perforce. No one knew how to use this version control system until they got there and learned on the job, and that caused some friction in terms of onboarding people. But also, if you have people working in the Windows code base for a long time, they only learn this version control system. They don't know Git and they don't know what everyone else is using. And so they're feeling like they're falling behind, and they're not speaking the same language when they talk to somebody else who's working with commonly used version control tools. So they saw this as a way to not only update their source control to a more modern tool but specifically to allow a more free exchange of ideas and understanding. The Windows Git repository is going to be big and have some little tweaks here and there, but at the end of the day, you're just running Git commands and you can go look at StackOverflow to solve questions, as opposed to needing to talk to specific people within the Windows organization about how to use this version control tool. Transcript Utsav Shah: Welcome to another episode of the Software at Scale Podcast; joining me today is Derrick Stolee, who is a principal software engineer at GitHub.
Previously, he was a principal software engineer at Microsoft, and he has a Ph.D. in Mathematics and Computer Science from the University of Nebraska. Welcome. Derrick Stolee: Thanks, happy to be here. Utsav Shah: So a lot of the work that you do on Git, from my understanding, is similar to the work you did in your Ph.D. around graph theory and stuff. So maybe you can just walk through the beginning - what got you interested in graphs and math in general? Derrick Stolee: My love of graph theory came from my first algorithms class in college, my sophomore year, just doing simple things like path-finding algorithms. And I got so excited about it, I started clicking around Wikipedia constantly; I just read every single article I could find on graph theory. So I learned about the four-color theorem, and I learned about different things like cliques, and all sorts of different graphs, the Petersen graph, and I just kept on discovering more. I thought, this is interesting to me, it works well with the way my brain works, and I could just model these things while [unclear 01:32]. And as I kept on doing more graph theory and combinatorics in my junior year for my math major, it was like, I want to pursue this. Instead of going into software, as I had planned with my undergraduate degree, I decided to pursue a Ph.D., first in math; then I moved over to the joint math and CS program, and just worked on very theoretical math problems. But I would always pair that with the fact that I had this programming background and algorithmic background. So I was solving pure math problems using programming, creating these computational experiments - the thing I called it was computational combinatorics. Because I would write these algorithms to help me solve these problems that were hard to reason about, because the cases just became too complicated to hold in your head. But if you could quickly write a program, then over the course of a day of computation you could discover lots of small examples that can either answer it for you or even just give you a more intuitive understanding of the problem you're trying to solve. And that was my specialty as I was working in academia. Utsav Shah: You hear a lot about proofs that are just computer-assisted today, and you could just walk us through - I'm guessing listeners are not math experts - why is that becoming a thing? And just walk through your thesis in super layman terms: what did you do? Derrick Stolee: There are two very different things you can mean when you say you have an automated proof. There are some things like Coq, which are completely formal, machine-checked logic proofs: you specify all the different axioms and the different things you know to be true, and the statement you want to prove, and construct the sequence of proof steps. What I was focused on, more, was taking a combinatorial problem - for instance, do graphs with certain sub-structures exist - and trying to discover those examples using an algorithm that was finely tuned to solve those things. So one problem was called uniquely Kr-saturated graphs. A Kr is essentially a set of r vertices where every single pair is adjacent to each other, and to be saturated means I don't have one inside my graph, but if I add any missing edge, I'll get one. And then the uniquely part was, I'll get exactly one. And now we're at this fine line of: do these things even exist, and can I find some interesting examples?
And so you can just, [unclear 04:03], generate every graph of a certain size, but that blows up in size. And so you end up where you can get maybe to 12 vertices - every graph of up to 12 vertices or so you can just enumerate and test. But to get beyond that, and find the interesting examples, you have to be zooming in on the search space to focus on the examples you're looking for. And so I generated an algorithm that said, well, I know I'm not going to have every edge, so let's fix one pair and say this isn't an edge. And then we find r minus two other vertices and put all the other edges in, and that's the one unique completion of that missing edge. And then let's continue building in that way, by building up all the possible ways you can create those sub-structures, because they need to exist, as opposed to just generating random little bits. And that focused the search space enough that we could get to 20 or 21 vertices and see these interesting shapes show up. From those examples, we found some infinite families and then used regular old-school math to prove that these families were infinite, once we had those small examples to start from. Utsav Shah: That makes a lot of sense. And that tells me a little bit about how someone might use this in a computer science way. When would I need to use this in, let's say, not my day job, but just - what computer science problems would I solve given something like that? Derrick Stolee: That's always the question to ask a mathematician - what are the applications of the theoretical work? But I find that whenever you see yourself dealing with a finite problem, and you want to know the different ways this data can appear, or whether something is possible with some constraints - a lot of things I was running into were similar to problems like integer programming. Trying to find solutions to an integer program is a very general thing, and having those types of tools in your back pocket to solve these problems is extremely beneficial. But integer programming is still NP-hard, so with the wrong data shape it will take an exponential amount of time to work, even though there are a lot of tools that solve most cases when your data isn't structured in a way that triggers that exponential blowup. So knowing where those data shapes can arise and how to take a different approach can be beneficial. Utsav Shah: And you've had a fairly diverse career after this. I'm curious, what was the transition from doing this stuff to Git and developer tools? How did that end up happening? Derrick Stolee: I was lucky enough that after my Ph.D. was complete, I landed a tenure-track job in a math and computer science department, where I was teaching and doing research at a high level and working with great people. I had the best possible academic workgroup I could ask for, doing interesting stuff, working with graduate students. And I found myself not finding the time to do the work I was doing as a graduate student; I wasn't finding time to do the programming and do these deep projects I wanted. I had a lot of interesting math projects, I was collaborating with a lot of people, I was doing a lot of teaching. But I was finding that the only time I could find to do that was in evenings and weekends, because that's when other people weren't working who could collaborate with me on their projects and move those projects forward. And then I had a child, and suddenly my evenings and weekends weren't available for that anymore.
And so the individual things I was doing just for myself, that were more programming oriented, fell by the wayside, and I found myself a lot less happy with that career. And so I decided there are two approaches I could take here: one, I could spend the next year or two winding down my collaborations and spinning up more of this time to be working on my own during regular work hours, or I could find another job. And I was going to set out, but let's face it, my spouse is also an academic, and she had an opportunity to move to a new institution, and that happened to be soon after I made this decision. And so I said, great, let's not do the two-body problem anymore; you take this job, and we move right in between semesters, during the Christmas break. And I said, I will find my job, I will go and I will try to find a programming job; hopefully, someone will be interested. And I lucked out that Microsoft has an office here in Raleigh, North Carolina, where we now live, and it happened to be the place where what is now known as Azure DevOps was being built. And they needed someone to help solve some graph theory problems in the Git space. So it was nice that it happened to work out that way, and I know for a fact that they took a chance on me because of their particular need. I didn't have significant professional experience in the industry. I just said, I did academics, so I'm smart, and I did programming as part of my job, but it was always just for myself. So, I came with a lot of humility, saying, I know I'm going to have to learn to work with a team in a professional setting. I did teamwork in undergrad, but it's been a while. So I just came in trying to learn as much as I can, as quickly as I can, and contribute in this very specific area they wanted me to go into. And it turns out the area they needed was to revamp the way Azure Repos computed Git commit history, which is a graph theory problem. The thing that was interesting about that is that the previous solution did everything in SQL: when you created a new commit, it would say, what is your parent? Let me take its commit history out of SQL, then add this new commit, and then put that back into SQL. It took essentially a SQL table of commit IDs and squashed it into a varbinary(max) column of a table, which ended up growing quadratically. And also, if you had a merge commit, it would have to take both parents and merge their histories, in a way that never matched what Git log was saying. And so it was technically interesting that they were able to do this at all with SQL before I came. But we needed to have the graph data structure available; we needed to compute dynamically by walking commits and finding out how these things relate, which led to creating a serialized commit-graph, which had those topological relationships encoded concisely in a data file. That file would be read into memory, and very quickly we could operate on it and do things in topologically sorted order. And we could do interesting file history operations on that instead of the database, and by deleting those database entries that were growing quadratically, we saved something like 83 gigabytes just on the one server that was hosting the Azure DevOps code. And so it was great to see that come to fruition. Utsav Shah: First of all, that's such an inspiring story - that you could get into this, and then they gave you a chance as well. Did you reach out to a manager? Did you apply online?
I'm just curious how that ended up working. Derrick Stolee: I do need to say I had a lot of luck and privilege going into this, because I applied and waited a month and didn't hear anything. I had applied to the same group and said, here's my cover letter - I heard nothing. But then I have a friend from undergrad, who was one of the first people I knew to work at Microsoft. And I knew he worked on the Visual Studio client, and I said, well, this thing that's now Azure DevOps was called Visual Studio Online at the time - do you know anybody from this Visual Studio Online group? I've applied there, haven't heard anything; I'd love it if you could get my resume to the top of the list. And it turns out that he had worked with somebody who had done the Git integration in Visual Studio, who happened to be located at this office, who then got my name to the top of the pile. And then that got me to the point where I was having a conversation with who would be my skip-level manager, who honestly had a conversation with me to try to suss out: am I going to be a good team player? There's not a good history of PhDs working well with engineers, probably because they just want to do their academic work and work in their space. I remember one particular question: sometimes we ship software, and before we do that, we all get together, and everyone spends an entire day trying to find bugs, and then we spend a couple of weeks trying to fix them - they call it a bug bash - is that something you're interested in doing? I'm 100% wanting to be a good citizen, a good team member; I am up for that. If that's what it takes to be a good software engineer, I will do it. I could sense the hesitation and the trepidation about looking at me more closely, but overall, once I got into the interview - they were still doing whiteboard interviews at that time - it almost felt unfair, because my phone screen interview was a problem I had assigned my C programming students as homework. So it's like, sure, you want to ask me this? I have a little bit of experience doing problems like this. So I was eager to show up and prove myself. I know I made some very junior mistakes at the beginning, just: what's it like to work on a team? What's it like to check in a change and commit that pull request at 5 pm, and then go and get in your car and go home, and realize when you are out there that you had a problem and you've caused the build to go red? Oh, no, don't do that. So I had those mistakes, but I only needed to learn them once. Utsav Shah: That's amazing. And going to your second point around [inaudible 14:17] Git commit history and storing all of that in SQL - we also had to deal with an extremely similar problem, because we maintain a custom CI server, and we tried doing Git [inaudible 14:26] and tried to implement that on our own, and that did not turn out well. So maybe you can walk listeners through: why is it so tricky to say, is this commit before another commit, is it after another commit, what's the parent of this commit? What's going on, I guess?
And so not only are we going deep in time in terms of you just think about the first parent history is all the merge all the pull requests that have merged in that time. But imagine that you're also traversing all of the commits that were in the topic branches of those merges and so you go both deep and wide when you're doing this search. And by default, Git is storing all of these commits as just plain text objects, in their object database, you look it up by its Commit SHA, and then you go find that location in a pack file, you decompress it, you go parse the text file to find out the different information about, what's its author-date, committer date, what are its parents, and then go find them again, and keep iterating through that. And it's a very expensive operation on these orders of commits and especially when it says the answer's no, it's not reachable, you have to walk every single possible commit that is reachable before you can say no. And both of those things cause significant delays in trying to answer these questions, which was part of the reason for the commit-graph file. First again, it was started when I was doing Azure DevOps server work but it's now something it's a good client feature, first, it avoids that going through to the pack file, and loading this plain text document, you have to decompress and parse by just saying, I've got it well-structured information, that tells me where in the commit-graph files the next one. So I don't have to store the whole object ID, I just have a little four-byte integer, my parent is this one in this table of data, and you can jump quickly between them. And then the other benefit is, we can store extra data that are not native to the commit object itself, and specifically, this is called generation number. The generation number is saying, if I don't have any parents, my generation number is one, so I'm at level one. But if I have parents, I'm going to have one larger number than the maximum most parents, so if I have one parent is; one, now two, and then three, if I merge, and I've got four and five, I'm going to be six. And what that allows me to do is that if I see two commits, and one is generation number 10, and one is 11, then the one with generation number 10, can't reach the one with 11 because that means an edge would go in the wrong direction. It also means that if I'm looking for the one with the 11, and I started at 20, I can stop when I hit commits that hit alright 10. So this gives us extra ways of visiting fewer commits to solve these questions.Utsav Shah: So maybe a basic question, why does the system care about what the parents of a commit are why does that end up mattering so much?Derek Stolee: Yes, it matters for a lot of reasons. One is if you just want to go through the history of what changes have happened to my repository, specifically File History, the way to get them in order is not you to say, give me all the commits that changed, and then we sort them by date because the commit date can be completely manufactured. And maybe something that was committed later emerged earlier, that's something else. And so by understanding those relationships of where the parents are, you can realize, this thing was committed earlier, it landed in the default branch later and I can see that by the way that the commits are structured to these parent relationships. And a lot of problems we see with people saying, where did my change go, or what happened here, it's because somebody did a weird merge. 
And you can only find that weird merge by doing some interesting things with git log, to say, this merge caused a problem and caused your file history to get mixed up, and somebody resolved the merge incorrectly, causing this problem where somebody's change got erased, and you need to use these parent relationships to discover that.Utsav Shah: Should everybody just be using rebase versus merge, what's your opinion?Derek Stolee: My opinion is that you should use rebase to make sure that the commits that you are trying to get reviewed by your coworkers are as clear as possible. Present a story, tell me that your commits are good, tell me in the commit messages why you're trying to do this one small change, and how the sequence of commits creates a beautiful story that tells me how I get from point A to point B. And then you merge it into your branch with everyone else's, and then those commits are locked; you can't change them anymore. You do not rebase them, you do not edit them, now they're locked in. And the benefit of doing that is, I can present this best story that not only is good for the people who are reviewing it at the moment, but also when I go back in history and say, why did I change it that way? You've got all the reasoning right there. But then also you can do things like git log --first-parent to just show me which pull requests were merged against this branch. And that's it, I don't see people's individual commits; I see this one was merged, this one was merged, this one was merged, and I can see the sequence of those events, and that's the most valuable thing to see.Utsav Shah: Interesting. And then a lot of GitHub workflows just squash all of your commits into one, which I think is the default, or at least a lot of people use that; any opinions on that? Because I know the Git workflow for development does the whole separate-by-commits thing and then merges all of them. Do you have an opinion just on that?Derek Stolee: Squash merges can be beneficial; the thing to keep in mind is that it's typically beneficial for people who don't know how to do interactive rebase. So their topic branch looks like a lot of random commits that don't make a lot of sense: I tried this, then I took a break, then I fixed a bug, and I kept going forward, I'm responding to feedback, and that's what it looks like. If those commits aren't going to be helpful to you in the future to diagnose what's going on, and you'd rather just say this pull request is the unit of change, the squash merge is fine, it's fine to do that. The thing I find problematic is that new users don't realize that they need to change their branch to be based on that squash merge before they continue working. Otherwise, they'll bring in those commits again, and their pull request will look very strange. So there are some unnatural bits to using squash merge that require people to say, let me just start over from the main branch again to do my next work. And if you don't remember to do that, it's confusing.Utsav Shah: Yes, that makes a lot of sense. So going back to your story: you started working on improving Git interactions in Azure DevOps. When did the whole idea of let's move the Windows repository to Git begin, and how did that evolve?Derek Stolee: Well, the biggest thing is that the Windows repository moving to Git was decided before I came; it was a big project by Brian Harry, who was the CVP of Azure DevOps at the time.
Windows was using this source control system called Source Depot, which was a literal fork of Perforce, and no one knew how to use it until they got there and learned on the job. And that caused some friction in terms of, well, onboarding people is difficult. But also, if you have people working in the Windows codebase for a long time, they learn this version control system, they don't know what everyone else is using, and so they feel like they're falling behind, and they're not speaking the same language as when they talk to somebody else who's working in the version control that most people are using these days. So they saw this as a way to not only update their source control to a more modern tool, but specifically Git, because it allowed a more free exchange of ideas and understanding. It's going to be a monorepo, it's going to be big, it's going to have some little tweaks here and there, but at the end of the day, you're just running Git commands and you can go look on Stack Overflow for how to solve your Git questions, as opposed to needing to talk to specific people within the Windows organization about how to use this tool. So that, as far as I understand, was a big part of the motivation to get it working. When I joined the team, we were in the swing of let's make sure that our Git implementation scales. And the thing that's special about Azure DevOps is that it doesn't use the core Git codebase; it has a complete reimplementation of the server side of Git in C#. So it was rebuilding a lot of things to just be able to do the core features, but in a way that worked in its deployment environment, and it had done a pretty good job of handling scale. But the Linux repo was still a challenge to host. At that time, it had half a million commits, maybe 700,000 commits, and its number of files is rather small. But we were struggling, especially with the commit history being so deep. And even the [inaudible 24:24] DevOps repo, with maybe 200 or 300 engineers working on it in their daily work, was moving at a pace that was difficult to keep up with. So those scale targets were things we were dealing with daily and working to improve, and we could see that improvement in our daily lives as we were moving forward.Utsav Shah: So how do you tackle the problem? You're on this team now, and you know that we want to improve the scale of this, because 2,000 developers are going to be using this repository; we have 200 or 300 people now, and it's already not perfect. My first impression is you sit and you start profiling code and you understand what's going wrong. What did you all do?Derek Stolee: You're right about the profiler. We had a tool, I forget what it's called, but on every 10th request selected at random, it would run a .NET profiler and save those traces into a place where we could download them. And so we can say, you know what, Git commit history is slow, and now that we've written it in C# as opposed to SQL, it's C#'s fault. Let's go see what's going on there and see if we can identify the hotspots: you pull a few of those traces down and see what's identified. And a lot of it was chasing that: I made this change, let's make sure that the timings are an improvement; I see some outliers over here, they're still problematic; we find those traces and are able to identify the core parts to change.
Some of them were more philosophical: we need to change data structures, we need to introduce things like generation numbers, we need to introduce things like Bloom filters for file history, in order to speed that up, because we're spending too much time parsing commits and trees. And once we were that far, it was time to essentially say, let's assess whether or not we can handle the Windows repo. I think it would have been January, February 2017; my team was tasked with doing scale testing in production. They had the full Azure DevOps server ready to go that had the Windows source code in it; it didn't have developers using it, but it was a copy of the Windows source code, and they were using that same server for work item tracking; they had already transitioned their work item tracking to Azure Boards. And they said, go and see if you can make this fall over in production; that's the only way to tell if it's going to work or not. So a few of us got together and created a bunch of things to use the REST API, and we were pretty confident that the Git operations were going to work, because we had a caching layer in front of the server that was going to absorb that. And so we went with the idea of: let's go through the REST API, make a few changes, create a pull request and merge it, and go through that cycle. We started by measuring how often developers would do that, for instance, in Azure DevOps, and then scaled it up to see where it would break, and we crashed the job agents because we found a bottleneck. It turns out that we were using libgit2 to do merges, and that required going into native code, because it's a C library, and we couldn't have too many of those running, because they each took a gig of memory. So once this native code was running out of memory, things were crashing, and we ended up having to put a limit on that. But that was the only fallout, and we could then say, we're ready to bring it on and start transitioning people over. And when users are in the product, and they think certain things are rough or difficult, we can address them; but right now, they're not going to cause a server problem. So let's bring it on. And I think it was a few months later that they started bringing developers from Source Depot into Git.Utsav Shah: So it sounds like there was some server work to make sure that the server doesn't crash, but the majority of the work you had to focus on was inside Git. Does that sound accurate?Derek Stolee: Before my time, and in parallel with my time, was the creation of what's now called VFS for Git. It was GVFS at the time; we realized, don't let engineers name things, they won't do it well, so we renamed it to VFS for Git. It's a virtual file system for Git, a lot of [inaudible 28:44], because the Source Depot version that Windows was using had a virtualized file system in it, to allow people to only download the portion of the working tree that they needed. And they could build whatever part they were in, and it would dynamically discover what files you need to run that build. And so we did the same thing on the Git side, which was: let's modify the Git client in some slight ways, using our fork of Git, to think that all the files are there.
And then when a file is [inaudible 29:26], we hook it through a file system event; it communicates to the .NET process that says, you want that file, and it goes and downloads it from the Git server, puts it on disk, and tells you what its contents are, and now you can place it. So it's dynamically downloading objects. This required a different approach to the protocol, which we call the GVFS protocol; it's essentially an early version of what's now called partial clone in Git. It says: you can go get the commits and trees, that's what you need to be able to do most of your work, but when you need the file contents, the blob of a file, we can download that as necessary and populate it on your disk. The distinct thing is that virtualized piece, the idea that if you just run ls at the root directory, it looks like all the files are there. And that causes some problems if you're not used to it. For instance, if you open VS Code in the root of your Windows source code, it will populate everything, because VS Code starts crawling and trying to figure out, I want to do searching and indexing, and I want to find out what's there. But the Windows developers were used to this; they had this already as a problem, so they were used to using tools that didn't do that. We found that out when we started saying, VFS for Git is this thing that Windows is using, maybe you could use it too. Well, this was working great, then I opened VS Code, or I ran grep, or some other tool came in and decided to scan everything, and now I'm slow again, because I have absolutely every file in my monorepo in my working directory for real. So that led to some concerns that this wasn't necessarily the best way to go. But it did, specifically with that GVFS protocol, solve a lot of the scale issues, because we could stick another layer of servers closely located to the developers. For instance, if we have a lab of build machines, let's put one of these cache servers in there, so the build machines all fetch from there, and there you have quick throughput and small latency, and they don't have to bug the origin server for anything but the refs. You do the same thing around the developers; that solved a lot of our scale problems, because you don't have these thundering herds of machines coming in and asking for all the data all at once.Utsav Shah: We had a super similar concept of repository mirrors that would be listening to some change stream; every time anything changed in a region, it would run a fetch on all the servers. So it's remarkable how similar the problems we're thinking about are. One thing that I was thinking about: so VFS for Git makes sense; what's the origin of the FS Monitor story? For listeners, FS Monitor is the file system monitor in Git that decides whether files have changed or not without running [inaudible 32:08] that lists every single file. How did that come about?Derek Stolee: There are two sides to the story. One is that as we were building all these features custom for VFS for Git, we were doing it inside the microsoft/git fork on GitHub, working in the open. So you can see all the changes we're making, it's all GPL. But we were making changes in ways that were going fast, and we were not contributing to upstream Git, to the core Git project.
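The on-demand download Derek describes above is easy to caricature. Here is a toy Python sketch of a lazy object store in the spirit of the GVFS protocol and partial clone; the class, the fake server, and the object IDs are all hypothetical, and real Git speaks a very different wire protocol.

    # Toy lazy object store: commits and trees are assumed local; blob
    # contents are fetched from a server only on first access, then kept.
    class LazyObjectStore:
        def __init__(self, fetch_from_server):
            self.local = {}                  # oid -> bytes already on disk
            self.fetch_from_server = fetch_from_server

        def get_blob(self, oid):
            if oid not in self.local:
                # Miss: ask the nearest server (maybe a co-located cache
                # server, so the origin only ever serves refs and misses).
                self.local[oid] = self.fetch_from_server(oid)
            return self.local[oid]

    # Stand-in for a cache server near a lab of build machines.
    FAKE_SERVER = {"abc123": b"print('hello')\n"}

    store = LazyObjectStore(lambda oid: FAKE_SERVER[oid])
    print(store.get_blob("abc123"))  # downloaded on first access
    print(store.get_blob("abc123"))  # served from local disk after that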
Because of the way VFS for Git works, we have this process that's always running, watching the file system and getting all of its events, and it made sense to say, well, we can speed up certain Git operations, because we don't need to go looking for things. We don't want to run a bunch of lstat calls, because that will trigger the download of objects. So we defer to that process to tell me what files have been updated, what's new, and I created the idea of what's now called FS Monitor. And people who had built that tool for VFS for Git contributed a version of it upstream that used Facebook's Watchman tool through a hook. So it created this hook called the fsmonitor hook; it would say, tell me what's been updated since the last time I checked, and Watchman, or whatever tool is on the other side, would say, here's the small list of files that have been modified. You don't have to go walking all of the hundreds of thousands of files, because you just changed these [inaudible 0:33:34]. And the Git command could store that and be fast to do things like git status and git add. So that was something that was contributed mostly out of the goodness of their hearts: we want to have this idea, this worked well in VFS for Git, we think it can work well for other people in regular Git, so here we go, contributing it and getting it in. It became much more important to us in particular when we started supporting the Office monorepo, because they had a similar situation, moving from their version of Source Depot into Git, and they thought VFS for Git was just going to work.The issue is that Office also has tools that they build for iOS and macOS. So they have developers who are on macOS, and the team had started by building a similar file system virtualization for macOS using kernel extensions, and was very far along in the process when Apple said, we're deprecating kernel extensions, you can't do that anymore. If you're someone like Dropbox, go use this thing; if you're something else, use this other thing. And we tried both of those things, and neither of them worked in this scenario; they're either too slow, or they're not consistent enough. For instance, if you're in Dropbox, and you say, I want to populate my files dynamically as people ask for them, the way that Dropbox and OneDrive now do that, the operating system may decide, I'm going to delete this content because the disk is getting too big; you don't need it, because you can just get it from the remote again. That inconsistency was something we couldn't handle, because we needed to know that content, once downloaded, was there. And so we were at a crossroads, not knowing where to go. But then we decided, let's do an alternative approach: let's look at how the Office monorepo is different from the Windows monorepo. And it turns out that they had a very componentized build system, where if you wanted to build Word, you knew what you needed to build Word: you didn't need the Excel code, you didn't need the PowerPoint code, you needed the Word code and some common bits for all the clients of Microsoft Office. And this was ingrained in their project system. So if you know that in advance, could you just tell Git, these are the files I need to do my work and to do my build? And that's what they were doing in their version of Source Depot: they weren't using a virtualized file system, they were just enlisting in the projects they care about.
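The fsmonitor hook contract Derek describes above ("tell me what's been updated since the last time I checked") can be sketched in a few lines. This toy Python model stands in for Watchman or Git's builtin daemon; the class and token scheme are invented for illustration and differ from the real hook protocol.

    import itertools

    # Toy stand-in for Watchman or Git's builtin fsmonitor daemon.
    class FileWatcher:
        def __init__(self):
            self.clock = itertools.count(1)
            self.events = []  # (tick, path) pairs, in arrival order

        def record_change(self, path):
            self.events.append((next(self.clock), path))

        def changes_since(self, token):
            # Answer the hook's question: what changed after `token`?
            # Returns a new token plus the (small) list of dirty paths.
            now = self.events[-1][0] if self.events else 0
            changed = sorted({p for t, p in self.events if t > token})
            return now, changed

    watcher = FileWatcher()
    watcher.record_change("src/main.c")
    watcher.record_change("docs/README.md")

    token, dirty = watcher.changes_since(0)  # first query: everything so far
    print(dirty)   # ['docs/README.md', 'src/main.c']
    _, dirty = watcher.changes_since(token)
    print(dirty)   # [] -- no need to walk hundreds of thousands of files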
So when some of those Office developers were moving to Git with VFS for Git, they were confused: why do I see so many directories? I don't need them. So we decided to make a new way of taking all the good bits from VFS for Git, like the GVFS protocol that allowed us to do the reduced downloads, but instead of a virtualized file system, to use sparse checkout, which is a Git feature that allows you to say: tell Git, only give me the files within these directories, and ignore everything outside. And that gives us the same benefit of working in a smaller working directory, rather than the whole thing, without needing this virtualized file system. But now we need that file system monitor hook we added earlier, because if I still have 200,000 files on my disk, and I edit a dozen, I don't want to walk all 200,000 to find those dozen. And so the file system monitor became top of mind for us, particularly because we want to support Windows developers, and Windows process creation is expensive, especially compared to Linux; Linux process creation is super fast. So having a hook run, that then does some shell script stuff to communicate with another process and come back, just that process creation, even if it didn't have to do anything, was expensive enough to say we should remove the hook from this equation. And also, there are some things that Watchman does that we don't like and that aren't specific enough to Git, so let's make a version of the file system monitor that is built into Git. That's what my colleague Jeff Hostetler is working on right now. It's getting reviewed in the core Git client right now, and it's available in Git for Windows if you want to try it, because the Git for Windows maintainer is also on my team, and so we can get an early version in there. But we want to make sure this is available to all Git users. There's an implementation for Windows and macOS, and it's possible to build one for Linux, we just haven't included it in this first version. And our target is to remove that overhead. I know that you at Dropbox had a blog post where you got a huge speedup just by replacing the Perl script hook with a Rust hook, is that correct?Utsav Shah: With a Go hook, yes, but eventually we replaced it with the Rust one.Derek Stolee: Excellent. And also, you did some contributions to help make this hook system a little bit better and fix a few bugs. Utsav Shah: I think yes, one or two bugs, and it took me a few months of digging and figuring out what exactly was going wrong. It turned out there's this one environment variable, which you added to skip process creation; we just had to make sure Git refreshed the untracked caches, which you or somebody else added. We just forced that environment variable to be true to make sure we cache every time you run git status, so subsequent git statuses are not slow, and things worked out great. So we just ended up shipping a wrapper that turned on the environment variable, and things worked amazingly well. That was so long ago. How long does this process creation take on Windows? I guess that's one question that I have had for you for a while; do you know what was slow about creating processes on Windows?Derek Stolee: Well, I know that there are a bunch of permission things that Windows does; it has many checks about whether you can create a process of this kind and what elevation privileges you exactly have.
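Sparse checkout's cone mode, which comes up again later in this conversation, is simple to model. A minimal Python sketch with made-up paths; real cone mode also keeps the files in parent directories leading down to each cone, which this toy version ignores.

    # Toy cone-mode matcher: keep root-level files plus everything under
    # the chosen directories; leave the rest out of the working directory.
    def in_cone(path, cone_dirs):
        if "/" not in path:
            return True  # cone mode always keeps files at the root
        return any(path.startswith(d + "/") for d in cone_dirs)

    worktree = ["README.md", "word/build.py", "excel/engine.py",
                "powerpoint/deck.py", "shared/util.py"]
    cones = ["word", "shared"]  # "I build Word: Word plus the common bits"

    print([p for p in worktree if in_cone(p, cones)])
    # ['README.md', 'word/build.py', 'shared/util.py']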
And there are a lot of checks like that that have built up in Windows process creation, because Windows is very much about maintaining backward compatibility with a lot of these security sorts of things. So I don't know all the details; I do know that it's something on the order of 100 milliseconds. So it's not something to scoff at, and it's also something that Git for Windows, in particular, has difficulty with, because it has a bunch of translation layers to take this tool that was built for a Unix environment, with dependencies on things like shell and Python and Perl, and make sure that it can work in that environment. That is an extra cost that Git for Windows needs to pay over even a normal Windows process. Utsav Shah: Yes, that makes a lot of sense. And maybe some numbers, I don't know how much you can share: how big was the Windows or the Office monorepo when you all decided to move from Source Depot to Git? What are we talking about here?Derek Stolee: The biggest numbers we think about are, how many files do I have if I don't do anything, I just check out the default branch, and I ask, how many files are there? And I believe the Windows repository was somewhere around 3 million, and the uncompressed data was something like 300 gigabytes for those 3 million files. I don't know what the full size is for Office, but it is 2 million files at head. So definitely a large project. They did their homework in terms of removing large binaries from the repository, so they're not big because of that; it's not like Git LFS is going to be the solution for them. They have mostly source code and small files that are not the reason for their growth. The reason for their growth is they have so many files, and they have so many developers moving that code around and adding commits and collaborating, that it's just going to get big no matter what you do. And at one point, the Windows monorepo had 110 million Git objects, and I think over 12 million of those were commits, partly because they had some build machinery that would commit 40 times during its build. So they reined that in, and they decided to do a history cut and start from scratch, and now it's not moving nearly as quickly, but it's still a very similar size, so they've got more runway.Utsav Shah: Yes, maybe just for comparison for listeners: the numbers I remember from 2018, the biggest repository that was open-source and had people contributing to Git performance for it was Chromium. And I remember Chromium being roughly 300,000 files, and there were a couple of Chromium engineers contributing to Git performance. So this is just one order of magnitude bigger than that, 3 million files; I don't think there are a lot of people moving such a large repository around, especially with that kind of history, with 12 million commits, it's just a lot. What was the reaction, I guess, of the open-source community, the maintainers of Git, when you decided to help out? Did you have a conversation to start with, or were they just super excited when you reached out on the mailing list? What happened?Derek Stolee: So for full context, I switched over to working on the client side and contributing to upstream Git kind of after all of the VFS for Git work was announced and released as open-source software.
And so I can only gauge what I saw from people afterward and people I've come to know since then, but the general reaction was: yes, it's great that you can do this, but if you had contributed to Git, everyone would benefit. And part of it was, the initial plan wasn't ever to open-source it; the goal was to make this work for Windows, and if that's the only group that ever used it, that was a success. And it turns out we could maybe try to sell it: because we can host the Windows source code, we can handle your source code. That was kind of a marketing point for Azure Repos, and there was a big push to put this out in the world. But then, well, it also needs this custom thing that's only on Azure Repos, and we created it with our own opinions that wouldn't be up to snuff with the Git project. And so things like FS Monitor and partial clone are direct contributions from Microsoft engineers at the time who were saying, here's a way to contribute the ideas that made VFS for Git work back to Git. That was an ongoing effort to try to bring it back, but it kind of started after the fact: hey, we are going to contribute these ideas, but at first, we needed to ship something. So we shipped something without working with the community. But I think that over the last few years, especially with the way we've shifted our strategy to do sparse checkout things with the Office monorepo, we've much more been able to align with the things we want to build: we can build them for upstream Git first, and then we can benefit from them, and then we don't have to build it twice. And then we don't have to do something special that's only for our internal teams, which, again, once they learn that thing, it's different from what everyone else is doing, and we have that same problem again. So right now, the things Office is depending on are sparse checkout; yes, they're using the GVFS protocol, but to them, you can just call it partial clone, and it's going to be the same from their perspective. And in fact, the way we've integrated it for them is that we've gone underneath the partial clone machinery from upstream Git and just taught it to do the GVFS protocol. So we're much more aligned, and because we know things are working for Office, upstream Git is much more suited to be able to handle this kind of scale.Utsav Shah: That makes a ton of sense, and given that, it seems like the community wanted you to contribute these features back. And that's just so refreshing; you want to help out someone. I don't know if you've heard of those stories of people trying to contribute to Git: Facebook has this famous story of trying to contribute to Git a long time ago, not being successful, and choosing to go with Mercurial. I'm happy to see that finally, we could add all of these nice things to Git.Derek Stolee: And I should give credit to the maintainer, Junio Hamano, and people who are now my colleagues at GitHub, like Jeff King (Peff), and also other Git contributors at companies like Google, who took time out of their day to help us learn what it's like to be a Git contributor, and not just open source generally, because merging pull requests on GitHub is a completely different thing than working on the Git mailing list and contributing patch sets via email.
And so, learning how to do that, and also, the level of quality expected is so high. So, how can we navigate that space as new contributors, who have a lot of ideas and are motivated to do this good work? We needed to get over a hump of, let's get into this community and establish ourselves as being good citizens and trying to do the right thing.Utsav Shah: And maybe one more selfish question from my side. One thing that I think Git could use is some kind of plugin system, where today, if somebody checks PII into our repository, into the main branch, from my understanding, it's extremely hard to get rid of that without doing a full rewrite; some kind of plugins for companies where they can rewrite stuff or hide stuff on servers. Does GitHub have something like that?Derek Stolee: I'm not aware of anything on the GitHub or Microsoft side for that. We generally try to avoid it by doing pre-receive hooks: when you push, we will reject it for some reason, if we can. Otherwise, it's on you to clear up the data. Part of that is because we want to make sure that we are maintaining repositories that are still valid, that are not going to be missing objects. I know that Google's source control tool, Gerrit, has a way to obliterate these objects, and I'm not exactly sure how it works: the Git clients are fetching and cloning, and they say, I don't have this object, and it'll complain, but I don't know how they get around that. And with the distributed nature of Git, it's hard to say that the Git project should take on something like that, because it is centralizing things to such a degree that you have to say: yes, you didn't send me all the objects you said you were going to, but I'll trust you anyway. That trust boundary is something Git is cautious to violate. Utsav Shah: Yes, that makes sense. And now to the non-selfish questions: maybe you can walk listeners through, why does Git need Bloom filters internally?Derek Stolee: Sure. So let's think about commit history, specifically file history. Say you're in a Java repo, a repo that uses the Java programming language, and your directory structure mimics your namespace, so if you want to get to your code, you go down five directories before you find your code file. Now in Git, that's represented as: I have my commit, then I have my root tree, which describes the root of my working directory, and then for each of those directories I have another tree object, tree object, tree object, and then finally my file. And so when we want to do a history query, say, what commits have changed this file, I go to my first commit, and I say, let's compare it to its parent, and I go to the root trees. Well, they're different, okay, they're different. Let me open them up, find out which tree object they have at that first portion of the path, and see if those are different. They're different, let me keep going, and you go all the way down these five levels. You've opened up ten trees in this diff to parse these things, and if those trees are big, that's expensive to do. And at the end, you might find out, wait a minute, the blobs are identical way down here, but I had to do all that work to find out. Now multiply that by a million, and to find out that this file was changed 10 times in the history of a million commits, you have to do a ton of work to parse all of those trees.
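To see why that walk is expensive, here is a toy Python model of the nested tree comparison Derek just described; the dict-based "trees" and blob labels are stand-ins for real Git objects.

    # Toy model: trees are nested dicts, blobs are strings. To decide if
    # a commit changed a deep path, compare trees level by level; each
    # level is another tree object Git must load and parse.
    def changed_path(tree_a, tree_b, path):
        for part in path.split("/"):
            if tree_a == tree_b:
                return False  # identical subtrees: nothing below changed
            tree_a = tree_a.get(part) if isinstance(tree_a, dict) else None
            tree_b = tree_b.get(part) if isinstance(tree_b, dict) else None
        return tree_a != tree_b

    parent = {"com": {"example": {"app": {"Main.java": "blob-v1"}}}}
    commit = {"com": {"example": {"app": {"Main.java": "blob-v2"}}}}

    print(changed_path(parent, commit, "com/example/app/Main.java"))  # True
    # Now imagine repeating this, five tree loads per commit, across a
    # million commits, when only ten of them actually touched the file.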
So the Bloom filters come in as a way to say, can we guarantee, most of the time, that these commits did not change that path? We expect that most commits did not change the path you're looking for. So what we do is inject it into the commit-graph file, because that gives us a quick way to index it: I'm at a commit at a position in the commit-graph file, and I can find where its Bloom filter data is. And the Bloom filter stores which paths were changed by that commit. A Bloom filter is what's called a probabilistic data structure: it doesn't list those paths, which would be expensive. If I actually listed every single path that changed at every commit, I would have this sort of quadratic growth again, and my data would be in the gigabytes, even for a small repo. But with the Bloom filter, I only need 10 bits per path, so it's compact. The thing we sacrifice is that sometimes it says yes to a path where the answer is no; but the critical thing is, if it says no, you can be sure it's no, and its false-positive rate is 2% at the compression settings we're using. So if I think about the history of my million commits, for 98% of them this Bloom filter will say no, it didn't change that path, so I can immediately go to my next parent and say this commit isn't important, let's move on, without parsing any trees. For the other 2%, I still have to go and parse the trees, and the 10 commits that did change it will say yes. So I'll parse them and get the right answer, but we've significantly reduced the amount of work we had to do to answer that query. And it's important when you're in these big monorepos, because you have so many commits that didn't touch the file, and you need to be able to isolate them.Utsav Shah: At what number of files, I guess; because for the size of files, the thing you mentioned, you can just use LFS for that, which should solve a lot of the problems; the number of files is the real problem. At what number of files do I have to start thinking, okay, I want to use these Git features like sparse checkout and the commit-graphs and so on? Have you noticed a tipping point like that?Derek Stolee: Yes, there are some tipping points, but it's all about whether you can take advantage of the different features. To start, I can tell you that if you have a recent version of Git, say from the last year, you can go to whatever repository you want and run git maintenance start. Just do that in every [inaudible 52:48] of moderate size, and that's going to enable background maintenance. It's going to turn off auto GC, because it's going to run maintenance on a regular schedule; it'll do things like fetch for you in the background, so that when you run git fetch, it just updates the refs and it's really fast; but it also keeps your commit-graph up to date. Now, by default, that doesn't contain the Bloom filters, because Bloom filters are an extra data cost, and most clients don't need them, because you're not doing these deep queries that you need to do at web scale, like the GitHub server. The GitHub server does generate those Bloom filters, so when you do a file history query on GitHub, it's fast. But it does give you that commit-graph, so you can do things like git log --graph fast.
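Here is a toy Bloom filter in Python to illustrate the property Derek describes: a "no" answer is guaranteed, a "yes" is only probable. The sizing and hashing here are arbitrary; Git's changed-path filters use different parameters (roughly 10 bits per path, as he says).

    import hashlib

    # Toy Bloom filter: "no" is definite, "yes" is only probable.
    class BloomFilter:
        def __init__(self, num_bits=64, num_hashes=3):
            self.bits = 0
            self.num_bits = num_bits
            self.num_hashes = num_hashes

        def _positions(self, item):
            for i in range(self.num_hashes):
                digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits |= 1 << pos

        def maybe_contains(self, item):
            return all(self.bits & (1 << pos) for pos in self._positions(item))

    # One filter per commit, holding the paths that commit changed:
    changed = BloomFilter()
    changed.add("src/main.c")

    if not changed.maybe_contains("docs/guide.md"):
        print("definitely unchanged: skip this commit, parse no trees")
    if changed.maybe_contains("src/main.c"):
        print("maybe changed: parse the trees to confirm")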
The topological sorting Git has to do for that git log --graph view can use the generation numbers to be quick; whereas before, it would take six seconds just to show 10 commits, because it had to walk all of them, and now you can get that for free. So whatever size your repo is, you can just run that command, and you're good to go; it's the only time you have to think about it. Run it once, and your posture is going to be good for a long time. The next level, I would say, is: can I reduce the amount of data I download during my clones and fetches? That's partial clone, and the flavor that I prefer is blobless clones, so you run git clone --filter=blob:none. I know it's complicated, but it's what we have, and it just says: okay, filter out all the blobs, and just give me the commits and trees that are reachable from the refs. And when I do a checkout, or when I do a history query, I'll download the blobs I need on demand. So don't just get on a plane and try to do checkouts and expect it to work; that's the one thing you have to be understanding about. But as long as you relatively frequently have a network connection, you can operate as if it's a normal Git repo, and that can make your clone and fetch times fast and your disk space a lot less. So that's the next level of boosting up your scale, and it works a lot like LFS: LFS says, I'm only going to pull down these big LFS objects when you do a checkout, but this uses a different mechanism, with your regular Git blobs in there. And then the next level is: okay, I am only getting the blobs I need, but can I use even fewer? This is the idea of using sparse checkout to scope your working directory down. I like to say that beyond 100,000 files is where you can start thinking about using it; I start seeing Git chug along when you get to 100,000 to 200,000 files. So if you can at least max out at that level, preferably less, that would be great, and sparse checkout is a way to do that. The issue right now that we're seeing is, you need a connection between your build system and sparse checkout, to say: hey, I work in this part of the code, what files do I need? Now, if that's relatively stable, and you can identify, you know what, all the web services are in this directory, that's all I care about, and all the client code is over there, I don't need it, then a static sparse checkout will work: you can just run git sparse-checkout set with whatever directories you need, and you're good to go. The issue is if you want to be more precise and say, I'm only going to get this one project I need, but it depends on these other directories, and those dependencies might change, and their dependencies might change; that's when you need to build that connection. So Office has a tool they call Scooper that connects their project dependency system to sparse checkout and helps them do that automatically. But if your dependencies are relatively stable, you can manually run git sparse-checkout set, and that's going to greatly reduce the size of your working directory, which means Git is doing less when it runs checkout, and that can help out.Utsav Shah: That's a great incentive for developers to keep your code clean and modular, so you're not checking out the world, and eventually it's going to help you in all these different ways. And maybe for a final question here: what are you working on right now?
What should we be excited about in the next few versions of Git?Derek Stolee: I'm working on a project this whole calendar year, and I'm not going to be done with it before the calendar year is done, called the sparse index. It's related to sparse checkout, but it's about dealing with the index file. The index file is, if you go into your Git repository, .git/index: that index is a copy of what Git thinks should be at head and also what it thinks is in your working directory. So when it does a git status, it's walked all those files and recorded, this is the last time it was modified, or when I expect it was modified, and any difference between the index and what's actually in your working tree, Git needs to do some work to sync them up. Normally, this is just fast, it's not that big. But when you have millions of files, every single file at head has an entry in the index. Even worse, if you have a sparse checkout, even if you have only 100,000 of those 2 million files in your working directory, the index itself has 2 million entries in it; most of them are just marked with what's called the skip-worktree bit that says, don't write this to the working directory. So for the Office monorepo, this file is 180 megabytes, which means that every single git status needs to read 180 megabytes from disk, and with FS Monitor going on, it has to rewrite it to have the latest token from FS Monitor, so it has to rewrite those 180 megabytes to disk. So it takes five seconds to run a git status, even though it didn't say much; you just have to load this thing up and write it back down. So the sparse index says: because we're using sparse checkout in a specific way called cone mode, which is directory-based, not file-path-based, you can say, well, once I get to a certain directory, I know that none of the files inside of it matter. So let's store that directory and its tree object in the index instead. It's a kind of placeholder that says, I could recover all the data and all the files that would be in this directory by parsing trees, but I don't want it in my index; there's no reason for that. I'm not manipulating those files when I run a git add, I'm not manipulating them when I do a git commit. And even if I do a git checkout, I don't even care; I just want to replace that tree with whatever I'm checking out, whatever it thinks the tree should be. It doesn't matter for the work I'm doing. And for a typical developer in the Office monorepo, this reduces the index size to 10 megabytes. So it's a huge shrinking of the size, and it's unlocking so much potential in terms of our performance: our git status times are now 300 milliseconds on Windows; on Linux and Mac, which are also platforms we support for the Office monorepo, it's even faster. So that's what I'm working on. The issue here is that there are a lot of things in Git that care about the index, and they treat the index as a flat array of entries, always expecting those to be file names. So all these places in the Git codebase need to be updated to say, well, what happens if I have a directory here? What's the thing I should do? And so all of the ideas of what the sparse index format is have already been released in two versions of Git, and there are also some protections that say, well, if I have a sparse index on disk, but I'm in a command that hasn't been integrated, let me parse those trees to expand it to a full index before I continue.
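A rough sketch of that collapsing idea in Python, with invented structures; Git's real index is a binary on-disk format, and the placeholder values here are stand-ins for tree object IDs.

    # Toy sparse index: file entries inside the cone stay; everything
    # under an out-of-cone top-level directory collapses into a single
    # placeholder entry (in real Git, a tree object ID).
    def sparse_index(full_index, cone_dirs):
        entries = {}
        for path, oid in full_index.items():
            top = path.split("/", 1)[0]
            if "/" not in path or top in cone_dirs:
                entries[path] = oid                 # per-file entry, as before
            else:
                entries[top + "/"] = "tree:" + top  # one entry for the lot
        return entries

    full = {"word/a.c": "o1", "word/b.c": "o2",
            "excel/x.c": "o3", "excel/y.c": "o4", "ppt/z.c": "o5"}

    print(sparse_index(full, cone_dirs={"word"}))
    # {'word/a.c': 'o1', 'word/b.c': 'o2', 'excel/': 'tree:excel',
    #  'ppt/': 'tree:ppt'}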
And then at the end of such a command, Git will write a sparse index instead of writing a full index. What we've been going through is, let's integrate these other commands: we've got things like status, add, commit, checkout, those are all integrated, and we've got more on the way, like merge, cherry-pick, rebase. These all need different special care to make them work, but it's unlocking this idea that when you're in the Office monorepo, after this is done, and you're working on a small slice of the repo, it's going to feel like a small repo. And that is going to feel awesome. I'm just so excited for developers to be able to explore that. We have a few more integrations we want to get in there, so that we can release it and feel confident that users are going to be happy. The issue being that expanding to a full index is more expensive than just reading the 180 megabytes from disk: if I already have it in that format, reading it is faster than having to parse trees to expand it. So we want to make sure that we have enough integrations that most scenarios users do are a lot faster, and only a few that they use occasionally get a little slower. Once we have that, we can be very confident that developers are going to be excited about the experience.Utsav Shah: That sounds amazing. The index already has so many features, like the split index, the shared index; I still remember trying to read a Git index in Vim, and it just shows you it's a binary format. This is great. And do you think at some point, if you had all the time, and like a team of 100 people, you'd want to rewrite Git in a way that it was aware of all of these different features, and layered in a way where all the different commands did not have to think about these different operations, since they'd just get a presented view of the index, rather than having to deal with all of these things individually?Derek Stolee: I think the index, because it's a list of files, and it's a sorted list of files, and people want to do things like replace a few entries or scan them in a certain order, would benefit from being replaced by some sort of database; even just SQLite would be enough. And people have brought that idea up, but because this idea of a flat array of in-memory entries is so ingrained in the Git codebase, that's just not possible. Doing the work to layer on top an API that allows compatibility between the flat layer and something like SQL is just not feasible; we would disrupt users, it would probably never get done, and it would just cause bugs. So I don't think that's a realistic thing to do, but I think if we were to redesign it from scratch, and we weren't in a rush to get something out fast, we would be able to take that approach. For instance, with the sparse index, if I update one file, afterwards we write the whole index; that is something I still have to do, it's just that it's smaller now. But if I had something like a database, we could just replace that entry in the database, and that would be a better operation to do; it's just not built for that right now.Utsav Shah: Okay. And if you had one thing that you would change

    Software at Scale 31 - Maju Kuruvilla: CTO/COO, Bolt

    Play Episode Listen Later Sep 2, 2021 58:10


Maju Kuruvilla is the CTO and COO of Bolt, a startup that offers quick online checkout technology to retailers. Previously, he was VP and GM at Amazon Global Mile, in charge of Amazon’s global logistics and Amazon Prime fulfillment operations amongst other things.Apple Podcasts | Spotify | Google PodcastsHighlights00:30 - What does a VP at Amazon even do? The day-to-day experience of a VP/GM at Amazon. I think I’ve asked enough people this question that I finally have a vague sense of what these engineering leaders do (I think).04:00 - Managing global logistics in one of the world’s largest logistics companies in the middle of a pandemic09:00 - Shipping software quickly when you’re a large company. Two pizza teams with a twist.16:00 - The role of software in global logistics. Amazon’s epic migration off Oracle databases. How to get thousands of people interested in migration or similar work.25:00 - Launching Amazon Prime Now in 111 days (21 days more than what Jeff Bezos mandated).38:00 - The complexity behind a checkout operation in an online store. Tax operations, compliance (!), and other complexities.46:00 - A tech stack to solve the checkout problem.51:00 - Building trust, relationships, and making an impact as an engineering leader in a new company. Everyone wants to hire great people, but what does that really mean?TranscriptUtsav Shah: Welcome to another episode of the Software at Scale Podcast. Joining me today is Maju Kuruvilla, who is the CTO and COO of Bolt. Previously, he was VP of Global Mile, an organization at Amazon that was in charge of global fulfillment. Thank you for joining me.Maju Kuruvilla: Thanks for inviting me, Utsav. It's great to be here.Utsav Shah: Maybe we can start with: what exactly does a VP of Global Mile at Amazon do? How many people were reporting to you, eventually, and what did your day-to-day look like at that time?Maju Kuruvilla: So I had a few different roles at Amazon; the last role was VP of Global Mile. Before that, I was VP of the Worldwide Fulfillment technology team, and the difference is, when I was VP of the Worldwide Fulfillment technology team, we were responsible for all the tech and products that Amazon uses in our fulfillment centers worldwide, so that's a global responsibility. And then I moved to Global Mile, which was a little bit more of a general manager role, where I was responsible for not just the technology and products, but also the operations, the sales, all of the different components of that as an end-to-end business. Either of those roles were global and had more than a thousand engineers and a lot of product managers. And when I was managing the Worldwide Fulfillment team, even hardware teams and networking teams were involved as part of the team. The difference in Global Mile was being responsible for a P&L and an entire business; it is a little bit different than the Global Fulfillment side, and I'm happy to explain either side, but I just want to put the difference out there.Utsav Shah: So maybe you can just explain both of those things; I'm sure you're looking at P&L and then going into specifics to understand what's going on. What does your day-to-day look like?
I don't know if you can talk about any specific projects that you did; I think that would be super interesting to know about.Maju Kuruvilla: When I took on the Global Mile role; first of all, what Global Mile is: before, it used to be called Global Logistics. It's all the logistics that is done to connect between countries. So whenever an item moves between countries and has to cross a border, that's where global logistics, or Global Mile, comes into the picture. Most of the items that we sell in the US or Europe come from manufacturing countries in Asia, like China. And so when we have to bring those items from there to the US, that's part of the global logistics role. I took on this role right before the pandemic hit, and the pandemic initially hit China, and I was responsible for running the China operations at that time. And then the pandemic expanded and spread to the Western countries, and then the rest of the world, and running global logistics during that time was very challenging and exciting at the same time. A lot of things got disrupted that even today are not restored to what they were before; the global supply chain is still recovering from all the problems that have happened since the beginning of COVID. So since I took on the role, it was largely doing catch-up: what do we do with the China operations? How do we manage our people there? How do we get the operations up and running? How do we keep our people safe during this process? And then, when all the passenger flights stopped flying, the majority of the cargo space was not available after that, so we had to figure out how to create more air capacity between countries. And then there was this big backlog of things coming from China to all over the world, all the way from the challenges at the port in China to the port of Los Angeles; for example, you could see lines of ships waiting to get unloaded. Just managing all of that process, and reinventing and figuring out how to solve the global logistics problem in the middle of the pandemic, was the highlight of my time at Global Logistics.Utsav Shah: Sounds like a fun onboarding project.Maju Kuruvilla: It's hard to get trained for global logistics anywhere, because Amazon does so much of it. On top of that, dealing with it during a pandemic was certainly something I was not ready for, but like with every challenge and every job, you just have to figure things out. And it created a lot of great opportunities for innovation; we did a lot of things that we wouldn't have done at the speed at which we did if we were not presented with a constraint. So a lot of innovations came out of that, a lot of new capabilities came out of that, and it was great to have a strong team at Amazon, where everyone responded fast, started building things fast, and got [unclear 06:08] operational in a very short time.Utsav Shah: Are you at liberty to share any of those innovations that you talked about? Maju Kuruvilla: A few I can certainly share. One is the air capacity problem: around 45% of air cargo capacity comes from belly cargo on passenger flights. So when passenger flights stopped flying, you essentially lost 45% of the global air cargo capacity; it's just completely gone. And so how do we recreate that? We had to start thinking of running our own charters.
So we started renting planes, like 747s, to fly from China to the US and Europe, and from the US to export countries, and so we started leasing planes and flying them, which we had never done before; that was something we created at the last minute and started flying. An interesting story there: even getting cargo planes for lease became very expensive, because everyone was trying to do something similar, so we started renting even VIP jets. At one point, we rented a VIP jet that even I have never been on, but we were putting packages on it and shipping them all over the world. So creating that kind of air capacity in a very short time, and then building that into a framework that can be used for the longer term, was something amazing and quick we did. And whenever we have to do that, a lot of things need to come together. One is, you need to have the right technology to organize all the different products at the source, figure out the route each needs to go, and then figure out how to fill this capacity on the airplanes. You put things in something called a ULD, which is a box that you fill all the items in, and that's what you load and unload. And so you have this complex math problem of how to fill it, because you want to mix the right amount of weight and cube so that you can maximize the utilization of it. You just have to come up with all these algorithms quickly to maximize that, and then you have to handle safety, to make sure that the things that you don't want to get on a plane don't get on a plane. And the same thing once it reaches the destination: how do we unload it, and from there, how does it go into all the different distribution centers? So creating the technology for that end to end, and also standing up the operational process so we can run it end to end, and then making sure that the business is ready to run through all of that; creating all of that in a matter of a few weeks is the speed and scale we sometimes have to run with.Utsav Shah: First of all, I didn't even know that 45% of all cargo capacity is on passenger flights, so that fact itself blew my mind. But it's also interesting that there's so much software involved in shipping something like this. One thing that always fascinated me is that even though Amazon is such a large company, it can operate and ship things like these so fast, and maybe it's like a secret sauce; how does Amazon get that done? We've all heard about the two-pizza teams and all of that, but is there something else, something that you've seen as a super important part of the culture?Maju Kuruvilla: One is what you just mentioned, which is the two-pizza team, but the other aspect which, at least I think, allows Amazon to move fast is the decision-making capability. Whenever at Amazon we have to make a decision, we call it a one-way door or a two-way door decision. A one-way door decision is a decision you make and there is no going back, so you have to be extremely thoughtful about whether you make that decision and decide to walk through that door. Whereas a two-way door decision is a decision you can make, and then if you don't like it, you can always walk back. And so for us to move fast, we have to create a lot of two-way door decisions, and then allow the teams and the people involved to make them, because it's okay if they make a mistake, they can always walk back.
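The ULD fill problem Maju describes a little earlier is essentially two-dimensional bin packing. Here is a toy greedy sketch in Python with made-up weights and volumes; real load-planning systems are far more sophisticated.

    # Toy ULD fill: pick items so weight and volume run out together.
    def fill_uld(items, max_kg, max_m3):
        # Prefer items whose weight and volume consume the two capacity
        # limits most evenly, so neither limit strands the other.
        def imbalance(item):
            _, kg, m3 = item
            return abs(kg / max_kg - m3 / max_m3)

        load, used_kg, used_m3 = [], 0.0, 0.0
        for name, kg, m3 in sorted(items, key=imbalance):
            if used_kg + kg <= max_kg and used_m3 + m3 <= max_m3:
                load.append(name)
                used_kg += kg
                used_m3 += m3
        return load, used_kg, used_m3

    items = [("pallet-dense", 900, 1.0), ("pallet-bulky", 100, 4.0),
             ("boxes-mixed", 400, 2.0), ("ppe-crates", 300, 2.5)]

    print(fill_uld(items, max_kg=1500, max_m3=9.0))
    # (['boxes-mixed', 'ppe-crates', 'pallet-bulky'], 800.0, 8.5)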
And decentralizing that decision-making, allowing people, enabling them to make that decision, and giving them a framework where most of the decisions are just two-way door decisions where they can walk back, creates very fast decision-making. Enabling people to make those decisions fast, and then allowing them to verify whether it is working or not so that they can walk back: that kind of decentralized enablement is critical for companies to move fast, and Amazon certainly takes advantage of that quite a bit. So people are not afraid to make decisions, and people are not afraid to fail, because the culture is that you can make decisions and learn from them, and if something is wrong, you walk back from it; as long as you can do that properly, that's great. And that is where I feel like most companies get stuck, because nobody knows who will make a decision. Everyone is escalating the decision to a higher level, and there is somebody trying to make a decision who is not very close to all the action, and even if it's a very good decision, it just takes a long time. So we say sometimes a wrong decision made fast might be better than the right decision made very slow, as long as it's a two-way door.Utsav Shah: Yes, that makes sense. And if you can walk through the example of renting airplanes: who would make that decision? Would a VP or a GM decide that we have to rent out this airplane? How would that bubble down, just so that you can walk listeners through a project like that? Who would be making decisions, at what level?Maju Kuruvilla: Again, these are not codified anywhere, per se. And whenever you make a decision, it is one thing to make the decision; the other thing is to notify and let people know, this is what we are doing, and these are going to be the implications of that. In this scenario, it was brought up to me by the team itself. They were like, capacity is down, and we've got to come up with some new ideas. So they started coming up with options, and now the challenge here was, whenever you create capacity in any supply chain, there is a little bit of a chicken-and-egg problem: do you create capacity first, or do you create demand first? And when you build a supply chain, most of the time, you have to create capacity first, because if you create demand first, then demand is just going to wait, and you create a terrible experience for people. So you have to build capacity first, and in this particular case, it was more like, we didn't have enough demand to charter our own planes; but if we can build it and create an infrastructure around it and make it reliable, more people are going to use it. And so it was a decision that the team bubbled up; it was largely up to me to make that choice at that point. And I made a very strong recommendation to our leadership, and they were like, go for it, and the decision was made in less than six hours, and then off we go. And now that we've decided that we are going to double the capacity and we are going to figure this out, then it's all about execution. Now, that's only a small part of the decision-making; then there is the decision-making that happens every single day, like how many planes do we need, which days do we need to run? And at the beginning, at the source and the destination, do you have the right operations aligned, and on which days do you not want a plane to come and wait somewhere?
So there's a lot of decision-making that happens at that level. And then: what do you put on the plane, how do you prioritize? Do you prioritize an Amazon retail item? Do you [15:00] prioritize a seller's item? Do we prioritize protective equipment for our associates? We even transported equipment for hospitals and others, just so we could help in the middle of the pandemic. So there is decision-making happening at all levels, and the key is not one big decision. When you are working in a very fast-paced environment like this, there is a lot of micro-level decision-making, and if anybody hesitates to decide because they feel somebody is going to beat them up for it, or they don't know exactly what it means, or they don't have all the answers, then you can't move fast. But if everybody knows it's okay, that they may not have all the answers, that everybody is going to understand, and that if they did something wrong it's a great learning experience, then you can move fast. So it's not decision-making at a particular level or of a particular kind; it's enabling a culture that frees people from the fear of failure and lets them focus on: what are we trying to do, how do we achieve it fast, and what is available and possible to get it done? It's that culture of decision-making that you need in order to move fast.

Utsav Shah: That's interesting to hear. Can you apply this to a large software project that you did, so that it's more relatable to people who are used to trying to ship a large piece of software?

Maju Kuruvilla: Quite a few examples come to mind, but one is a very complex and very technical project, even though the outcome is fairly simple. Amazon had built its entire software on the Oracle database platform over decades, and we started struggling and running into constraints on Oracle. There were scaling issues, because Oracle could scale vertically but Amazon wanted to scale horizontally. So we wanted to move from vertically scaling systems to more horizontally scalable systems. The bottom line was: how do we move off Oracle and onto different platforms, whether DynamoDB or Aurora or some other kind of horizontally scalable solution? The entire company had to go through this, but for fulfillment, one of the earliest teams at Amazon, it was a big ordeal. First of all, fulfillment cannot stop: every day you're fulfilling millions of units, and everything needs to keep moving fast. But on the other side, you want to change the database you've relied on for decades. And this is not a small load; you're applying heavy load to completely new technology. It was an all-in-one database: everybody had their tables, there were dependencies across all those tables, and we had to make it all happen in some kind of sequential way, and we wanted it all done in one year. So this is one of those projects where we thought it was impossible, and then we said, alright, let's go after it, because our leadership all agreed this was the right thing to do for the company.
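Kuruvilla doesn't describe the migration mechanics on the podcast, but a common industry pattern for moving a live system off one database with zero downtime is dual writes with shadow reads and verification. The sketch below illustrates that general pattern; every class and function name here is hypothetical, and this is not a claim about how Amazon's Oracle migration actually worked.

```python
# A minimal sketch of the dual-write / shadow-read pattern often used for
# live database migrations. Hypothetical names; not Amazon's actual design.

class DualWriteRepository:
    def __init__(self, legacy_db, new_db, read_from_new=False):
        self.legacy_db = legacy_db   # e.g. Oracle: source of truth at first
        self.new_db = new_db         # e.g. DynamoDB or Aurora
        self.read_from_new = read_from_new

    def put(self, key, record):
        # The legacy store stays authoritative until verification shows
        # the new store is consistent.
        self.legacy_db.put(key, record)
        try:
            self.new_db.put(key, record)
        except Exception:
            # A failed secondary write must not fail the business operation;
            # record it so a backfill/repair job can reconcile later.
            record_mismatch(key)

    def get(self, key):
        old, new = self.legacy_db.get(key), self.new_db.get(key)
        if old != new:
            record_mismatch(key)  # feeds a dashboard that gates the cutover
        # Flip read_from_new once mismatch rates reach roughly zero;
        # after that the legacy store can be retired.
        return new if self.read_from_new else old

def record_mismatch(key):
    print(f"mismatch for key {key}")  # stand-in for real metrics/alerting
```

Teams whose features depended on cross-table transactions, as described below, could not use a simple pattern like this directly, which is part of why the remaining cases were so much harder than the first 20%.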
And some teams were able to take their data, find new destinations for it, and be done; those simple cases were maybe 20% of the use cases. For everybody else, there were a ton of features needed. Some needed transactional support and depended on the ACID properties of a database for whatever feature they were building. There were also interdependencies: if one service moved, another service also needed to move because they shared the same data, so you couldn't shift them independently. I spent a lot of time, a whole year, working with every single team, and you have to imagine I had around a thousand-plus engineers, a few hundred individual teams, and hundreds of services that needed to move off this. So this is, again, another case [20:00] where we decentralized decision-making: we told people what needed to be done, and they could go get their pieces done; if they couldn't, they needed to come back with a proposal for how the teams would collaborate. A lot of decisions also had to be made along the way: do we go from a relational database like Oracle to a NoSQL database, or from a relational database to another relational database like Postgres or Aurora, and how do we manage that whole journey? Every team had to make hundreds and thousands of micro-decisions on their side, and they had to figure out how to collaborate across all the people. And then there were times when some of those things were not happening, when there was no plan. I still remember that one of the Distinguished Engineers on the team and I had to tell people: if something is not going to work, you need to tell us right now, you need to raise your hand, because if you keep assuming it's going to work and it fails at the last minute, we won't be able to help. Ask early enough that we can help. There were a few instances where we had to go and help, and a few instances where we had to innovate and come up with completely new technology to solve the problem. Moving Amazon's entire worldwide fulfillment, which had been running on Oracle for 20 years or so, completely off Oracle to a new platform in one year, and having it run and scale just fine through peak, was no small feat and a huge outcome for the company. I think a lot of other companies would struggle to achieve something like that in such a short time.

Utsav Shah: That sounds like a hard migration. As you said, 20% of the use cases are easy, maybe half will get done without much trouble, and then there will be so many stragglers with special cases and one-offs; it just sounds extremely painful to drive. How do you set up incentives to make sure people want to do it? I'm sure a lot of it is grunt work and a bit of an ordeal, as you said. One thing you mentioned was having a Distinguished Engineer go help everyone, but people have so many competing priorities; they want to ship new things.
How do you make it easier for these teams to prioritize this work?

Maju Kuruvilla: It's all about prioritization. You can say something is important, but people will pay attention to where you are spending your time. If people see that I, or any leader, am spending most of my time on a project, then people know it's important: not because we say it, but because we do it. Then you provide structure and guidance. For example, we created a small tiger team that would audit every single team's plan, not just their migration plan but also their peak-readiness plan, and give me a report I could review. And if a team was not ready, even if they thought they were, the auditing team would come back and flag it. There were different mechanisms we put in place to enable teams like that. Now, that example is the hard groundwork of a migration. If I switch to a different example, the launch of Prime Now, that was more exciting new work. Amazon wanted to get into fast delivery, and there was a big plan for how to do it. When Bezos reviewed the plan, he said: well, this is great, I love it, go make it happen in 90 days. When you're Jeff Bezos, you get to say things like that: go build an entirely new experience that gets things to people within one hour, in 90 days. This was said when same-day delivery was not even a thing; one-hour delivery was not a thing, and nobody knew what it meant. I was responsible for all the fulfillment aspects of that, which turned out to be very complex: [25:00] how do we deliver something in less than an hour when our fulfillment systems are designed to deliver in two days or less? So again we assembled a core team, divvied up the functions, and decided to take some components of fulfillment and build a lighter version of our fulfillment stack. Some teams built the front end and the new app; others handled connectivity, payments, and fraud; and in the end we even created technology within the fulfillment center so you can pick and box all these things lightning-fast. Then we launched it in New York, right in the middle of the city during Christmas, and I was there because we had worked hard on it, and I did deliveries myself. We call it a rideshare: you go along with the people who are delivering and see the experience from the customer's side, and I did that for some of the initial deliveries. It was fascinating to watch people's reactions. Sometimes we delivered things within eight minutes of the customer first pressing the order button, and you can just see the stars in customers' eyes when something shows up within eight minutes of ordering it. Mind you, back in the day these things didn't exist, so it was a magical experience for everyone. And we got it done, not just my team but the collective team, in 11 days from the Bezos meeting. Even though we had heard 90 days, we got it done in 11 days, and I still don't know how. Bezos was very happy, but that's the speed at which you move. And when you need to move at that kind of speed, you are innovating on so many things; you are making decisions on so many items.
But it's not one person making all of those decisions; the entire team makes them, and that allows everyone to move faster.

Utsav Shah: Just from reading one of the books on Amazon and hearing your stories, it seems the idea is to fund each level so that it has enough people, so that staffing isn't the blocker, and then make sure there is fast decision-making and accountability. So projects get done on time, rather than underfunding each level so that it takes forever to get anything done because people are spread too thin, or something like that.

Maju Kuruvilla: Well, whenever you have a big problem, it's very hard to solve it as is. Number one is to agree on the problem: make sure solving it makes sense for everybody, that it's the right thing to do for the company and the customer, and that we have the right know-how to make it happen. Once we do that, the next question is how we divvy up the problem into small chunks. If you want to move a mountain, it's extremely hard to move the mountain, but if everybody can take a piece of rock from it, and thousands or millions of people assemble, then we can move the mountain. That's the concept of swarming a group of people around a huge problem: not attacking the whole problem as one, but divvying it up into smaller components. Now, divvying up the components is an art. It's not just "I will do this part and you do that part"; each smaller component needs to be fully functional by itself, so the team that builds it knows how to test it. Think of a car: each component is like a tire, and if you build the tire properly, you can have a tire team and a wheel team. The good thing is that the team can continue to obsess over that component and make it better every day. Today you can come up with a new tread pattern, and tomorrow something else, but they have built something that is complete in its own right as a component, and they can continue [30:00] to make it better over its whole life. At Amazon we call that a two-pizza team, and what it means is that the team has ownership of a component that is complete and relevant [unclear 30:20]. It's not short-term, as in "for this project I will do this piece"; it's long-term ownership: I own this component, I will continue to make it better, and I know how it fits into all the larger pieces, but I don't need to worry about how the headlights work or anything else. I know it's all there, but I'm not constrained by it, because I built the right contract: if it's a tire, it just needs to fit the right wheel, and then I'm good, and what happens beyond that I don't need to care about. I'm going to obsess over this tire and all the materials in it forever, and make it better over time. That's the concept: how do we break it down into components where each team can own one of those pieces and obsess over it?
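The tire-and-wheel contract Kuruvilla describes maps naturally onto interface contracts in code: consumers depend on the contract, and the owning team is free to keep improving whatever sits behind it. Here is a small illustrative sketch in Python; the names, and the inventory-counting example (which anticipates the story he tells shortly), are hypothetical, not Amazon's actual interfaces.

```python
from typing import Protocol

class InventoryCounter(Protocol):
    """The contract other teams depend on. The owning two-pizza team can
    swap the implementation behind it (manual audit, computer vision, ...)
    without coordinating with its consumers."""
    def count(self, fulfillment_center: str) -> int: ...

class ManualAudit:
    def count(self, fulfillment_center: str) -> int:
        return 0  # placeholder for the old people-walking-the-floor process

class VisionCounter:
    def count(self, fulfillment_center: str) -> int:
        return 0  # placeholder for a camera-plus-ML counting pipeline

def reconcile(counter: InventoryCounter, fc: str, expected: int) -> bool:
    # Consumers program against the contract, not the implementation,
    # which is what lets the owning team "obsess over the tire" freely.
    return counter.count(fc) == expected
```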
Every day, 24/7, 365 days a year, they just try to make it better, and when a lot of components come together, you assemble them to solve the larger problem you're going after.

Utsav Shah: And I'm guessing the holistic vision of the assembly is management's responsibility, along with the senior engineers and the principal engineers who give guidance.

Maju Kuruvilla: Yes, though more than management it's the senior engineers, principal engineers, and distinguished engineers, because the way Amazon is structured, you have this two-pizza team, so called because supposedly you should be able to feed the team with two pizzas. The magic number is somewhere between seven and eleven people, and that team should have everybody you need to solve the problem for that component. Sometimes the team may have hardware people, software people, data science people; it cannot be just a software team. It's a team of whoever you need to solve that problem. Then, whenever you have a principal engineer, their responsibility is to look across multiple components like this and see how they all come together, and the senior engineers in the teams, even though they are part of the team, also look across, negotiate, and make sure all the pieces fit together well. But the rest of the team stays heads-down focused on what they need to build and how, and they all have a metric they are trying to improve; Amazon calls it a fitness function. So you are looking at that fitness function and making sure you're continuing to make progress on it.

Utsav Shah: And that's some quality metric indicative of whether this team's component is functioning properly or not.

Maju Kuruvilla: It's more than that; it's about this team's component being the best it can be. It's not a functional metric; a fitness function means this is the most important thing you can measure to see whether they are doing the ultimate best they can do.

Utsav Shah: That's interesting, and then you can imagine some teams might even have an NPS score or something eventually tracked as part of that fitness function. And then finally, management reviews all of these different teams' fitness functions to make sure everything is coming together, so they can deliver the final large piece, something huge like Prime Now.

Maju Kuruvilla: And usually there is an operating plan that happens every year, where you see how everything is going to come together. Amazon also has a heavy documentation culture; writing and reading are important, and reading is as important as writing, by the way, because I have seen a lot of companies say they have a writing culture, but most of them don't read. Writing has a network effect: if people don't read, then writers have no incentive to write properly. Amazon has a very good writing and reading culture, and what that means is that whenever you have to solve, let's say, [unclear 34:43]... another product we built was a computer vision system to completely automate the inventory counting process in all fulfillment centers. It used to be a very manual [35:00] process. Before, people used to count, like when you're closing a store and have to count the entire inventory, all the time for compliance purposes, and also to make sure the virtual and the physical inventory match.
And we took on a brand-new project where computer vision systems constantly monitor things as robots move them across the floor, which can replace the whole manual counting process. Again, a team of 11 people made that happen, from an idea they pitched to making it run in Amazon's fulfillment centers worldwide; they just made it happen. It saved a lot of money for the company and automated a lot of processes, and it's one more example of a two-pizza team coming together. So when you have to do something like that, the first thing you write is what we call a press release document, or a PR FAQ document. It's one page that clearly says: what's the problem, why do we think we are the right people to solve it, is this the right problem to solve, and if you solve it, what will the experience be for the customer? You have to write the customer's experience after the problem is solved, before you start doing anything. That's what the press release is: you are writing down how the experience is going to be after the fact, before you start any work. That is very powerful, by the way, because when you have to write it from the customer's angle, a lot of things become very clear; things you didn't think through become obvious, because you might be solving only part of a problem, or a part of a bigger problem, and the customer only cares if the whole thing works, not a piece of the result. So writing out that customer experience is very powerful, and then there is a sequence of FAQs. It's usually only one page for the press release, by the way, and then a lot of questions, and that document is the first thing everybody writes and reviews. Whether it's Prime Now or this computer-vision-based inventory counting, the first thing people do is write the document, and it gets reviewed by leadership. Leadership reads it and finally approves, and then you can start a new two-pizza team for it, or allocate resources for it. There might be multiple two-pizza teams that come together to solve it, and you explain that in your plan. Once the plan is approved, people go off from that point and make it happen.

Utsav Shah: Switching gears a little bit: one of the things Amazon is good at is making sure the checkout experience is super smooth. There's the one-click option, which I think Amazon has a patent on, so it's not easy for other companies to do something similar. A lot of the reason people like using Amazon is the smooth checkout experience, plus it's super reliable. Just recently I tried to buy a Dell laptop from Dell, and twice they just canceled my order, and I had to place it through Amazon. I don't know how much Dell has to pay Amazon for it, but they just lost the commission on that laptop. So can you walk through why it's so important for checkout to be seamless? Intuitively it makes sense that you don't want people abandoning stuff in their cart, but are there any numbers about why that checkout experience has to be as convenient as it is?

Maju Kuruvilla: It's an interesting story, and the whole online commerce story is unfolding in front of our eyes as we speak; this is the time for e-commerce.
Companies like Amazon created this online buying experience and got people to trust buying online. Before, people were worried about the security and safety of their stuff and the quality of the things they might get. Amazon solved all of that and got people to buy online without thinking twice about it. What we are seeing now is that consumers are used to it, and the pandemic accelerated the whole process too: it accelerated e-commerce adoption by almost 10 years compared to the previous pace. If we hadn't had a pandemic, it would have taken another 10 years to get where we are. So now people are [40:00] buying, and especially for the newer generation, buying online is not a big deal; they don't even think twice about it. But what comes with that is that people also want a different experience. They will continue to buy from Amazon, but they also want to buy from other places, because there are a lot of different merchants and brands that want to provide a unique experience for customers. That experience is not just about buying a product; it's about connecting yourself with the brand, the experience of buying, and the whole post-purchase experience of staying connected with that brand. Sometimes that relationship is more than just buying something, and more people want that, especially the newer generations. The challenge for most of those merchants is: how do you provide a simple, seamless checkout experience like Amazon's? Because Amazon has, as you said, an amazing checkout experience, and people are used to it now. You don't need to beat Amazon on that, but you do need to provide a similar experience everywhere else, and this is where companies like Bolt come in. I'll speak about Bolt a little here: we provide that checkout experience people are used to, that one-click experience where you come to a site, click one button, and it's yours. If you can provide that experience, people will engage with brands and merchants a lot more than if they have to go through a high-friction buying process. When you think of buying, it's a funnel: first there is research, then discovery of the product, then intent to buy. That's where most of the time [unclear 42:11], and then there is the conversion, and at every point you are adding friction where people can drop out of the process. If you can remove all of that... think of renting a movie: having to go all the way to a Blockbuster, stand in line, pick up a video, and come home to watch it is a lot of friction, versus sitting at home and clicking Netflix, and boom. People watch more Netflix because of the ease of it than they ever did with Hollywood Video or Blockbuster or somebody like that, because there was a lot of friction there, and people simply do more when it's convenient. The same applies to checkout: we want to eliminate all that friction so that when people have the intent to buy, it becomes a conversion. There's nothing standing in between. And beyond buying things on a brand's website, there is one more step beyond that.
This is where a lot of people are moving, and it's called social commerce: people want to buy things as soon as they see them, which is called buying at the point of discovery. You see an advertisement, a video, an influencer's post, or a review, and you see something you want. Can you click and buy it right there with one click, or is it a link that takes you to some site where you have to go through the whole process? Simplifying this whole flow, whether it's buying from a website or buying at the point of discovery on any surface, is going to be very critical in the future. In fact, that's going to be the expectation for a lot of people as we move further into the e-commerce journey, and that's what companies like Bolt provide out of the box for merchants, so they can offer this experience to their customers.

Utsav Shah: Maybe talk a little about the impact along that funnel. If I'm a shopkeeper today who just set up my site without using Bolt or an optimized checkout flow, what abandonment rates should I expect? I know it's different for everyone, but I'm trying to understand how much that convenience factor plays into online purchases.

Maju Kuruvilla: It's substantial. It depends on the merchant, the category, and the customer, [45:00] and we have several case studies listed on our website, but in some of the good cases you see an 80% increase in conversion with a one-click checkout experience; that's at the highest end. It's very powerful for a couple of reasons. One is just the friction: you don't need to go do something else to make the purchase happen. Number two is safety, people's sense of safety and privacy. Do I want to give my information to every website out there, or do I give it to one party I trust, which keeps my identity and all my information in one place but lets me log in everywhere? It's the single-sign-on concept applied everywhere, and that provides more comfort and safety for customers. So one side is friction, and the other side is having a safe way to buy from anywhere.

Utsav Shah: One thing you spoke about earlier was the fulfillment stack, building all the technology for fulfillment. Maybe you can walk us through the checkout stack, if you have to build a checkout system like this. At a glance, you're adding something to a cart, retrieving that cart, and making a purchase, but how does this work? How do you make it one-click? And what are some challenges? I can imagine all sorts of things to worry about: users might click by mistake, there's fraud, and so on. How do you solve all of these problems?

Maju Kuruvilla: Great question. When you look at checkout from the outside, it seems like a fairly simple process: you add something to the cart, there's a checkout button, you click it, it asks you for payment and a few things, and it's done. But checkout is one of the most complex parts of a commerce stack, and the reason is that until checkout, it's just browsing. You are adding things here and there, and nothing needs consistency until that point; but when it comes to checkout, it's real.
The checkout system needs to check inventory, taxes, coupons, the pricing at that moment, and all the shipping options; anything and everything you need to check is done at checkout. Checkout needs to call every single system an e-commerce stack has, to bring it all together so you can present it to the customer: what it takes to buy the thing and when they can get it. So it is an extremely complex function. Behind the scenes there is the UX, how we provide a seamless user experience. At Bolt, for example, we obsess over how every single pixel works in that checkout: how it works on the website, how it works on mobile, how we make it so optimized that customers feel the whole process is a breeze. Then there is the payment gateway. People may want to use a variety of payment instruments, and the payments world is changing so fast that a new or alternative payment method comes out every other week nowadays. How do we keep providing merchants integration with the entire payments world, and how do we present all of that so customers can choose the right option? That's another layer of complexity. Then there is the identity of the user itself: how do we provide one click? We need to save the user's name, payment details, and all the information needed to do that. So at a company like Bolt, we treat all the shoppers' data as a network. Think of all the different shoppers connected to our shopping accounts network: everybody contributes to that network, and everybody benefits from it when they use Bolt checkout. The Bolt checkout system is built on an accounts network that keeps growing and evolving; as people shop across all these different merchants, they get added to the network. So when a brand-new merchant starts using Bolt, they have access to the whole network created so far and can offer one-click checkout to every single user in that [50:00] network. What we are finding is a virtuous cycle: as we get more merchants, we get more accounts in our network; more accounts in the network mean more one-click checkout transactions; and more one-click means more merchants want to sign up with us. That virtuous cycle is accelerating Bolt's growth. But those are the different layers, all the way from the UI through all the complexity around it, fundamentally built on this shared accounts network that truly powers the one-click experience for everybody.

Utsav Shah: So as an end user, do I know I'm using Bolt when I'm buying something on [unclear 50:46] website, or is it opaque to me?

Maju Kuruvilla: You are buying it through Bolt; however, we don't try to brand it as a completely separate brand, because the merchant is buying us, and we want to integrate seamlessly into the merchant's ecosystem. We want to look like we are enabling the merchant, and we want to stay out of the way between the customer and the merchant, because we want to provide an experience that's as seamless as possible.
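To make the fan-out Kuruvilla describes concrete, here is a minimal sketch of a checkout orchestrator that consults inventory, pricing, tax, coupon, and shipping services before presenting a final total. All service names and signatures are hypothetical stand-ins, not Bolt's actual API; the point is only that checkout is the first step where every answer must be fresh and consistent.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class Item:
    sku: str
    qty: int

async def build_checkout(items, address):
    # Fan out concurrently to every system whose answer can change between
    # browsing and buying; checkout latency matters, so avoid serial calls.
    inventory, prices, tax, coupons, shipping = await asyncio.gather(
        check_inventory(items),
        current_prices(items),
        compute_tax(items, address),
        applicable_coupons(items),
        shipping_options(items, address),
    )
    if not all(inventory[i.sku] for i in items):
        raise ValueError("some items are no longer in stock")
    subtotal = sum(prices[i.sku] * i.qty for i in items)
    discount = sum(coupons.values())
    return {"total": subtotal - discount + tax, "shipping": shipping}

# Hypothetical downstream services, stubbed so the sketch runs end to end.
async def check_inventory(items): return {i.sku: True for i in items}
async def current_prices(items): return {i.sku: 10.0 for i in items}
async def compute_tax(items, address): return 1.50
async def applicable_coupons(items): return {}
async def shipping_options(items, address): return ["standard", "express"]

if __name__ == "__main__":
    cart = [Item("sku-1", 2), Item("sku-2", 1)]
    print(asyncio.run(build_checkout(cart, "123 Example St")))
```

One-click checkout then amounts to skipping the data-entry steps: if the shared accounts network already holds the shopper's verified payment and shipping details, an orchestration like the one above can run off stored inputs from a single button press.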
However, users do need to know it's Bolt so they can trust us: they know it's powered by Bolt when they're creating an account with Bolt or logging in with Bolt, so they know where their information is, they can control and manage it, and they can be at peace that we are taking care of it in one safe place.

Utsav Shah: That makes sense to me. Now I'm curious to learn: you were running divisions and organizations at large companies like Amazon before, and you can't just take every single good idea you have and apply it at a much smaller company. What are some things you changed in your first few months there, and what are some things you feel extremely strongly about even at a much smaller company? Bolt is not super small, clearly, but what are some learnings you felt you had to apply?

Maju Kuruvilla: First and foremost, I would say that no matter the size of the company, you have to be connected with the customer, every single person, all the way. It doesn't matter whether you're an engineer or an accountant; whatever your discipline, if you are at a company whose mission you're passionate about, you need to be very connected with your customer. You need to sit in on some customer calls, attend some support calls, be on call sometimes, be right in the thick of things. Big or small doesn't matter; that I'm passionate about, and I'll push to make sure everyone is deeply connected to it. Number two is hiring great people. Your company is only as good as the people you have, so hiring great people and taking care of them is the highest priority of any leader at any company, big or small. For example, when I came in, we were hiring a lot of people into Bolt because we are growing fast, so I spent a lot of time on hiring and on meeting with our people. In my first 90 days, I met with every single person in the engineering and product organization and had a one-on-one built around three simple questions: what's working well at Bolt, what's not working well, and what are you hoping I'll do to help? Three questions to every single person in engineering and product, and I did north of 100 of those conversations in my first 90 days, across both sides of the world. You have to take care of your great people, understand them, what they want, what's working, and what's not working, so you can fix things, and you have to keep bringing great people into the company. Knowing the customer and having a great team are the two things I hold dear, no matter where I go.

Utsav Shah: That makes a lot of sense to me, but how do you [55:00] gauge the right people? How do you know somebody is good or not? Is there something in the interview you do, or is it just a quality you see? I'm just curious; everybody talks about hiring great people, but what does that mean?

Maju Kuruvilla: There are two aspects I personally look for when hiring great people. One is the basic table stakes: operational excellence, technical skills, and all of that, which almost every company looks for, and which is great.
But then there is the other side, and this is where, beyond culture, I tend to look for people who are systems thinkers: people who, whenever they hear a problem, don't just ask how to solve this piece of it, but take a step back and look at what's going on and what the right way to solve it is. Maybe the solution is very different from what's obvious at the beginning. People who can take that step back have systems thinking, and what that means is they truly solve the problem: sometimes not the problem in front of them, but the original problem that caused what they are seeing right now. So during interviews I focus a lot on questions like: what are some of the things they did? Why did they do it that way? Why was that solution the right one for that particular problem? How did they think it would solve what they were looking for? What I'm finding is that people who can think at that system level, end to end, end up solving things in innovative ways, compared to people who just attack individual issues one by one. I know it's not a very scientific way to assess, but it's a mindset, and I have found that those people tend to create much longer-lasting impact than most others.

Utsav Shah: Yes, I think that makes a lot of sense: a systems-thinking framework, people who can understand the problem holistically rather than just looking at the smaller pieces. This was a lot of useful information, and we're almost out of time. Thank you so much for being a guest. I certainly feel I learned a lot about a company that is sometimes pretty much a black box from the outside. I had no idea about so many things, about how fulfillment works, checkout... thank you so much. Get on the email list at www.softwareatscale.dev

    Software at Scale 30 - Bharat Mediratta: Coinbase Fellow

    Play Episode Listen Later Aug 18, 2021 53:01


    Bharat Mediratta is the first Coinbase Fellow. Previously, he was a Distinguished Engineer at Google, CTO at AltSchool, and CTO at Dropbox. Get on the email list at www.softwareatscale.dev

    Software at Scale 29 - Sugu Sougoumarane: CTO, PlanetScale

    Play Episode Listen Later Aug 4, 2021 73:25


    Listen now | Elon Musk, Databases in Containers, and other horrors Get on the email list at www.softwareatscale.dev

    Software at Scale 28 - Tammy Butow: Principal SRE, Gremlin

    Play Episode Listen Later Jul 27, 2021 58:17


    Listen now | Tammy Butow is a Principal SRE at Gremlin, a Failure as a Service platform company that helps engineers build more resilient software. She’s also the co-founder of Girl Geek Academy, an organization to encourage women to learn technology skills. She previously held IC and management roles in SRE at Dropbox and Digital Ocean. Get on the email list at www.softwareatscale.dev

    Software at Scale 27 - Itiel Schwartz: CTO and Co-Founder, Komodor

    Play Episode Listen Later Jul 15, 2021 43:32


    Listen now | Kubernetes and Kubernetes Debugging Get on the email list at www.softwareatscale.dev

    Software at Scale 26 - Tramale Turner: Head of Engineering, Traffic at Stripe

    Play Episode Listen Later Jul 7, 2021 64:04


    Listen now | Tramale Turner is the Head of Engineering, Traffic at Stripe. Previously, he was a Senior Engineering Manager at F5 Networks and a Senior Manager at Nintendo. Apple Podcasts | Spotify | Google Podcasts This episode has an unexpectedly deep dive into security and compliance at Stripe. We discuss some of the various challenges that Stripe has to deal with for managing its internal security, as well as solve for various compliance requirements like PCI, audit logging, data locality, and more. Get on the email list at www.softwareatscale.dev

    Software at Scale 25 - Rajesh Venkataraman: Senior Staff Software Engineer at Google

    Play Episode Listen Later Jun 25, 2021 52:16


    Listen now | Building Search at Microsoft, Google, and Dropbox Get on the email list at www.softwareatscale.dev

    Software at Scale 24 - Devdatta Akhawe: Head of Security, Figma

    Play Episode Listen Later Jun 17, 2021 51:36


    Listen now | Devdatta Akhawe is the Head of Security at Figma. Previously, he was Director of Security Engineering at Dropbox, where he led multiple teams on product security and abuse prevention. Apple Podcasts | Spotify | Google Podcasts On this episode, we discuss security for startups, as well as dive deep into some interesting new developments in the security realm like Get on the email list at www.softwareatscale.dev

    Software at Scale 23 - Laurent Ploix: Engineering Manager, Spotify

    Play Episode Listen Later Jun 10, 2021 59:32


    Listen now | Laurent Ploix is an engineering manager on the Platform Insights team at Spotify. Previously, he was responsible for CI/CD at several Swedish companies, most recently as a Product Manager at Spotify, and a Continuous Integration Manager at Sungard. Apple Podcasts Get on the email list at www.softwareatscale.dev

    Software at Scale 22 - Sujay Jayakar

    Play Episode Listen Later Jun 2, 2021 65:51


    Listen now | The Magic Pocket, Kernel Bypass Networking, and more Get on the email list at www.softwareatscale.dev

    Software at Scale 21 - Colin Chartier: CEO, LayerCI

    Play Episode Listen Later May 19, 2021 57:17


    Listen now | Colin Chartier is the co-founder and CEO of LayerCI. LayerCI speeds up web developers by providing unique VM-like environments for every commit to a codebase. This enables developers, product managers, QA, and other stakeholders to preview code changes extremely quickly, and removes the need to spin up a local environment to showcase a demo. Colin was previously the CTO of ParseHub and a software design lecturer at the University of Toronto. Get on the email list at www.softwareatscale.dev

    Software at Scale 20 - Naphat Sanguansin: ex Server Platform SRE, Dropbox

    Play Episode Listen Later May 12, 2021 62:35


    Listen now | Naphat Sanguansin was the former TL of the Server Platform SRE and Application Services teams at Dropbox, where he led efforts to improve Dropbox’s availability SLA and set a long term vision for server development. This episode is more conversational than regular episodes, since I was on the same team as Naphat and we worked on a few initiatives together. We share the story behind the reliability of a large monolith with hundreds of weekly contributors, and the eventual decision to “componentize” the monolith for both reliability and developer productivity that we’ve written about officially Get on the email list at www.softwareatscale.dev

    Software at Scale 19: Vanta

    Play Episode Listen Later May 4, 2021 59:18


    Listen now | Interview with Christina Cacioppo, CEO, and Robbie Ostrow, First Software Engineer Get on the email list at www.softwareatscale.dev

    Software at Scale 18 - Alexander Gallego: CEO, Vectorized

    Play Episode Listen Later Apr 27, 2021 61:41


    Alexander Gallego is the founder and CEO of Vectorized. Vectorized offers a product called RedPanda, an Apache Kafka-compatible event streaming platform that’s significantly faster and easier to operate than Kafka. We talk about the increasing ubiquity of streaming platforms, what they’re used for, why Kafka is slow, and how to safely and effectively build a replacement. Previously, Alex was a Principal Software Engineer at Akamai systems and the creator of the Concord Framework, a distributed stream processing engine built in C++ on top of Apache Mesos.

    Apple Podcasts | Spotify | Google Podcasts

    Highlights

    7:00 - Who uses streaming platforms, and why? Why would someone use Kafka?
    12:30 - What would be the reason to use Kafka over Amazon SQS or Google PubSub?
    17:00 - What makes Kafka slow? The story behind RedPanda. We talk about memory efficiency in RedPanda, which is better optimized for machines with more cores.
    34:00 - Other optimizations in RedPanda
    39:00 - WASM programming within the streaming engine, almost as if Kafka was an AWS Lambda processor.
    43:00 - How to convince potential customers to switch from Kafka to Redpanda?
    48:00 - What is the release process for Redpanda? How do they ensure that a new version isn’t broken?
    52:00 - What have we learnt about the state of Kafka and the use of streaming tools? Get on the email list at www.softwareatscale.dev

    Software at Scale 17 - John Egan: CEO, Kintaba

    Play Episode Listen Later Apr 20, 2021 58:16


    Listen now | John Egan is the CEO and Co-Founder of Kintaba, an incident management platform. He was the co-creator of Workplace by Facebook, and previously built Caffeinated Mind, a file transfer company, which was acquired by Facebook. In this episode, our focus is on incident management tools and culture. We discuss learnings about incident management through John’s personal experiences at his startup and at Facebook, and his observations through customers of Kintaba. We explore the stage at which a company might be interested in having an incident response tool, the surprising adoption of such tools outside of engineering teams, the benefits of enforcing cultural norms via tools, and whether such internal tools should lean towards being opinionated or flexible. We also discuss postmortem culture and how the software industry moves forward by learning through transparency of failures. Get on the email list at www.softwareatscale.dev

    Software at Scale 16 - Nipunn Koorapati: ex Software Engineer, Dropbox

    Play Episode Listen Later Apr 13, 2021 75:26


    Listen now | Nipunn Koorapati was a Software Engineer at Dropbox, where he worked on two distinct areas - Developer Productivity and Client Sync. He drove many initiatives like consolidating various repositories into a server-side monorepo (read more here), and was part of a high leverage project to Get on the email list at www.softwareatscale.dev

    Software at Scale 15 - Ben Sigelman: CEO, Lightstep

    Play Episode Listen Later Apr 4, 2021 54:24


    Listen now | Ben Sigelman is the CEO and Co-Founder of Lightstep, a DevOps observability platform. He was the co-creator of Dapper - Google’s distributed tracing system and Monarch - an in-memory time-series database for metrics.  Get on the email list at www.softwareatscale.dev

    Software at Scale 14 - Liran Haimovitch: CTO, Rookout

    Play Episode Listen Later Mar 23, 2021 40:04


    Listen now | Liran Haimovitch is the Co-Founder and CTO of Rookout, a new style debugging tool that enables developers to debug web applications by adding debugger style breakpoints in production (without actually stopping the application). Rookout belongs to a new class of developer tools that aim to make application debugging more interactive than the standard “inspect logs experience” that is standard industry practice today. I’d encourage checking out the Get on the email list at www.softwareatscale.dev

    Software at Scale 13 - Emma Tang: ex Data Infrastructure Lead, Stripe

    Play Episode Listen Later Mar 21, 2021 41:00


    Listen now | Effective Management of Big Data Platforms Get on the email list at www.softwareatscale.dev

    Software at Scale 12 - John Micco: Cloud Transformation Architect, VMWare

    Play Episode Listen Later Mar 13, 2021 94:55


    Listen now | John Micco is a Cloud Transformation Architect at VMWare, where he works on CI/CD systems. He’s worked in the CI/CD space... Get on the email list at www.softwareatscale.dev

    Software at Scale 11 - Barak Schoster: CEO, BridgeCrew

    Play Episode Listen Later Mar 4, 2021 38:15


    Listen now | Barak Schoster is the CEO of BridgeCrew, a cloud security platform that was just acquired by Palo Alto Networks. He’s also the maintainer of Checkov, a popular static code analysis tool for infrastructure-as-code. In this episode, we discuss both aspects - the experience running a DevOps company and a popular open-source tool. Get on the email list at www.softwareatscale.dev

    Software at Scale 10 - David Cramer: CTO, Sentry

    Play Episode Listen Later Feb 24, 2021 73:24


    Listen now | David Cramer is the co-founder and CTO of Sentry, a monitoring platform helps every developer diagnose, fix, and optimize the performance of their applications. Before this, he worked at Dropbox and Disqus. Apple Podcasts | Spotify | Google Podcasts Get on the email list at www.softwareatscale.dev
