Podcasts about MapReduce

  • 83 podcasts
  • 119 episodes
  • 52m average episode duration
  • 1 new episode monthly
  • Latest episode: May 6, 2025

Popularity trend, 2017–2024 (chart)



Latest podcast episodes about MapReduce

Engineering Kiosk
#194 What became of MapReduce and the functional elegance in distributed systems?


May 6, 2025 · 60:13


MapReduce: a deep dive

In 2004, processing large volumes of data was a real challenge. Some companies had so-called supercomputers for the job. Others just shrugged and waited for their computations to finish. Google was one of the players that had large volumes of data and wanted to process them, but had no supercomputers at its disposal. Or, more precisely: it didn't want to spend the money.

So what do you do when you have a problem? You look for a solution. That's what Jeffrey Dean and his team did. The result? A revolutionary paper on how to use MapReduce to process large volumes of data, distributed across simple commodity hardware.

In this podcast episode, we take a closer look. We explain what MapReduce is, how it works, why MapReduce was so revolutionary, how it dealt with hardware failures, which challenges it faced in practice (and still faces), what the Google File System, Hadoop, and HDFS have to do with it, and we place MapReduce in the context of today's technologies, cloud and all.

Another episode of “Papers We Love”.

Bonus: Hadoop is, of course, the elephant in the room.

You can find our current advertising partners at https://engineeringkiosk.dev/partners
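To make the programming model concrete, here is a minimal single-machine word-count sketch in Python. The `map_fn`/`reduce_fn` names and the in-memory shuffle are illustrative simplifications; a real MapReduce framework runs these phases distributed across many machines and handles partitioning and failures for you.

```python
from collections import defaultdict
from itertools import chain

def map_fn(line: str):
    # Map phase: turn each input record into (key, value) pairs.
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    # Reduce phase: combine all values emitted for one key.
    return word, sum(counts)

def mapreduce(lines):
    # Shuffle: group intermediate pairs by key. On a cluster this is the
    # distributed sort/partition step between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_fn(line) for line in lines):
        groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

print(mapreduce(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

The insight the paper turned into infrastructure is that anything expressible in this shape can be parallelized mechanically, which is what made cheap commodity clusters a substitute for supercomputers.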

Data Driven
Jacob Leverich on Efficiency, Elegance, and the Joy of Not Grepping log files at 2AM


Apr 22, 2025 · 58:10


This week, Frank sat down with Dr. Jacob Leverich: Stanford PhD, co-founder of Observe, and a veteran of the Google MapReduce team and Splunk. Jacob's journey, from tinkering with video game code as a kid to innovating at the cutting edge of distributed systems and energy efficiency, is as inspiring as it is informative.

Key Takeaways

* Early Tech Roots: Hear how curiosity with QBasic and classic PCs (think IBM PC XT and Commodore) put Jacob on a path to high-impact data engineering.
* MapReduce, Dremel, and the Rise of Big Data: Jacob pulls back the curtain on working with some of the most influential data processing tools at Google, and how these systems shifted the entire data landscape (hello, BigQuery!).
* Building Efficient Systems: It's not just about scale; energy efficiency and performance optimization are the unsung heroes of today's data infrastructure. Jacob explains why making things “just work” isn't enough anymore.
* The Realities of Ops and Observability: Remember the days of grepping logs at 2 AM? There's a better way. Jacob shares how platforms like Observe help teams consolidate, visualize, and act on operational data, turning chaos into actionable insight.
* Bridging Data and Ops: The lines between data observability and traditional ops are blurring, and Jacob's unique experience shows how best practices from data warehousing are finally making ops smoother (and less sleepless).
* Power Concerns and the Future: As data grows, so does energy consumption in data centers. Find out why optimization isn't just good for performance; it's key to sustainability.

Timestamps

00:00 Interview with Jacob Leverich
05:59 Journey into Game Programming
06:43 “Pursuing Fast Video Game Code”
10:23 Data Processing and Power Efficiency
16:11 Snowflake's Transformative Database Approach
19:18 Journey to Data Management Industry
21:37 Data Products: Solving Core Challenges
27:07 Early Web Log Analysis Techniques
28:57 Consolidating Data for Efficiency
33:23 Specialized Tools and Context Switching
35:43 Unique Dual-Expertise in Tech
38:58 User-Centric Business Strategies
42:13 IP Data Analysis in Cloud
47:23 Electricity Transport Upsets Local Farms
48:25 Shift to Parallel Computing
52:10 Hardware Specialization & Software Optimization
57:32 “Stay Data Driven”

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

The free livestreams for AI Engineer Summit are now up! Please hit the bell to help us appease the algo gods. We're also announcing a special Online Track later today.

Today's Deep Research episode is our last in our series of AIE Summit preview podcasts. Thanks for following along with our OpenAI, Portkey, Pydantic, Bee, and Bret Taylor episodes, and we hope you enjoy the Summit! Catch you on the livestream.

Everybody's going deep now. Deep Work. Deep Learning. DeepMind. If 2025 is the Year of Agents, then the 2020s are the Decade of Deep.

While “LLM-powered search” is as old as Perplexity and SearchGPT, and open source projects like GPTResearcher and clones like OpenDeepResearch exist, the difference with “Deep Research” products is that they are both “agentic” (loosely meaning that an LLM decides the next step in a workflow, usually involving tools) and bundle custom-tuned frontier models (a custom-tuned o3 and Gemini 1.5 Flash).

The reception to OpenAI's Deep Research agent has been nothing short of breathless:

* “Deep Research is the best public-facing AI product Google has ever released. It's like having a college-educated researcher in your pocket.” - Jason Calacanis
* “I have had [Deep Research] write a number of ten-page papers for me, each of them outstanding. I think of the quality as comparable to having a good PhD-level research assistant, and sending that person away with a task for a week or two, or maybe more. Except Deep Research does the work in five or six minutes.” - Tyler Cowen
* “Deep Research is one of the best bargains in technology.” - Ben Thompson
* “my very approximate vibe is that it can do a single-digit percentage of all economically valuable tasks in the world, which is a wild milestone.” - sama
* “Using Deep Research over the past few weeks has been my own personal AGI moment. It takes 10 mins to generate accurate and thorough competitive and market research (with sources) that previously used to take me at least 3 hours.” - OAI employee
* “It's like a bazooka for the curious mind” - Dan Shipper
* “Deep research can be seen as a new interface for the internet, in addition to being an incredible agent… This paradigm will be so powerful that in the future, navigating the internet manually via a browser will be ‘old-school', like performing arithmetic calculations by hand.” - Jason Wei
* “One notable characteristic of Deep Research is its extreme patience. I think this is rapidly approaching ‘superhuman patience'. One realization working on this project was that intelligence and patience go really well together.” - HyungWon
* “I asked it to write a reference Interaction Calculus evaluator in Haskell. A few exchanges later, it gave me a complete file, including a parser, an evaluator, O(1) interactions and everything. The file compiled, and worked on my test inputs. There are some minor issues, but it is mostly correct. So, in about 30 minutes, o3 performed a job that would take me a day or so.” - Victor Taelin
* “Can confirm OpenAI Deep Research is quite strong. In a few minutes it did what used to take a dozen hours. The implications to knowledge work is going to be quite profound when you just ask an AI Agent to perform full tasks for you and come back with a finished result.” - Aaron Levie
* “Deep Research is genuinely useful” - Gary Marcus

With the advent of “Deep Research” agents, we are now routinely asking models to go through 100+ websites and generate in-depth reports on any topic.
The Deep Research revolution has hit the AI scene in the last 2 weeks:

* Dec 11th: Gemini Deep Research (today's guest!) rolls out with Gemini Advanced
* Feb 2nd: OpenAI releases Deep Research
* Feb 3rd: a dozen “Open Deep Research” clones launch
* Feb 5th: Gemini 2.0 Flash GA
* Feb 15th: Perplexity launches Deep Research
* Feb 17th: xAI launches Deep Search

In today's episode, we welcome Aarush Selvan and Mukund Sridhar, the lead PM and tech lead for Gemini Deep Research, the originators of the entire category. We asked detailed questions, from inspiration to implementation: why they had to fine-tune a special model for it instead of using the standard Gemini model, how to run evals for them, and how to think about the distribution of use cases. (We also have an upcoming Gemini 2 episode with our returning first guest Logan Kilpatrick, so stay tuned.)
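Since “agentic” carries most of the weight in that definition, here is a minimal sketch of the loop it implies: a model repeatedly chooses the next step (search, read, or finish) until it can write a report. Everything below is an illustrative stand-in; the helper functions are invented for this sketch and are not any vendor's actual API.

```python
# Toy "deep research" agent loop. All helpers are stubs for illustration;
# a real agent would call a model API and real search/browse tools.
def llm_choose_action(context: list) -> dict:
    # Stub policy: search once, read once, then finish. In a real agent,
    # the LLM itself decides this next step, given the context and tools.
    n = len(context)
    if n == 1:
        return {"tool": "search", "query": "history of MapReduce"}
    if n == 2:
        return {"tool": "read", "url": "https://en.wikipedia.org/wiki/MapReduce"}
    return {"tool": "finish"}

def web_search(query: str) -> list:
    return ["https://en.wikipedia.org/wiki/MapReduce"]  # stubbed results

def fetch_page(url: str) -> str:
    return f"(contents of {url})"  # stubbed page text

def llm_write_report(context: list) -> str:
    return "REPORT:\n" + "\n".join(context)  # a real agent would synthesize

def deep_research(question: str, max_steps: int = 20) -> str:
    context = [f"Research question: {question}"]
    for _ in range(max_steps):
        action = llm_choose_action(context)  # the LLM picks the next step
        if action["tool"] == "search":
            context.append(f"Search results: {web_search(action['query'])}")
        elif action["tool"] == "read":
            context.append(fetch_page(action["url"]))
        else:
            break
    return llm_write_report(context)

print(deep_research("What made MapReduce influential?"))
```

The products discussed in the episode differ mainly in that the decision function is a custom-tuned frontier model with real tools, and the accumulated context is managed carefully rather than grown without bound.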

The Lunar Society
Jeff Dean & Noam Shazeer – 25 years at Google: from PageRank to AGI


Feb 12, 2025 · 134:43


This week I welcome on the show two of the most important technologists ever, in any field.

Jeff Dean is Google's Chief Scientist, and through 25 years at the company has worked on basically the most transformative systems in modern computing: from MapReduce, BigTable, TensorFlow, and AlphaChip to Gemini.

Noam Shazeer invented or co-invented all the main architectures and techniques that are used for modern LLMs: from the Transformer itself, to Mixture of Experts, to Mesh TensorFlow, to Gemini and many other things.

We talk about their 25 years at Google, going from PageRank to MapReduce to the Transformer to MoEs to AlphaChip, and maybe soon to ASI.

My favorite part was Jeff's vision for Pathways, Google's grand plan for a mutually reinforcing loop of hardware and algorithmic design, and for going past autoregression. That culminates in us imagining *all* of Google-the-company going through one huge MoE model.

And Noam just bites every bullet: 100x world GDP soon; let's get a million automated researchers running in the Google datacenter; living to see the year 3000.

Sponsors

Scale partners with major AI labs like Meta, Google DeepMind, and OpenAI. Through Scale's Data Foundry, labs get access to high-quality data to fuel post-training, including advanced reasoning capabilities. If you're an AI researcher or engineer, learn about how Scale's Data Foundry and research lab, SEAL, can help you go beyond the current frontier at scale.com/dwarkesh.

Curious how Jane Street teaches their new traders? They use Figgie, a rapid-fire card game that simulates the most exciting parts of markets and trading. It's become so popular that Jane Street hosts an inter-office Figgie championship every year. Download it from the app store or play on your desktop at figgie.com.

Meter wants to radically improve the digital world we take for granted. They're developing a foundation model that automates network management end-to-end. To do this, they just announced a long-term partnership with Microsoft for tens of thousands of GPUs, and they're recruiting a world-class AI research team. To learn more, go to meter.com/dwarkesh.

To sponsor a future episode, visit: dwarkeshpatel.com/p/advertise.

Timestamps

00:00:00 - Intro
00:02:44 - Joining Google in 1999
00:05:36 - Future of Moore's Law
00:10:21 - Future TPUs
00:13:13 - Jeff's undergrad thesis: parallel backprop
00:15:10 - LLMs in 2007
00:23:07 - “Holy s**t” moments
00:29:46 - AI fulfills Google's original mission
00:34:19 - Doing Search in-context
00:38:32 - The internal coding model
00:39:49 - What will 2027 models do?
00:46:00 - A new architecture every day?
00:49:21 - Automated chip design and intelligence explosion
00:57:31 - Future of inference scaling
01:03:56 - Already doing multi-datacenter runs
01:22:33 - Debugging at scale
01:26:05 - Fast takeoff and superalignment
01:34:40 - A million evil Jeff Deans
01:38:16 - Fun times at Google
01:41:50 - World compute demand in 2030
01:48:21 - Getting back to modularity
01:59:13 - Keeping a giga-MoE in-memory
02:04:09 - All of Google in one model
02:12:43 - What's missing from distillation
02:18:03 - Open research, pros and cons
02:24:54 - Going the distance

Get full access to Dwarkesh Podcast at www.dwarkeshpatel.com/subscribe

Engineering Kiosk
#128 Devs must read scientific papers!?


Jun 18, 2024 · 61:02


How do you actually read scientific papers properly?

You're browsing Hacker News and an article is trending about a new algorithm that is 100 times better than some other one. The post already has 1,500 comments. One thing is clear to you: you MUST read this. You click through and realize: “Uh... it's a scientific paper.”

You ask yourself: do you slog through it? Or would you rather search YouTube for a summary? That's probably how it goes for many non-academics, because these documents can be boring and dry, full of formulas that only 3% of humanity understands anyway.

But what if you don't read scientific papers front to back like normal books? How do you read these documents properly, so that you don't constantly nod off? That's what this episode is about: Wolfgang explains the tricks and techniques for getting the most out of the latest scientific findings in a short amount of time.

Bonus: Bit-shifting is still a contentious topic.

Data Engineering Podcast
Ship Smarter Not Harder With Declarative And Collaborative Data Orchestration On Dagster+


Mar 24, 2024 · 55:39


Summary

A core differentiator of Dagster in the ecosystem of data orchestration is their focus on software-defined assets as a means of building declarative workflows. With their launch of Dagster+ as the redesigned commercial companion to the open source project, they are investing in that capability with a suite of new features. In this episode Pete Hunt, CEO of Dagster Labs, outlines these new capabilities, how they reduce the burden on data teams, and the increased collaboration that they enable across teams and business units.

Announcements

* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Pete Hunt about how the launch of Dagster+ will level up your data platform and orchestrate across language platforms

Interview

* Introduction
* How did you get involved in the area of data management?
* Can you describe what the focus of Dagster+ is and the story behind it?
  * What problems are you trying to solve with Dagster+?
* What are the notable enhancements beyond the Dagster Core project that this updated platform provides?
  * How is it different from the current Dagster Cloud product?
* In the launch announcement you tease new capabilities that would be great to explore in turn:
  * Make data a team sport, enabling data teams across the organization
  * Deliver reliable, high quality data the organization can trust
  * Observe and manage data platform costs
  * Master the heterogeneous collection of technologies, both traditional and Modern Data Stack
* What are the business/product goals that you are focused on improving with the launch of Dagster+?
* What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on the design and launch of Dagster+?
* When is Dagster+ the wrong choice?
* What do you have planned for the future of Dagster/Dagster Cloud/Dagster+?
Contact Info

* Twitter (https://twitter.com/floydophone)
* LinkedIn (https://linkedin.com/in/pwhunt)

Parting Question

* From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
* Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.

Links

* Dagster (https://dagster.io/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/dagster-data-applications-episode-104)
* Dagster+ Launch Event (https://dagster.io/events/dagster-plus-launch-event)
* Hadoop (https://hadoop.apache.org/)
* MapReduce (https://en.wikipedia.org/wiki/MapReduce)
* Pydantic (https://docs.pydantic.dev/latest/)
* Software Defined Assets (https://docs.dagster.io/concepts/assets/software-defined-assets)
* Dagster Insights (https://docs.dagster.io/dagster-cloud/insights)
* Dagster Pipes (https://docs.dagster.io/guides/dagster-pipes)
* Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law)
* Data Mesh (https://www.datamesh-architecture.com/)
* Dagster Code Locations (https://docs.dagster.io/concepts/code-locations)
* Dagster Asset Checks (https://docs.dagster.io/concepts/assets/asset-checks)
* Dave & Buster's (https://www.daveandbusters.com/us/en/home)
* SQLMesh (https://sqlmesh.readthedocs.io/en/latest/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380)
* SDF (https://www.sdf.com/)
* Malloy (https://www.malloydata.dev/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
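For readers new to the software-defined assets idea the episode centers on, here is a minimal sketch using Dagster's documented `@asset` decorator. The asset names and their logic are invented for illustration:

```python
from dagster import asset, materialize

@asset
def raw_orders() -> list:
    # Each @asset declares a data artifact; in a real pipeline this would
    # load from an API or a warehouse rather than returning a literal.
    return [{"sku": "A", "qty": 2}, {"sku": "B", "qty": 1}]

@asset
def order_summary(raw_orders: list) -> dict:
    # The dependency on raw_orders is declared simply by naming it as a
    # parameter; Dagster derives the graph from the function signatures.
    return {"total_items": sum(order["qty"] for order in raw_orders)}

if __name__ == "__main__":
    # materialize() runs the assets in dependency order.
    result = materialize([raw_orders, order_summary])
    assert result.success
```

This is the declarative shift discussed in the episode: you describe the assets that should exist and how they derive from one another, and the orchestrator, not the author, decides execution order.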

Data Engineering Podcast
Designing Data Transfer Systems That Scale


Dec 4, 2023 · 63:57


Summary

The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.

Announcements

* Hello and welcome to the Data Engineering Podcast, the show about modern data management
* Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
* You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation, or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results, all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
* This episode is brought to you by Datafold, a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today!
* Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
* Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high-bandwidth and low-latency change data capture

Interview

* Introduction
* How did you get involved in the area of data management?
* Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?
  * What were the shortcomings of other options in the ecosystem that led you to building a new system?
* What was the design of your initial solution to the problem?
  * What are the sharp edges that you had to deal with to operate and use that initial implementation?
  * What were the limitations of the system as you started to scale it?
* Can you describe the current architecture of your data transfer platform?
  * What are the capabilities and constraints that you are optimizing for?
* As you move beyond the initial use case that started you down this path, what are the complexities involved in generalizing to add new functionality or integrate with additional platforms?
* What are the most interesting, innovative, or unexpected ways that you have seen your data transfer service used?
* What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data transfer system?
* When is DoubleCloud Data Transfer the wrong choice?
* What do you have planned for the future of DoubleCloud Data Transfer?

Contact Info

* LinkedIn (https://www.linkedin.com/in/andrei-tserakhau/)

Parting Question

* From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

* Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
* Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
* If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
* To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers

Links

* DoubleCloud (https://double.cloud/)
* Kafka (https://kafka.apache.org/)
* MapReduce (https://en.wikipedia.org/wiki/MapReduce)
* Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
* ClickHouse (https://clickhouse.com/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
* Iceberg (https://iceberg.apache.org/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
* Delta Lake (https://delta.io/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
* dbt (https://www.getdbt.com/)
* OpenMetadata (https://open-metadata.org/)
  * Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Speaker: Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and for the last 4 years has been working on distributed systems with a focus on data delivery systems.
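As background for the change data capture discussion, here is a toy sketch of the core mechanic: consume an ordered stream of row-level change events and apply them to a replica. The event shape and `apply` function are invented for illustration; production systems read the source database's write-ahead log and have to handle ordering, batching, and schema changes on top of this.

```python
# Toy CDC apply loop: keep a replica in sync by replaying ordered change
# events. Inserts and updates are treated as upserts; deletes are idempotent.
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "name": "Ada"}},
    {"op": "update", "key": 1, "row": {"id": 1, "name": "Ada Lovelace"}},
    {"op": "delete", "key": 1},
]

replica: dict = {}

def apply(event: dict) -> None:
    if event["op"] == "delete":
        replica.pop(event["key"], None)
    else:
        replica[event["key"]] = event["row"]

for event in events:
    apply(event)

print(replica)  # {} once the delete has been replayed
```

Treating every change as an upsert or an idempotent delete is what makes replay safe when a transfer system retries after a failure, which is exactly the operational concern the interview digs into.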

Latent Space: The AI Engineer Podcast — CodeGen, Agents, Computer Vision, Data Science, AI UX and all things Software 3.0

Catch us at Modular's ModCon next week with Chris Lattner, and join our community!

Due to Bryan's very wide-ranging experience in data science and AI across Blue Bottle (!), Stitch Fix, Weights & Biases, and now Hex Magic, this episode can be considered a two-parter.

Notebooks = Chat++

We've talked a lot about AI UX (in our meetups, writeups, and guest posts), and today we're excited to dive into a new old player in AI interfaces: notebooks! Depending on your background, you either Don't Like or you Like notebooks. They are the most popular example of Knuth's Literate Programming concept: basically a collection of cells, where each cell can execute code, display it, and share its state with all the other cells in a notebook. They can also simply be Markdown cells to add commentary to the analysis. Notebooks have a long history but most recently became popular from IPython evolving into Project Jupyter, and a wave of notebook-based startups from Observable to DeepNote and Databricks sprung up for the modern data stack.

The first wave of AI applications has been very chat-focused (ChatGPT, Character.ai, Perplexity, etc.). Chat as a user interface has a few shortcomings, the major one being the inability to edit previous messages. We enjoyed Bryan's takes on why notebooks feel like “Chat++” and how they are building Hex Magic:

* Atomic actions vs stream of consciousness: in a chat interface, you make corrections by adding more messages to a conversation (i.e. “Can you try again by doing X instead?” or “I actually meant XYZ”). The context can easily get messy and confusing for models (and humans!) to follow. Notebooks' cell structure, on the other hand, allows users to go back to any previous cells and make edits without having to add new ones at the bottom.
* “Airlocks” for repeatability: one of the ideas they came up with at Hex is “airlocks”, a collection of cells that depend on each other and keep each other in sync. If you have a task like “Create a summary of my customers' recent purchases”, there are many sub-tasks to be done (look up the data, sum the amounts, write the text, etc.). Each sub-task will be in its own cell, and the airlock will keep them all in sync together.
* Technical + non-technical users: previously you had to use Python / R / Julia to write notebook code, but with models like GPT-4, natural language is usually enough. Hex is also working on lowering the barrier of entry for non-technical users into notebooks, similar to how Code Interpreter is doing the same in ChatGPT.

Obviously notebooks aren't new for developers (OpenAI Cookbooks are a good example), but they haven't had much adoption in less technical spheres. Some of the shortcomings of chat UIs, plus LLMs lowering the barrier of entry to creating code cells, might make them a much more popular UX going forward.

RAG = RecSys!

We also talked about the LLMOps landscape and why it's an “iron mine” rather than a “gold rush”:

“I'll shamelessly steal [this] from a friend, Adam Azzam from Prefect. He says that [LLMOps] is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. Don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this resource to something valuable is significant.”

Some of my favorite takeaways:

* RAG as RecSys for LLMs: at its core, the goal of a RAG pipeline is finding the most relevant documents based on a task. This isn't very different from traditional recommendation system products that surface things for users. How can we apply old lessons to this new problem? Bryan cites fellow AIE Summit speaker and Latent Space Paper Club host Eugene Yan in decomposing the retrieval problem into retrieval, filtering, and scoring/ranking/ordering. As AI Engineers increasingly find that long context has tradeoffs, they will also have to relearn the age-old lessons that vector search is NOT all you need and that a good systems-not-models approach is essential to scalable, debuggable RAG. Good thing Bryan has just written the first O'Reilly book about modern RecSys, eh?
* Narrowing down evaluation: while “hallucination” is an easy term to throw around, the reality is more nuanced. A lot of the time, model errors can be automatically fixed: is this JSON valid? If not, why? Is it just missing a closing brace? These smaller issues can be checked and fixed before returning the response to the user, which is easier than fixing the model.
* Fine-tuning isn't all you need: when they first started building Magic, one of the discussions was around fine-tuning a model. In our episode with Jeremy Howard we talked about how fine-tuning leads to loss of capabilities as well. In notebooks, you are often dealing with domain-specific data (i.e. purchases, orders, wardrobe composition, household items, etc.); the fact that the model understands that “items” are probably part of an “order” is really helpful. They have found that GPT-4 + 3.5-turbo were everything they needed to ship a great product rather than having to fine-tune on notebooks specifically.

Definitely recommend listening to this one if you are interested in getting a better understanding of how to think about AI, data, and how we can use traditional machine learning lessons in large language models.

The AI Pivot

For more Bryan, don't miss his fireside chat at the AI Engineer Summit.

Show Notes

* Hex Magic
* Bryan's new book: Building Recommendation Systems in Python and JAX
* Bryan's whitepaper about MLOps
* “Kitbashing in ML”, slides from his talk on building on top of foundation models
* “Bayesian Statistics The Fun Way” by Will Kurt
* Bryan's Twitter
* “Berkeley man determined to walk every street in his city”
* People: Adam Azzam, Graham Neubig, Eugene Yan, Even Oldridge

Timestamps

* [00:00:00] Bryan's background
* [00:02:34] Overview of Hex and the Magic product
* [00:05:57] How Magic handles the complex notebook format to integrate cleanly with Hex
* [00:08:37] Discussion of whether to build vs buy models - why Hex uses GPT-4 vs fine-tuning
* [00:13:06] UX design for Magic with Hex's notebook format (aka “Chat++”)
* [00:18:37] Expanding notebooks to less technical users
* [00:23:46] The “Memex” as an exciting underexplored area - personal knowledge graph and memory augmentation
* [00:27:02] What makes for good LLMops vs MLOps
* [00:34:53] Building rigorous evaluators for Magic and best practices
* [00:36:52] Different types of metrics for LLM evaluation beyond just end task accuracy
* [00:39:19] Evaluation strategy when you don't own the core model that's being evaluated
* [00:41:49] All the places you can make improvements outside of retraining the core LLM
* [00:45:00] Lightning Round

Transcript

Alessio: Hey everyone, welcome to the Latent Space Podcast. This is Alessio, Partner and CTO-in-Residence of Decibel Partners, and today I'm joined by Bryan Bischof. [00:00:15]

Bryan: Hey, nice to meet you.
[00:00:17] Alessio: So Bryan has one of the most thorough and impressive backgrounds we've had on the show so far. Lead software engineer at Blue Bottle Coffee, which, if you live in San Francisco, you know a lot about. And maybe you'll tell us 30 seconds on what that actually means. You worked as a data scientist at Stitch Fix, which used to be one of the premier data science teams out there. [00:00:38]

Bryan: It used to be. Ouch. [00:00:39]

Alessio: Well, no, no. Well, you left, you know, so how good can it still be? Then head of data science at Weights and Biases. You're also a professor at Rutgers and you're just wrapping up a new O'Reilly book as well. So a lot, a lot going on. Yeah. [00:00:52]

Bryan: And currently head of AI at Hex. [00:00:54]

Alessio: Let's do the Blue Bottle thing because I definitely want to hear what's the, what's that like? [00:00:58]

Bryan: So I was leading data at Blue Bottle. I was the first data hire. I came in to kind of get the data warehouse in order and then see what we could build on top of it. But ultimately I mostly focused on demand forecasting, a little bit of recsys, a little bit of sort of like website optimization and analytics. But ultimately anything that you could imagine sort of like a retail company needing to do with their data, we had to do. I sort of like led that team, hired a few people, expanded it out. One interesting thing was I was part of the Nestle acquisition. So there was a period of time where we were sort of preparing for that and didn't know, which was a really interesting dynamic. Being acquired is a very not necessarily fun experience for the data team. [00:01:37]

Alessio: I built a lot of internal tools for sourcing at the firm and we have a small VC and data community of like other people doing it. And I feel like if you had a data feed into like the Blue Bottle in South Park, the Blue Bottle at the Hanahaus in Palo Alto, you can get a lot of secondhand information on the state of VC funding. [00:01:54]

Bryan: Oh yeah. I feel like the real source of alpha is just bugging a Blue Bottle. [00:01:58]

Alessio: Exactly. And what's your latest book about? [00:02:02]

Bryan: I just wrapped up a book with a co-author Hector Yee called Building Production Recommendation Systems. I'll give you the rest of the title because it's fun. It's in Python and JAX. And so for those of you that are like eagerly awaiting the first O'Reilly book that focuses on JAX, here you go. [00:02:17]

Alessio: Awesome. And we'll chat about that later on. But let's maybe talk about Hex and Magic before. I've known Hex for a while, I've used it as a notebook provider and you've been working on a lot of amazing AI enabled experiences. So maybe run us through that. [00:02:34]

Bryan: So I too, before I sort of like joined Hex, saw it as this like really incredible notebook platform, sort of a great place to do data science workflows, quite complicated, quite ad hoc interactive ones. And before I joined, I thought it was the best place to do data science workflows. And so when I heard about the possibility of building AI tools on top of that platform, that seemed like a huge opportunity. In particular, I lead the product called Magic. Magic is really like a suite of sort of capabilities as opposed to its own independent product. What I mean by that is they are sort of AI enhancements to the existing product. And that's a really important difference from sort of building something totally new that just uses AI.
It's really important to us to enhance the already incredible platform with AI capabilities. So these are things like the sort of obvious like co-pilot-esque vibes, but also more interesting and dynamic ways of integrating AI into the product. And ultimately the goal is just to make people even more effective with the platform. [00:03:38]

Alessio: How do you think about the evolution of the product and the AI component? You know, even if you think about 10 months ago, some of these models were not really good on very math based tasks. Now they're getting a lot better. I'm guessing a lot of your workloads and use cases is data analysis and whatnot. [00:03:53]

Bryan: When I joined, it was pre 4 and it was pre the sort of like new chat API and all that. But when I joined, it was already clear that GPT was pretty good at writing code. And so when I joined, they had already executed on the vision of what if we allowed the user to ask a natural language prompt to an AI and have the AI assist them with writing code. So what that looked like when I first joined was it had some capability of writing SQL and it had some capability of writing Python and it had the ability to explain and describe code that was already written. Those very, what feel like now primitive capabilities, believe it or not, were already quite cool. It's easy to look back and think, oh, it's like kind of like Stone Age in these timelines. But to be clear, when you're building on such an incredible platform, adding a little bit of these capabilities feels really effective. And so almost immediately I started noticing how it affected my own workflow because ultimately as sort of like an engineering lead and a lot of my responsibility is to be doing analytics to make data driven decisions about what products we build. And so I'm actually using Hex quite a bit in the process of like iterating on our product. When I'm using Hex to do that, I'm using Magic all the time. And even in those early days, the amount that it sped me up, that it enabled me to very quickly like execute was really impressive. And so even though the models weren't that good at certain things back then, that capability was not to be underestimated. But to your point, the models have evolved between 3.5 Turbo and 4. We've actually seen quite a big enhancement in the kinds of tasks that we can ask Magic and even more so with things like function calling and understanding a little bit more of the landscape of agent workflows, we've been able to really accelerate. [00:05:57]

Alessio: You know, I tried using some of the early models in notebooks and it actually didn't like the IPyNB formatting, kind of like a JSON plus XML plus all these weird things. How have you kind of tackled that? Do you have some magic behind the scenes to make it easier for models? Like, are you still using completely off the shelf models? Do you have some proprietary ones? [00:06:19]

Bryan: We are using at the moment in production 3.5 Turbo and GPT-4. I would say for a large number of our applications, GPT-4 is pretty much required. To your question about, does it understand the structure of the notebook? And does it understand all of this somewhat complicated wrappers around the content that you want to show? We do our very best to abstract that away from the model and make sure that the model doesn't have to think about what the cell wrapper code looks like. Or for our Magic charts, it doesn't have to speak the language of Vega.
These are things that we put a lot of work in on the engineering side, to the AI engineer profile. This is the AI engineering work to get all of that out of the way so that the model can speak in the languages that it's best at. The model is quite good at SQL. So let's ensure that it's speaking the language of SQL and that we are doing the engineering work to get the output of that model, the generations, into our notebook format. So too for other cell types that we support, including charts, and just in general, understanding the flow of different cells, understanding what a notebook is, all of that is hard work that we've done to ensure that the model doesn't have to learn anything like that. I remember early on, people asked the question, are you going to fine tune a model to understand Hex cells? And almost immediately, my answer was no. No we're not. Using fine-tuned models in 2022, I was already aware that there are some limitations of that approach and frankly, even using GPT-3 and GPT-2 back in the day in Stitch Fix, I had already seen a lot of instances where putting more effort into pre- and post-processing can avoid some of these larger lifts. [00:08:14]

Alessio: You mentioned Stitch Fix and GPT-2. How has the balance between build versus buy, so to speak, evolved? So GPT-2 was a model that was not super advanced, so for a lot of use cases it was worth building your own thing. Is with GPT-4 and the likes, is there a reason to still build your own models for a lot of this stuff? Or should most people be fine-tuning? How do you think about that? [00:08:37]

Bryan: Sometimes people ask, why are you using GPT-4 and why aren't you going down the avenue of fine-tuning today? I can get into fine-tuning specifically, but I do want to talk a little bit about the good old days of GPT-2. Shout out to Reza. Reza introduced me to GPT-2. I still remember him explaining the difference between general transformers and GPT. I remember one of the tasks that we wanted to solve with transformer-based generative models at Stitch Fix were writing descriptions of clothing. You might think, ooh, that's a multi-modal problem. The answer is, not necessarily. We actually have a lot of features about the clothes that are almost already enough to generate some reasonable text. I remember at that time, that was one of the first applications that we had considered. There was a really great team of NLP scientists at Stitch Fix who worked on a lot of applications like this. I still remember being exposed to the GPT endpoint back in the days of 2. If I'm not mistaken, and feel free to fact check this, I'm pretty sure Stitch Fix was the first OpenAI customer, like their first true enterprise application. Long story short, I ultimately think that depending on your task, using the most cutting-edge general model has some advantages. If those are advantages that you can reap, then go for it. So at Hex, why GPT-4? Why do we need such a general model for writing code, writing SQL, doing data analysis? Shouldn't a fine-tuned model just on Kaggle notebooks be good enough? I'd argue no. And ultimately, because we don't have one specific sphere of data that we need to write great data analysis workbooks for, we actually want to provide a platform for anyone to do data analysis about their business. To do that, you actually need to entertain an extremely general universe of concepts. So as an example, if you work at Hex and you want to do data analysis, our projects are called Hexes. That's relatively straightforward to teach it.
There's a concept of a notebook. These are data science notebooks, and you want to ask analytics questions about notebooks. Maybe if you trained on notebooks, you could answer those questions, but let's come back to Blue Bottle. If I'm at Blue Bottle and I have data science work to do, I have to ask it questions about coffee. I have to ask it questions about pastries, doing demand forecasting. And so very quickly, you can see that just by serving just those two customers, a model purely fine-tuned on like Kaggle competitions may not actually fit the bill. And so the more and more that you want to build a platform that is sufficiently general for your customer base, the more I think that these large general models really pack a lot of additional opportunity in. [00:11:21]

Alessio: With a lot of our companies, we talked about stuff that you used to have to extract features for, now you have out of the box. So say you're a travel company, you want to do a query, like show me all the hotels and places that are warm during spring break. It would be just literally like impossible to do before these models, you know? But now the model knows, okay, spring break is like usually these dates and like these locations are usually warm. So you get so much out of it for free. And in terms of Magic integrating into Hex, I think AI UX is one of our favorite topics and how do you actually make that seamless. In traditional code editors, the line of code is like kind of the atomic unit and in Hex, you have the code, but then you have the cell also. [00:12:04]

Bryan: I think the first time I saw Copilot and really like fell in love with Copilot, I thought finally, fancy auto-complete. And that felt so good. It felt so elegant. It felt so right sized for the task. But as a data scientist, a lot of the work that you do previous to the ML engineering part of the house, you're working in these cells and these cells are atomic. They're expressing one idea. And so ultimately, if you want to make the transition from something like this code, where you've got like a large amount of code and there's a large amount of files and they kind of need to have awareness of one another, and that's a long story and we can talk about that. But in this atomic, somewhat linear flow through the notebook, what you ultimately want to do is you want to reason with the agent at the level of these individual thoughts, these atomic ideas. Usually it's good practice in say Jupyter notebook to not let your cells get too big. If your cell doesn't fit on one page, that's like kind of a code smell, like why is it so damn big? What are you doing in this cell? That also lends some hints as to what the UI should feel like. I want to ask questions about this one atomic thing. So you ask the agent, take this data frame and strip out this prefix from all the strings in this column. That's an atomic task. It's probably about two lines of pandas. I can write it, but it's actually very natural to ask magic to do that for me. And what I promise you is that it is faster to ask magic to do that for me. At this point, that kind of code, I never write. And so then you ask the next question, which is what should the UI be to do chains, to do multiple cells that work together? Because ultimately a notebook is a chain of cells and actually it's a first class citizen for Hex. So we have a DAG and the DAG is the execution DAG for the individual cells. This is one of the reasons that Hex is reactive and kind of dynamic in that way.
And so the very next question is, what is the sort of like AI UI for these collections of cells? And back in June and July, we thought really hard about what does it feel like to ask magic a question and get a short chain of cells back that execute on that task. And so we've thought a lot about sort of like how that breaks down into individual atomic units and how those are tied together. We introduced something which is kind of an internal name, but it's called the airlock. And the airlock is exactly a sequence of cells that refer to one another, understand one another, use things that are happening in other cells. And it gives you a chance to sort of preview what magic has generated for you. Then you can accept or reject as an entire group. And that's one of the reasons we call it an airlock, because at any time you can sort of eject the airlock and see it in the space. But to come back to your question about how the AI UX fits into this notebook, ultimately a notebook is very conversational in its structure. I've got a series of thoughts that I'm going to express as a series of cells. And sometimes if I'm a kind data scientist, I'll put some text in between them too, explaining what on earth I'm doing. And that feels, in my opinion, and I think this is quite shared amongst Hexans, that feels like a really nice refinement of the chat UI. I've been saying for several months now, like, please stop building chat UIs. There is some irony because I think what the notebook allows is like chat plus plus. [00:15:36]

Alessio: Yeah, I think the first wave of everything was like chat with X. So it was like chat with your data, chat with your documents and all of this. But people want to code, you know, at the end of the day. And I think that goes into the end user. I think most people that use notebooks are software engineers and data scientists. I think the cool thing about these models is that people who are not traditionally technical can do a lot of very advanced things. And that's why people like Code Interpreter and ChatGPT. How do you think about the evolution of that persona? Do you see a lot of non-technical people also now coming to Hex to like collaborate with like their technical folks? [00:16:13]

Bryan: Yeah, I would say there might even be more enthusiasm than we're prepared for. We're obviously like very excited to bring what we call the like low floor user into this world and give more people the opportunity to self-serve on their data. We wanted to start by focusing on users who are already familiar with Hex and really make magic fantastic for them. One of the sort of like internal, I would say almost North Stars is our team's charter is to make Hex feel more magical. That is true for all of our users, but that's easiest to do on users that are already able to use Hex in a great way. What we're hearing from some customers in particular is sort of like, I'm excited for some of my less technical stakeholders to get in there and start asking questions. And so that raises a lot of really deep questions. If you immediately enable self-service for data, which is almost like a joke over the last like maybe like eight years, if you immediately enabled self-service, what challenges does that bring with it? What risks does that bring with it? And so it has given us the opportunity to think about things like governance and to think about things like alignment with the data team and making sure that the data team has clear visibility into what the self-service looks like.
Having been leading a data team, trying to provide answers for stakeholders and hearing that they really want to self-serve, a question that we often found ourselves asking is, what is the easiest way that we can keep them on the rails? What is the easiest way that we can set up the data warehouse and set up our tools such that they can ask and answer their own questions without coming away with like false answers? Because that is such a priority for data teams, it becomes an important focus of my team, which is, okay, magic may be an enabler. And if it is, what do we also have to respect? We recently introduced the data manager and the data manager is an auxiliary sort of like tool on the Hex platform to allow people to write more like relevant metadata about their data warehouse to make sure that magic has access to the best information. And there are some things coming to kind of even further that story around governance and understanding. [00:18:37]

Alessio: You know, you mentioned self-serve data. And when I said it was like a joke, you know, the whole rush to the modern data stack was something to behold. Do you think AI is like in a similar space where it's like a bit of a gold rush? [00:18:51]

Bryan: I have like sort of two comments here. One I'll shamelessly steal from a friend, Adam Azzam from Prefect. He says that this is more of like an iron mine than a gold mine in the sense of there is a lot of work to extract this precious, precious resource. And that's the first one is I think, don't expect to just go down to the stream and do a little panning. There's a lot of work to be done. And frankly, the steps to go from this like gold to, or this resource to something valuable is significant. I think people have gotten a little carried away with the old maxim of like, don't go pan for gold, sell pickaxes and shovels. It's a much stronger business model. At this point, I feel like I look around and I see more pickaxe salesmen and shovel salesmen than I do prospectors. And that scares me a little bit. There's a metagame where people are starting to think about how they can build tools for people building tools for AI. And that starts to give me a little bit of like pause in terms of like, how confident are we that we can even extract this resource into something valuable? I got a text message from a VC earlier today, and I won't name the VC or the fund, but the question was, what are some medium or large size companies that have integrated AI into their platform in a way that you're really impressed by? And I looked at the text message for a few minutes and I was finding myself thinking and thinking, and I responded, maybe only co-pilot. It's been a couple hours now, and I don't think I've thought of another one. And I think that's where I reflect again on this, like iron versus gold. If it was really gold, I feel like I'd be more blown away by other AI integrations. And I'm not yet. [00:20:40]

Alessio: I feel like all the people finding gold are the ones building things that traditionally we didn't focus on. So like Midjourney. I've talked to a company yesterday, which I'm not going to name, but they do agents for some use case, let's call it. They are 11 months old. They're making like 8 million a month in revenue, but in a space that you wouldn't even think about selling to. If you were like a shovel builder, you wouldn't even go sell to those people. And Swix talks about this a bunch, about like actually trying to go application first for some things.
Let's actually see what people want to use and what works. What do you think are the most maybe underexplored areas in AI? Is there anything that you wish people were actually trying to shovel? [00:21:23]

Bryan: I've been saying for a couple of months now, if I had unlimited resources and I was just sort of like truly like, you know, on my own building whatever I wanted, I think the thing that I'd be most excited about is building sort of like the personal Memex. The Memex is something that I've wanted since I was a kid. And are you familiar with the Memex? It's the memory extender. And it's this idea that sort of like human memory is quite weak. And so if we can extend that, then that's a big opportunity. So I think one of the things that I've always found to be one of the limiting cases here is access. How do you access that data? Even if you did build that data like out, how would you quickly access it? And one of the things I think there's a constellation of technologies that have come together in the last couple of years that now make this quite feasible. Like information retrieval has really improved and we have a lot more simple systems for getting started with information retrieval to natural language is ultimately the interface that you'd really like these systems to work on, both in terms of sort of like structuring the data and preparing the data, but also on the retrieval side. So what keys off the query for retrieval, probably ultimately natural language. And third, if you really want to go into like the purely futuristic aspect of this, it is latent voice to text. And that is also something that has quite recently become possible. I did talk to a company recently called gather, which seems to have some cool ideas in this direction, but I haven't seen yet what I, what I really want, which is I want something that is sort of like every time I listen to a podcast or I watch a movie or I read a book, it sort of like has a great vector index built on top of all that information that's contained within. And then when I'm having my next conversation and I can't quite remember the name of this person who did this amazing thing, for example, if we're talking about the Memex, it'd be really nice to have Vannevar Bush like pop up on my, you know, on my Memex display, because I always forget Vannevar Bush's name. This is one time that I didn't, but I often do. This is something that I think is only recently enabled and maybe we're still five years out before it can be good, but I think it's one of the most exciting projects that has become possible in the last three years that I think generally wasn't possible before. [00:23:46]

Alessio: Would you wear one of those AI pendants that record everything? [00:23:50]

Bryan: I think I'm just going to do it because I just like support the idea. I'm also admittedly someone who, when Google Glass first came out, thought that seems awesome. I know that there's like a lot of like challenges about the privacy aspect of it, but it is something that I did feel was like a disappointment to lose some of that technology. Fun fact, one of the early Google Glass developers was this MIT computer scientist who basically built the first wearable computer while he was at MIT. And he like took notes about all of his conversations in real time on his wearable and then he would have real time access to them. Ended up being kind of a scandal because he wanted to use a computer during his defense and they like tried to prevent him from doing it.
So pretty interesting story. [00:24:35]Alessio: I don't know, but the future is going to be weird. I can tell you that much. Talking about pickaxes, what do you think about the pickaxes that people built before? Like the whole MLOps space, which has its own like startup graveyard in there. How are those products evolving? You know, you were at Weights and Biases before, which is now doing a big AI push as well. [00:24:57]Bryan: If you really want to like sort of like rub my face in it, you can go look at my white paper on MLOps from 2022. It's interesting. I don't think there's many things in that that I would these days think are like wrong or even sort of like naive. But what I would say is there are both a lot of analogies between MLOps and LLMOps, but there are also a lot of like key differences. So like leading an engineering team at the moment, I think a lot more about good engineering practices than I do about good ML practices. That being said, it's been very convenient to be able to see around corners in a few of the like ML places. One of the first things I did at Hex was work on evals. This was in February. I hadn't yet been overwhelmed by people talking about evals until about May. And the reason that I was able to be a couple of months early on that is because I've been building evals for ML systems for years. I don't know how else to build an ML system other than start with the evals. I teach my students at Rutgers like objective framing is one of the most important steps in starting a new data science project. If you can't clearly state what your objective function is and you can't clearly state how that relates to the problem framing, you've got no hope. And I think that is a very shared reality with LLM applications. Coming back to one thing you mentioned earlier about sort of like the applications of these LLMs: to that end, the pickaxes I think are still very valuable are the ones for understanding systems that are inherently less predictable, that are inherently sort of experimental. On my engineering team, we have an experimentalist. So one of the AI engineers, his focus is experiments. That's something that you wouldn't normally expect to see on an engineering team. But it's important on an AI engineering team to have one person whose entire focus is just experimenting, trying, okay, this is a hypothesis that we have about how the model will behave. Or this is a hypothesis we have about how we can improve the model's performance on this. And then going in, running experiments, augmenting our evals to test it, et cetera. What I really respect are pickaxes that recognize the hybrid nature of the sort of engineering tasks. They are ultimately engineering tasks with a flavor of ML. And so when systems respect that, I tend to have a very high opinion. One thing that I was very, very aligned with Weights and Biases on is sort of composability. These systems like ML systems need to be extremely composable to make them much more iterative. If you don't build these systems in composable ways, then your integration hell is just magnified. When you're trying to iterate as fast as people need to be iterating these days, I think integration hell is a tax not worth paying. [00:27:51]Alessio: Let's talk about some of the LLM native pickaxes, so to speak. So RAG is one. One thing is doing RAG on text data. One thing is doing RAG on tabular data. We're releasing tomorrow our episode with Cube, the semantic layer company. Curious to hear your thoughts on it. 
How are you doing RAG, pros, cons? [00:28:11]Bryan: It became pretty obvious to me almost immediately that RAG was going to be important. Because ultimately, you never expect your model to have access to all of the things necessary to respond to a user's request. So as an example, Magic users would like to write SQL that's relevant to their business. And it's important then to have the right data objects that they need to query. We can't expect any LLM to understand our user's data warehouse topology. So what we can expect is that we can build a RAG system that is data warehouse aware, data topology aware, and use that to provide really great information to the model. If you ask the model, "How are my customers trending over time?" and you ask it to write SQL to do that, what is it going to do? Well, ultimately, it's going to hallucinate the structure of that data warehouse that it needs to write a general query. Most likely what it's going to do is it's going to look in its sort of memory of Stack Overflow responses to customer queries, and it's going to say, oh, it's probably a customers table, and we're in the age of dbt, so it might even be called, you know, dim_customers or something like that. And what's interesting is, and I encourage you to try, ChatGPT will do an okay job of like hallucinating up some tables. It might even hallucinate up some columns. But what it won't do is it won't understand the joins in that data warehouse that it needs, and it won't understand the data caveats or the sort of where clauses that need to be there. And so how do you get it to understand those things? Well, this is textbook RAG. This is the exact kind of thing that you expect RAG to be good at augmenting. But people who have done a lot of thinking about RAG for the document case tend to think of it as chunking and sort of like MapReduce-style approaches. But I think people haven't followed this train of thought quite far enough yet. Jerry Liu was on the show and he talked a little bit about thinking of this as like information retrieval. And I would push that even further. And I would say that ultimately RAG is just RecSys for LLM. As I kind of already mentioned, I'm a little bit recommendation systems heavy. And so from the beginning, RAG has always felt like RecSys to me. It has always felt like you're building a recommendation system. And what are you trying to recommend? The best possible resources for the LLM to execute on a task. And so most of my approach to RAG and the way that we've improved Magic via retrieval is by building a recommendation system. [00:30:49]Alessio: It's funny, as you mentioned that you spent three years writing the book, the O'Reilly book. Things must have changed as you wrote the book. I don't want to bring out any nightmares from there, but what are the tips for people who want to stay on top of this stuff? Do you have any other favorite newsletters, like Twitter accounts that you follow, communities you spend time in? [00:31:10]Bryan: I am sort of an aggressive reader of technical books. I think I'm almost never disappointed by time that I've invested in reading technical manuscripts. I find that most people write O'Reilly or similar books because they've sort of got this itch that they need to scratch, which is that I have some ideas, I have some understanding that was hard won, I need to tell other people. And there's something that, from my experience, correlates between that itch and sort of like useful information. 
As an example, one of the people on my team, his name is Will Kurt, he wrote a book called Bayesian Statistics the Fun Way. I knew some Bayesian statistics, but I read his book anyway. And the reason was because I was like, if someone feels motivated to write a book called Bayesian Statistics the Fun Way, they've got something to say about Bayesian statistics. I learned so much from that book. That book is like technically like targeted at someone with less knowledge and experience than me. And boy, did it humble me about my understanding of Bayesian statistics. And so I think this is a very boring answer, but ultimately like I read a lot of books and I think that they're a really valuable way to learn these things. I also regrettably still read a lot of Twitter. There is plenty of noise in that signal, but ultimately it is still usually like one of the first directions to get sort of an instinct for what's valuable. The other comment that I want to make is we are in this age of sort of like arXiv becoming more of like an ad platform. I think that's a little challenging right now to kind of use it the way that I used to use it, which is for like higher signal. I've chatted a lot with a CMU professor, Graham Neubig, and he's been doing LLM evaluation and LLM enhancements for about five years—and no, I didn't misspeak. And I think talking to him has provided me a lot of like directionality for more believable sources. Trying to cut through the hype. I know that there's a lot of other things that I could mention in terms of like just channels, but ultimately right now I think there's almost an abundance of channels and I'm a little bit more keen on high signal. [00:33:18]Alessio: The other side of it is like, I see so many people say, Oh, I just wrote a paper on X and it's like an article. And I'm like, an article is not a paper, but it's just funny how I know we were kind of chatting before about terms being reinvented and like people that are not from this space kind of getting into AI engineering now. [00:33:36]Bryan: I also don't want to be gatekeepy. Actually I used to say a lot to people, don't be shy about putting your ideas down on paper. I think it's okay to just like kind of go for it. And I, I myself have something on arXiv that is like comically naive. It's intentionally naive. Right now I'm less concerned by more naive approaches to things than I am by the purely like advertising approach to sort of writing these short notes and articles. I think blogging still has a good place. And I remember getting feedback during my PhD thesis that like my thesis sounded more like a long blog post. And I now feel like that curmudgeonly professor who's also like, yeah, maybe just keep this to the blogs. That's funny. Alessio: Uh, yeah, I think one of the things that Swyx said when he was opening the AI engineer summit a couple of weeks ago was like, look, most people here don't know much about the space because it's so new and like being open and welcoming. I think it's one of the goals. And that's why we try and keep every episode at a level that it's like, you know, the experts can understand and learn something, but also the novices can kind of like follow along. You mentioned evals before. I think that's one of the hottest topics obviously out there right now. What are evals? How do we know if they work? Yeah. What are some of the fun learnings from building them into Hex? 
[00:34:53]Bryan: I said something at the AI engineer summit that I think a few people have already called out, which is like, if you can't get your evals to be sort of like objective, then you're not trying hard enough. I stand by that statement. I'm not going to, I'm not going to walk it back. I know that that doesn't feel super good because people want to think that like their unique snowflake of a problem is too nuanced. But I think this is actually one area where, you know, in this dichotomy of like, who can do AI engineering? And the answer is kind of everybody. Software engineering can become AI engineering and ML engineering can become AI engineering. One place where I think the more data-science-minded folk have an advantage is we've gotten more practice in taking very vague notions and trying to put a like objective function around that. And so ultimately I would just encourage everybody who wants to build evals, just work incredibly hard on codifying what is good and bad in terms of these objective metrics. As far as like how you go about turning those into evals, I think it's kind of like sweat equity. Unfortunately, I told the CEO of Gantry several months ago—I think it's been like six months now—that I was sort of like looking at every single internal Hex request to Magic by hand with my eyes and sort of like thinking, how can I turn this into an eval? Is there a way that I can take this real request during this dogfooding, not-very-developed stage? How can I make that into an evaluation? That was a lot of sweat equity that I put in a lot of like boring evenings, but I do think ultimately it gave me a lot of understanding for the way that the model was misbehaving. Another thing is how can you start to understand these misbehaviors as like auxiliary evaluation metrics? So there's not just one evaluation that you want to do for every request. It's easy to say like, did this work? Did this not work? Did the response satisfy the task? But there's a lot of other metrics that you can pull off these questions. And so like, let me give you an example. If it writes SQL that doesn't reference a table in the database that it's supposed to be querying against, we would think of that as a hallucination. You could separately consider 'is it a hallucination' as a valuable metric. You could separately consider, does it get the right answer? 'Did it get the right answer' is this sort of like all-in-one-shot evaluation that I think people jump to. But these intermediary steps are really important. I remember hearing that GitHub had thousands of lines of post-processing code around Copilot to make sure that their responses were sort of correct or in the right place. And that kind of sort of defensive programming against bad responses is the kind of thing that you can build by looking at many different types of evaluation metrics. Because you can say like, oh, you know, the Copilot completion here is mostly right, but it doesn't close the brace. Well, that's a thing you can check for. Or, oh, this completion is quite good, but it defines a variable that was like already defined in the file. Like that's going to have a problem. That's an evaluation that you could check separately. And so this is where I think it's easy to convince yourself that all that matters is does it get the right answer? But the more that you think about production use cases of these things, the more you find a lot of this kind of stuff. 
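To make those auxiliary checks concrete, here is a minimal sketch in Python. The helper names are invented for illustration, and a real harness would use a proper SQL parser and the live warehouse catalog rather than a regex:

    import re

    def tables_referenced(sql: str) -> set:
        """Crudely extract table names that follow FROM or JOIN keywords."""
        return {m.group(2).lower()
                for m in re.finditer(r"\b(from|join)\s+([\w.]+)", sql, re.IGNORECASE)}

    def eval_sql_response(sql: str, known_tables: set) -> dict:
        """Score one model response on several auxiliary metrics, not just pass/fail."""
        referenced = tables_referenced(sql)
        return {
            "hallucinated_tables": sorted(referenced - known_tables),
            "balanced_parens": sql.count("(") == sql.count(")"),
            "references_known_table": bool(referenced & known_tables),
        }

    # A response that invents a `dim_customers` table gets flagged:
    checks = eval_sql_response(
        "SELECT count(*) FROM dim_customers JOIN orders ON orders.customer_id = dim_customers.id",
        known_tables={"orders", "users"},
    )
    # {'hallucinated_tables': ['dim_customers'], 'balanced_parens': True, 'references_known_table': True}

Each key in the returned dict is one auxiliary metric you can track independently of the all-in-one "did it get the right answer" check.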
One simple example is like sometimes the model names the output of a cell with a variable name that's already in scope. Okay. Like we can just detect that and like we can just fix that. And this is the kind of thing where, as you build these evaluations over time, you really can expand the robustness with which you trust these models. And for a company like Hex, where we need to put this stuff in GA, we can't just sort of like get to demo stage or even like private beta stage. We're really hunting GA on all of these capabilities. 'Did it get the right answer in some cases' is not good enough. [00:38:57]Alessio: I think the follow up question to that is in your past roles, you own the model that you're evaluating against. Here you don't actually have control over how the model evolves. How do you think about 'the model will just need to improve, or we'll use another model' versus 'we can build engineering post-processing on top of it'? How do you make the choice? [00:39:19]Bryan: So I want to say two things here. One, like Jerry Liu talked a little bit about in his episode: you don't always want to retrain the weights to serve certain use cases. RAG is another tool that you can use to kind of like soft-tune. I think that's right. And I want to go back to my favorite analogy here, which is like recommendation systems. When you build a recommendation system, you build the objective function. You think about like what kind of recs you want to provide, what kind of features you're allowed to use, et cetera, et cetera. But there's always another step. There's this really wonderful collection of blog posts from Eugene Yan, and then ultimately Even Oldridge kind of like iterated on that for the Merlin project, where there's this multi-stage recommender. And the multi-stage recommender says the first step is to do great retrieval. Once you've done great retrieval, you then need to do great ranking. Once you've done great ranking, you need to then do a good job serving. And so what's the analogy here? RAG is retrieval. You can build different embedding models to encode different features in your latent space to ensure that your ranking model has the best opportunity. Now you might say, oh, well, my ranking model is something that I've got a lot of capability to adjust. I've got full access to my ranking model. I'm going to retrain it. And that's great. And you should. And over time you will. But there's one more step and that's downstream and that's the serving. Serving often sounds like I just show the s**t to the user, but ultimately serving is things like, did I provide diverse recommendations? Going back to Stitch Fix days, I can't just recommend them five shirts of the same silhouette and cut. I need to serve them a diversity of recommendations. Have I respected their requirements? They clicked on something that got them to this place. Are the recommendations relevant to that query? Are there any hard rules? Do we maybe not have this in stock? These are all things that you put downstream. And so much like the recommendations use case, there's a lot of knobs to pull outside of retraining the model. And even in recommendation systems, when do you retrain your model for ranking? Not nearly as much as you do other s**t. And even this like embedding model, you might fiddle with more often than the true ranking model. 
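Read as code, that retrieval → ranking → serving split might look like the hypothetical sketch below, applied to RAG. Everything here is made up for illustration; `scorer` stands in for whatever ranking model you actually control:

    from dataclasses import dataclass
    from typing import Callable, List

    @dataclass
    class Doc:
        name: str
        embedding: List[float]
        text: str

    def retrieve(query_emb: List[float], docs: List[Doc], k: int = 20) -> List[Doc]:
        # Stage 1: cheap, recall-oriented candidate generation (dot-product similarity).
        score = lambda d: sum(q * x for q, x in zip(query_emb, d.embedding))
        return sorted(docs, key=score, reverse=True)[:k]

    def rank(query: str, candidates: List[Doc],
             scorer: Callable[[str, str], float]) -> List[Doc]:
        # Stage 2: a more expensive model re-orders the shortlist.
        return sorted(candidates, key=lambda d: scorer(query, d.text), reverse=True)

    def serve(ranked: List[Doc], max_tokens: int = 2000) -> List[Doc]:
        # Stage 3: serving-time rules — dedupe and stay inside the context budget.
        out, used, seen = [], 0, set()
        for d in ranked:
            cost = len(d.text) // 4  # rough token estimate
            if d.name in seen or used + cost > max_tokens:
                continue
            out.append(d); used += cost; seen.add(d.name)
        return out

The point of the split is that each stage has its own knobs: you can swap embedding models in retrieve, retrain the scorer in rank, and change business rules in serve, all independently.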
And so I think the only piece of the puzzle that you don't have access to in the LLM case is that sort of like middle step. That's okay. We've got plenty of other work to do. So right now I feel pretty enabled. [00:41:56]Alessio: That's great. You obviously wrote a book on RecSys. What are some of the key concepts that maybe people that don't have a data science or ML background should keep in mind as they work in this area? [00:42:07]Bryan: It's easy to first think these models are stochastic. They're unpredictable. Oh, well, what are we going to do? I think of this almost as a gaseous-type question: if you've got this entropy, where can you put the entropy? Where can you let it be entropic and where can you constrain it? And so what I want to say here is think about the cases where you need it to be really tightly constrained. So why are people so excited about function calling? Because function calling feels like a way to constrict it. Where can you let it be more gaseous? Well, maybe in the way that it talks about what it wants to do. Maybe for planning, if you're building agents and you want to do sort of something chain of thoughty. Well, that's a place where the entropy can happily live. When you're building applications of these models, I think it's really important as part of the problem framing to be super clear upfront. These are the things that can be entropic. These are the things that cannot be. These are the things that need to be super rigid and really, really aligned to a particular schema. We've had a lot of success in making specific the parts that need to be precise and tightly schemified, and that has really paid dividends. And so other analogies from data science that I think are very valuable is there's the sort of like human in the loop analogy, which has been around for quite a while. And I have gone on record a couple of times saying that like, I don't really love human in the loop. One of the things that I think we can learn from human in the loop is that the user is the best judge of what is good. And the user is pretty motivated to sort of like interact and give you kind of like additional nudges in the direction that you want. I think what I'd like to flip though, is instead of human in the loop, I'd like it to be AI in the loop. I'd rather center the user. I'd rather keep the user as the like core item at the center of this universe. And the AI is a tool. By switching that analogy a little bit, what it allows you to do is think about where are the places in which the user can reach for this as a tool, execute some task with this tool, and then go back to doing their workflow. It still gets this back and forth between things that computers are good at and things that humans are good at, which has been valuable in the human-in-the-loop paradigm. But it allows us to be a little bit more, I would say, like the designers talk about like user-centered. And I think that's really powerful for AI applications. And it's one of the things that I've been trying really hard with Magic: to make the workflow feel like the AI is right there. It's right where you're doing your work. It's ready for you anytime you need it. But ultimately you're in charge at all times and your workflow is what we care the most about. [00:44:56]Alessio: Awesome. Let's jump into the lightning round. What's something that is not on your LinkedIn that you're passionate about or, you know, what's something you would give a TED talk on that is not work related? 
[00:45:05]Bryan: So I walk a lot. [00:45:07]Bryan: I have walked every road in Berkeley. And I mean like every part of every road even, not just like the binary question of, have you been on this road? I have this little app that I use called Wanderer, which just lets me like kind of keep track of everywhere I've been. And so I'm like a little bit obsessed. My wife would say a lot a bit obsessed with like what I call new roads. I'm actually more motivated by trails even than roads, but like I'm a maximalist. So kind of like everything and anything. Yeah. Believe it or not, I was even like in the like local Berkeley paper just talking about walking every road. So yeah, that's something that I'm like surprisingly passionate about. [00:45:45]Alessio: Is there a most underrated road in Berkeley? [00:45:49]Bryan: What I would say is like underrated is Kensington. So Kensington is like a little town just a teeny bit north of Berkeley, but still in the Berkeley hills. And Kensington is so quirky and beautiful. And it's a really like, you know, don't sleep on Kensington. That being said, one of my original motivations for doing all this walking was people always tell me like, Berkeley's so quirky. And I was like, how quirky is Berkeley? Turns out, it's quite, quite quirky. It's also hard to say quirky and Berkeley in the same sentence I've learned as of now. [00:46:20]Alessio: That's a, that's a good podcast warmup for our next guests. All right. The actual lightning round. So we usually have three questions: acceleration, exploration, then a takeaway. Acceleration: what's something that's already here today that you thought would take much longer to arrive in AI and machine learning? [00:46:39]Bryan: So I invited the CEO of Hugging Face to my seminar when I worked at Stitch Fix and his talk at the time, honestly, like really annoyed me. The talk was titled like something to the effect of like LLMs are going to be the like technology advancement of the next decade. It's on YouTube. You can find it. I don't remember exactly the title, but regardless, it was something like LLMs for the next decade. And I was like, okay, they're like one modality of model, like whatever. His talk was fine. Like, I don't think it was like particularly amazing or particularly poor, but what I will say is damn, he was right. Like I, I don't think I quite was on board during that talk where I was like, ah, maybe, you know, like there's a lot of other modalities that are like moving pretty quick. I thought things like RL were going to be the like real like breakout success. And there's a little pun with Atari and Breakout there, but yeah, like I, man, I was sleeping on LLMs and I feel a little embarrassed. I, yeah. [00:47:44]Alessio: Yeah. No, I mean, that's a good point. It's like, sometimes—we just had Jeremy Howard on the podcast, and he was saying, when he was talking about fine-tuning, everybody thought it was dumb, you know, and then later people realized. And there's something to be said about messaging, especially like in technical audiences where there's kind of like the metagame, you know, which is like, oh, these are like the cool ideas people are exploring. I don't know where I want to align myself yet, you know, or whatnot. That's cool. So, exploration—it's kind of like the opposite of that. You mentioned RL, right? That's something that was kind of like up and up and up. And then now it's people are like, oh, I don't know. 
Are there any other areas, if you weren't working on Magic, that you want to go work on? [00:48:25]Bryan: Well, I did mention that, like, I think this like Memex product is just like incredibly exciting to me. And I think it's really opportunistic. I think it's very, very feasible, but I would maybe even extend that a little bit, which is I don't see enough people getting really enthusiastic about hardware with advanced AI built in. You're hearing whisperings of it here and there—pun on Whisper intended—and you're starting to see people putting Whisper into pieces of hardware and making that really powerful. I joked with, I can't think of her name. Oh, Sasha, who I know is a friend of the pod. Like I joked with Sasha that I wanted to make the Big Mouth Billy Bass as a Babel fish, because at this point it's pretty easy to connect that up to Whisper and talk to it in one language and have it talk in the other language. And I was like, this is the kind of s**t I want people building is like silly integrations between hardware and these new capabilities. And as much as I'm starting to hear whisperings here and there, it's not enough. I think I want to see more people going down this track because I think ultimately like these things need to be in our like physical space. And even though the margins are good on software, I want to see more like integration into my daily life. Awesome. [00:49:47]Alessio: And then, yeah, a takeaway: what's one message or idea you want everyone to remember and think about? [00:49:54]Bryan: Even though earlier I was talking about sort of like, maybe like not reinventing things and being respectful of the sort of like ML and data science, like ideas. I do want to say that I think everybody should be experimenting with these tools as much as they possibly can. I've heard a lot of professors, frankly, express concern about their students using GPT to do their homework. And I took a completely opposite approach, which is in the first 15 minutes of the first class of my semester this year, I brought up GPT on screen and we talked about what GPT was good at. And we talked about like how the students can sort of like use it. I showed them an example of it doing data analysis work quite well. And then I showed them an example of it doing quite poorly. I think however much you're integrating with these tools or interacting with these tools, and this audience is probably going to be pretty high on that distribution. I would really encourage you to sort of like push this into the other people in your life. My wife is very technical. She's a product manager and she's using ChatGPT almost every day for communication or for understanding concepts that are like outside of her sphere of excellence. And recently my mom and my sister have been sort of like onboarded onto the ChatGPT train. And so ultimately I just, I think that like it is our duty to help other people see like how much of a paradigm shift this is. We should really be preparing people for what life is going to be like when these are everywhere. [00:51:25]Alessio: Awesome. Thank you so much for coming on, Bryan. This was fun. [00:51:29]Bryan: Yeah. Thanks for having me. And use Hex Magic. [00:51:31]

Cross-Chain Examination
Ismael H.R. of Lagrange: ZK Big Data

Cross-Chain Examination

Play Episode Listen Later Nov 29, 2023 21:35


On this episode of Archebyte, Ash Egan is joined by Lagrange's founder and CEO, Ismael Hishon-Rezaizadeh, to talk zero knowledge, storage proofs, and unlocking blockchain data. Today, blockchains have a difficult time leveraging data from other chains, but Ismael and Lagrange are working to fix that. Leveraging ZK and storage proofs, Lagrange enables blockchains to better utilize blockchain data itself, removing the need to leverage offchain sources and the security risks that come with doing so. Ismael walks us through using blockchains as databases, shifting trust from provers to verification layers, crosschain data access, and what all of this means for the future of crypto and onchain applications. 
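At the heart of a storage proof is a familiar primitive: an inclusion proof against a committed root, so a verifier never has to trust the prover's data. The toy Python sketch below shows that core idea for a plain binary Merkle tree; production systems like the one discussed in this episode prove inclusion in Ethereum's Merkle-Patricia trie inside a ZK circuit, which is substantially more involved.

    import hashlib

    def h(data: bytes) -> bytes:
        return hashlib.sha256(data).digest()

    def verify_merkle_proof(leaf: bytes, path: list, root: bytes) -> bool:
        # Each path step is (sibling_hash, side), where side says which side the
        # sibling sits on. Recompute upward and compare against the committed root.
        node = h(leaf)
        for sibling, side in path:
            node = h(sibling + node) if side == "L" else h(node + sibling)
        return node == root

    # Toy two-leaf tree: prove the right leaf against the root.
    left, right = h(b"slot-0:42"), h(b"slot-1:7")
    root = h(left + right)
    assert verify_merkle_proof(b"slot-1:7", [(left, "L")], root)

The leaf encodings here are invented; the takeaway is only that a small root commitment plus a short sibling path lets anyone check a claimed value.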

AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion
AI Today Podcast: AI Glossary Series – Hadoop, MapReduce

AI Today Podcast: Artificial Intelligence Insights, Experts, and Opinion

Play Episode Listen Later Nov 27, 2023 15:32


Hadoop and MapReduce changed the world of big data. And data is the heart of AI, so it should come as no surprise that we talk about big data in the context of AI. In this episode of the AI Today podcast, hosts Kathleen Walch and Ron Schmelzer define the terms Hadoop and MapReduce, explain how these terms relate to AI, and explain why it's important to know about them.
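The MapReduce model the hosts define fits in a few lines: map each input split to key/value pairs, shuffle the pairs by key, then reduce each group. The toy Python below runs the whole thing in memory — real Hadoop distributes the splits and the shuffle across a cluster, but the shape of the computation is the same.

    from collections import defaultdict
    from itertools import chain

    def map_phase(doc: str):
        """Mapper: emit (word, 1) for every word in one input split."""
        for word in doc.split():
            yield word.lower(), 1

    def shuffle(pairs):
        """Group all values by key — the framework does this between phases."""
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        """Reducer: collapse all counts for one word."""
        return key, sum(values)

    docs = ["big data needs big tools", "data is the heart of AI"]
    grouped = shuffle(chain.from_iterable(map_phase(d) for d in docs))
    counts = dict(reduce_phase(k, v) for k, v in grouped.items())
    print(counts["data"])  # 2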

Screaming in the Cloud
How Couchbase is Using AI to Enhance the User Experience with Laurent Doguin

Screaming in the Cloud

Play Episode Listen Later Nov 14, 2023 31:52


Laurent Doguin, Director of Developer Relations & Strategy at Couchbase, joins Corey on Screaming in the Cloud to talk about the work that Couchbase is doing in the world of databases and developer relations, as well as the role of AI in their industry and beyond. Together, Corey and Laurent discuss Laurent's many different roles throughout his career including what made him want to come back to a role at Couchbase after stepping away for 5 years. Corey and Laurent dig deep on how Couchbase has grown in recent years and how it's using artificial intelligence to offer an even better experience to the end user. About Laurent: Laurent Doguin is Director of Developer Relations & Strategy at Couchbase (NASDAQ: BASE), a cloud database platform company that 30% of the Fortune 100 depend on. Links Referenced: Couchbase: https://couchbase.com XKCD #927: https://xkcd.com/927/ dbdb.io: https://dbdb.io DB-Engines: https://db-engines.com/en/ Twitter: https://twitter.com/ldoguin LinkedIn: https://www.linkedin.com/in/ldoguin/ Transcript — Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud. Corey: Are you navigating the complex web of API management, microservices, and Kubernetes in your organization? Solo.io is here to be your guide to connectivity in the cloud-native universe! Solo.io, the powerhouse behind Istio, is revolutionizing cloud-native application networking. They brought you Gloo Gateway, the lightweight and ultra-fast gateway built for modern API management, and Gloo Mesh Core, a necessary step to secure, support, and operate your Istio environment. Why struggle with the nuts and bolts of infrastructure when you can focus on what truly matters: your application. Solo.io's got your back with networking for applications, not infrastructure. Embrace zero trust security, GitOps automation, and seamless multi-cloud networking, all with Solo.io. And here's the real game-changer: a common interface for every connection, in every direction, all with one API. It's the future of connectivity, and it's called Gloo by Solo.io. DevOps and Platform Engineers, your journey to a seamless cloud-native experience starts here. Visit solo.io/screaminginthecloud today and level up your networking game. Corey: Welcome to Screaming in the Cloud, I'm Corey Quinn. This promoted guest episode is brought to us by our friends at Couchbase. And before we start talking about Couchbase, I would rather talk about not being at Couchbase. Laurent Doguin is the Director of Developer Relations and Strategy at Couchbase. First, Laurent, thank you for joining me. Laurent: Thanks for having me. It's a pleasure to be here. Corey: So, what I find interesting is that this is your second time at Couchbase, where you were a developer advocate there for a couple of years, then you had five years of, we'll call it wilderness I suppose, and then you return to be the Director of Developer Relations. Which also ties into my personal working thesis of, the best way to get promoted at a lot of companies is to leave and then come back. But what caused you to decide, all right, I'm going to go work somewhere else? And what made you come back? Laurent: So, I've joined Couchbase in 2014. Spent about two or three years as a DA. 
And during those three years as a developer advocate, I've been advocating a SQL database and—at the time—it was mostly DBAs and ops I was talking to. And DBAs and ops are—well, recent, modern ops are writing code—but they were not the people I wanted to talk to when I was a developer advocate. I came from a developer background, I've been a platform engineer for an enterprise content management company. I was writing code all day. And when I came to Couchbase, I realized I was mostly talking about Docker and Kubernetes, which is still cool, but not what I wanted to do. I wanted to talk to developers about how they use the database to build better apps, how they use key-value, and those weird things like MapReduce. At the time, MapReduce was still, like, a weird thing for a lot of people, and probably still is because now everybody's doing SQL. So, that's what I wanted to talk about. I wanted to… engage with people I identify with, really. And so, didn't happen. Left. Built a Platform as a Service company called Clever Cloud. They started about four or five years before I joined. We went from seven people to thirty-one LFs, fully bootstrapped, no VC. That's an interesting way to build a company in this age. Corey: Very hard to do because it takes a lot of upfront investment to build software, but you can sort of subsidize that via services, which is what we've done here in some respects. But yeah, that's a hard road to walk. Laurent: That's the model we had—and especially when your competition is AWS or Azure or GCP, so that was interesting. So entrepreneurship, it's not for everyone. I did my four years there and then I realized, maybe I'm going to do something else. I met my former colleagues from Couchbase at a software conference called Devoxx, in France, and they told me, “Well, there's a new sheriff in town. You should come back and talk to us. It's all about developers, we are repositioning, rehandling the way we do marketing at Couchbase. Why not have a conversation with our new CMO, John Kreisa?” And I said, “Well, I mean, I don't have anything to do. I actually built a brewery during that past year with some friends. That was great, but that's not going to feed me or anything. So yeah, let's have a conversation about work.” And so, I talked to John, I talked to a bunch of other people, and I realized [unintelligible 00:03:51], he actually changed, like, there was a—they were purposely going [against 00:03:55] developers, talking to developers. And that was not the case, necessarily, five, six years before that. So, that's why I came back. The product is still amazing, the people are still amazing. It was interesting to find a lot of people that still work there after, what, five years. And it's a company based in… California, headquartered in California, so you would expect people to, you know, jump around a bit. And I was pleasantly surprised to find the same folks there. So, that was also one of the reasons why I came back.
More than a couple. Corey: So, one thing that seems to have been a thread through most of your career has been an emphasis on developer experience. And I don't know if we come at it from the same perspective, but to me, what drives me nuts is honestly, with my work in cloud, bad developer experience manifests as the developer in question feeling like they're somehow not very good at their job. Like, they're somehow not understanding how all this stuff is supposed to work, and honestly, it leads to feeling like a giant fraud. And I find that it's pernicious because even when I intellectually know for a fact that I'm not the dumbest person ever to use this tool when I don't understand how something works, the bad developer experience manifests to me as, “You're not good enough.” At least, that's where I come at it from. Laurent: And also, I [unintelligible 00:05:34] to people that build these products because if we build the products, the user might be in the same position that we are right now. And so, we might be responsible for that experience [unintelligible 00:05:43] a developer, and that's not a great feeling. So, I completely agree with you. I've tried to… always work at software-focused companies, whether it was Nuxeo, Couchbase, Clever Cloud, and then Couchbase. And I guess one of the good things about coming back to a developer-focused era is all the product alignment. Like, a lot of people talk about product-led [growth 00:06:08] and what it means. To me, what it meant—what it still means—is building a product that developers want to use, and not just have to use—sometimes it's imposed on you—but are actually happy to use, and as you said, don't feel completely stupid about it in front of the product. It goes through different things. We've recently revamped our Couchbase UI, Couchbase Capella UI—Couchbase Capella is a managed cloud product—and so we've added a lot of in-product getting started guidelines, snippets of code, to help developers get started better and not have that feeling of, “What am I doing? Why is it not working and what's going on?” Corey: That's an interesting decision to make, just because historically, working with a bunch of tools, the folks who are building the documentation for that tool tend to generally be experts at it, so they tend to optimize for improving things for the experience of someone who has been using it for five years as opposed to the newcomer. So, I find that the longer a product is in existence, in many cases, the worse the new user experience becomes because companies tend to grow and sprawl in different ways, the product does likewise. And if you don't know the history behind it, “Oh, your company, what does it do?” And you look at the website and there's 50 different offerings that you have—like, the AWS landing page—it becomes overwhelming very quickly. So, it's neat to see that emphasis throughout the user interface on the new developer experience. On the other side of it, though, how do the folks who've been using it for a while respond to those changes? Because it's frustrating for me at least, when I log into a new account, which happens periodically within AWS land, and I have this giant series of onboarding pop-ups that I have to click to make go away every single time. How are they responding to it? Laurent: Yeah, it's interesting. One of the first things that struck me when I joined Couchbase the first time was the size of the technical documentation team. 
Because the whole… well, not the whole point, but part of the reason why they exist is to do that, to make sure that you understand all the differences and that it doesn't feel like the [unintelligible 00:08:18] what the documentation or the product pitch or everything. Like, they really, really, really emphasized this from the very beginning. So, that was interesting. So, when you get that culture built into the products, well, the good thing is… when people try Couchbase, they usually stick with Couchbase. My main issue as Director of Developer Relations is not to make people stick with Couchbase because that works fairly well with the product that we have; it's to make them aware that we exist. That's the biggest issue I have. So, my goal as DevRel is to make sure that people get the trial, get through the trial, get all that in-app context, all that help, get that first sample going, get that first… I'm not going to say product built because that's even a bit further down the line, but you know, get that sample going. We have a code playground, so when you're in the application, you get to actually execute different pieces of code, different languages. And so, we get those numbers and we're happy to see that people actually try that. And that's a, well, that's a good feeling. Corey: I think that there's a definite lack of awareness almost industry-wide around the fact that as the diversity of your customers increases, you have to have different approaches that meet them at various points along the journey. Because things that I've seen are okay, it's easy to ass—even just assuming a binary of, “Okay, I've done this before a thousand times; this is the thousand and first, I don't need the Hello World tutorial,” versus, “Oh, I have no idea what I'm doing. Give me the Hello World tutorial,” there are other points along that continuum, such as, “Oh, I used to do something like this, but it's been three years. Can you give me a refresher,” and so on. I think that there's a desire to try and fit every new user into a predefined persona and that just doesn't work very well as products become more sophisticated. Laurent: It's interesting, we actually have—we went through that work of defining those personas because there are many. And that was the origin of my departure. I had one person, ops slash DBA slash the person that maintains this thing, and I wanted to talk to all the other people that build applications on Couchbase. So, we broadly segment things into back-end, full-stack, and mobile because Couchbase is also a mobile database. Well, we haven't talked too much about this, so I can quickly explain what Couchbase is. It's basically a distributed JSON database with an integrated caching layer, so it's reasonably fast. So it does cache, and when the value is JSON, then you can query it with SQL, you can do full-text search, you can do analytics, you can run user-defined functions, you get triggers, you get all that actual SQL going on, it's transactional, you get joins, ANSI joins, you get all those… windowing functions. It's modern SQL on the JSON database. So, it's a general-purpose database, and it's a general-purpose database that syncs. I think that's the important part of Couchbase. We are very good at syncing clusters of databases together. So, great for multi-cloud, hybrid cloud, on-prem, whatever suits you. 
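As a concrete taste of that "modern SQL on JSON" pitch, a query in Couchbase's SQL++/N1QL dialect can reach inside documents and their nested arrays. The collection name and document shape below are hypothetical, and run_query is only a stand-in for whatever query call your SDK of choice exposes — check the SDK docs for the real entry point:

    def run_query(statement: str):
        # Hypothetical stand-in for an SDK call (e.g. a cluster/scope .query method
        # in the Couchbase SDKs); returns an iterable of JSON rows.
        raise NotImplementedError("wire this up to your cluster")

    # Hypothetical `customers` documents, each embedding an `orders` array.
    statement = """
    SELECT c.name, ARRAY_LENGTH(c.orders) AS order_count
    FROM customers AS c
    WHERE c.country = 'FR'
      AND ANY o IN c.orders SATISFIES o.total > 100 END
    ORDER BY order_count DESC
    LIMIT 10;
    """
    for row in run_query(statement):
        print(row["name"], row["order_count"])

The ANY … SATISFIES … END clause is the JSON-specific part: it filters on elements of an array nested inside each document, something classic relational SQL would need a join and a child table for.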
And we also sync on the device, there's a thing called Couchbase Mobile, which is a local database that runs in your phone, and it will sync automatically to the server. So, a general-purpose database that syncs and that's quite modern. We try to fit as many ways of querying data as possible in our database. It's kind of a several-in-one database. We call that a data platform. It took me a while to warm up to the word platform because I used to work for an enterprise content management platform and then I've been working for a Platform as a Service and then a data platform. So, it took me a bit of time to warm up to that term, but it explains fairly well the fact that it's a several-in-one product and we empower people to do the trade-offs that they want. Not everybody needs… SQL. Some people just need key-value, some people need search, some people need to do SQL and search in the same query, which we also want people to do. So, it's about choices, it's about empowering people. And that's why the word platform—which can feel intimidating because it can seem complex, you know, [for 00:12:34] a lot of choices. And choice is maybe the enemy of a good developer experience. And, you know, we can try to talk—we can talk for hours about this. The more services you offer, the more complicated it becomes. What's the sweet spot? We did—our own trade-off was to have good documentation and good in-app help to fix that complexity problem. That's the trade-off that we made. Corey: Well, we should probably divert here just to make sure that we cover the basic groundwork for those who might not be aware: what exactly is Couchbase? I know that it's a database, which honestly, anything is a database if you hold it incorrectly enough; that's my entire shtick. But what is it exactly? Where does it start? Where does it stop? Laurent: Oh, where does it start? That's an interesting question. It's a… a merge—some people would say a fork—of Apache CouchDB, and membase. Membase was a distributed key-value store and CouchDB was this weird Erlang and C JSON REST API database that was built by Damien Katz from Lotus Notes, and that was in 2006 or seven. That was before Node.js. Let's not care about the exact date. The point is, a JSON and REST API-enabled database before Node.js was, like, a strong [laugh] power move. And so, those two merged and created the first version of Couchbase. And then we've added all those things that people want to do, so SQL, full-text search, analytics, user-defined functions, mobile sync, you know, all those things. So basically, a general-purpose database. Corey: For what things is it not a great fit? This is always my favorite question to ask database folks because the zealot is going to say, “It's good for every use case under the sun. Use it for everything, start to finish”— Laurent: Yes. Corey: —and very few databases can actually check that box. Laurent: It's a very interesting question because when I pitch like, “We do all the things,” because we are a platform, people say, “Well, you must be doing lots of trade-offs. Where is the trade-off?” The trade-off is basically the way you store something is going to determine the efficiency of your [querying 00:14:45]—or the way you [query 00:14:47] it. And that's one of the first things you learn in computer science. You learn about data structures and you know that it's easier to get something in a hashmap when you have the key than passing over your whole list of elements and checking each one: is it the right one? 
It's the same for databases. So, our different services are different ways to store the data and to query it. So, where is it not good, it's where we don't have an index or a service that answers to the way you want to query data. We don't have a graph service right now. You can still do recursive common table expressions for the SQL nerds out there, that will allow you to do somewhat of a graph way of querying your data, but that's not, like, actual—that's not a great experience for people who were expecting a graph—like a Neo4j or whatever—a graph database experience. So, that's the trade-off that we made. We have a lot of things in the same place and it can be a little hard, intimidating to operate, and the developer experience can be a little, “Oh, my God, what is this thing that can do all of those features?” At the same time, that's just, like, one SDK to learn for all of the features we've just talked about. So, that's what we did. That's a trade-off that we made. It sucks to operate—well, [unintelligible 00:16:05] Couchbase Capella, which is a lot like a vendor-ish thing to say, but that's the value prop of our managed cloud. It's hard to operate, we'll operate this for you. We have a Kubernetes operator. If you are one of the few people that wants to do Kubernetes at home, that's also something you can do. So yeah, I guess what we cannot do is the thing that Route 53 and [Unbound 00:16:26] and [unintelligible 00:16:27] DNS do, which is this weird DNS database thing that you like so much. Corey: One thing that, I guess, is a sign of the times, but I have to confess that I'm relatively skeptical around, when I pull up couchbase.com—as one does; you're publicly traded; I don't feel that your company has much of a choice in this—but the first thing it greets me with is Couchbase Capella—which, yes, that is your hosted flagship product; that should be the first thing I see on the website—then it says, “Announcing Capella iQ, AI-powered coding assistance for developers.” Which, oh great, not another one of these. So, all right, give me the pitch. What is the story around, “Ooh, everything that has been a problem before, AI is going to make it way better.” Because I've already talked to you about developer experience. I know where you stand on these things. I have a suspicion you would not be here to endorse something you don't believe in. How does the AI magic work in this context? Laurent: So, that's the thing, like, who's going to be the one that gets their product out before the other? And so, we're announcing it on the website. It's available in private preview only right now. I've tried it. It works. How does it work? The way most chatbot AI code generation works is there's a big model, large language model that people use and that people fine-tune in order to specialize it to the tasks that they want to do. The way we've built Couchbase iQ is we picked a very famous large language model, and when you ask a question to a bot, there's a context, there's a… the size of the window basically, that allows you to fit as much contextual information as possible. 
The way it works and the reason why it's integrated into Couchbase Capella is we make sure that we preload that context as much as possible and fine-tune that model, that [foundation 00:18:19] model, as much as possible to do whatever you want to do with Couchbase, which usually falls into several—a couple of categories, really—well maybe three—you want to write SQL, you want to generate data—actually, that's four—you want to generate data, you want to generate code, and if you paste some SQL code or some application code, you want to ask that model, what does it do? It's especially true for SQL queries. And one of the questions that many people ask and are scared of with chatbots is how does it work in terms of learning? If you give a chatbot to someone that's very new to something, and they're just going to basically use a chatbot like Stack Overflow and not really think about what they're doing, well, it's not [great 00:19:03], right? But that's because the example people think of is most developers will just generate code. Writing code is, like, a small part of our job. Like, a substantial part of our job is understanding what the code does. Corey: We spend a lot more time reading code than writing it, if we're, you know— Laurent: Yes. Corey: Not completely foolish. Laurent: Absolutely. And sometimes reading a big SQL query can be a bit daunting, especially if you're new to that. And one of the good things that you get— Corey: Oh, even if you're not, it can still be quite daunting, let me assure you. Laurent: [laugh]. I think it's an acquired taste, let's be honest. Some people like to write assembly code and some people like to write SQL. I'm sort of in the middle right now. You pass your SQL query, and it's going to tell you more or less what it does, and that's a very nice superpower of AI. I think that's [unintelligible 00:19:48] that's the one that interests me the most right now is using AI to understand and to work better with existing pieces of code. Because a lot of people think that the cost of software is writing the software. It's maintaining the codebase you've written. That's the cost of the software. Our job as developers should be to write legacy code, because it means you've provided value long enough. And so, if you're in a company that works pretty well and there's a lot of legacy code and there's a lot of new people coming in and they'll have to learn all those things, and to be honest, sometimes we don't document stuff as much as we should— Corey: “The code is self-documenting,” is one of the biggest lies I hear in tech. Laurent: Yes, of course, which is why people are asking retired people to go back to COBOL again because nobody can read it and it's not documented. Actually, if someone's looking for a company to build, I guess, explaining COBOL code with AI would be a pretty good fit to do in many places. Corey: Yeah, it feels like that's one of those things that would be of benefit to the larger world. The counterpoint to that is, if you've got that many business processes wrapped around something running COBOL—and I assure you, if you didn't, you would have migrated off of COBOL long before now—it's making sure that, okay, well, computers, when they're in the form of AI, are very, very good at sounding confident when they talk about things, but they can also do that when they're completely wrong. It's basically a BS generator. And that is a scary thing when you're taking a look at something that broad. 
I mean, I'll use the AI coding assistance for things all the time, but those things look a lot more like, “Okay, I haven't written CloudFormation from scratch in a while. Build out the template, just because I forget the exact sequence.” And it's mostly right on things like that. But then you start getting into some of the real nuanced areas like race conditions and the rest, and often it can make things worse instead of better. That's the scary part, for me, at least. Laurent: Most coding assistants are… and actually, each time you ask an AI its opinion, they say, “Well, you should take this with a grain of salt and we are not a hundred percent sure that this is the case.” And this is: make sure you proofread that, which again, from a learning perspective, can be a bit hard to give to new students. Like, you're giving something to someone, and they might assume it's about as right as Wikipedia, but actually, it's not. And it's part of why it works so well. Like, the anthropomorphism that you get with chatbots, like this—it feels so human. That's why it gets people so excited about it because if you think about it, it's not that new. It's just the moment it took off was the moment it looked like an assertive human being. Corey: As you take a look through, I guess, the larger ecosystem now, as well as the database space, given that is where you specialize, what do you think people are getting right and what do you think people are getting wrong? Laurent: There's a couple of ways of seeing this. Right now, when I look at it from the outside, every database is going back to SQL, and I think there's a good reason for that. And it's interesting to put into perspective with AI because when you generate something, there's probably less chance to generate something wrong with SQL than generating something with code directly. And I think it's fifth-generation—was it fourth- or fifth-generation languages—there's this notion of language generations: basically, the first generation is assembly [00:23:03], and then you get more evolved languages, and at some point you get SQL. And SQL is a way to very succinctly express a whole lot of business logic. And I think what people are doing right now is going back to SQL. And it's been impressive to me how even new developers that were all about [ORMs 00:23:25] and [ODMs 00:23:26], and you know, avoiding writing SQL as much as possible, are actually back to it. And that's, for an old guy like me—well I mean, not that old—it feels good. I think SQL is coming back with a vengeance and that makes me very happy. I think what people don't realize is that it also involves doing data modeling, right, because databases like Couchbase that are schemaless exist, and it can feel like you can store your data without thinking about it—but you should still do data modeling. It's important. So, I think those are the interesting bits. What are people doing wrong in that space? I'm… I don't want to say bad things about other databases, so I cannot even process that thought right now. Corey: That's okay. I'm thrilled to say negative things about any database under the sun. They all haunt me. I mean, someone once described SQL to me as the chess of the programming world, and I feel like that's very accurate. I have found that it is far easier in working with databases to make mistakes that don't wash off after a new deployment than it is in most other realms of technology. 
And when you're lucky and have a particular aura, you tend to avoid that stuff, at least that was always my approach. Laurent: I think if I had something to say, it's just like the XKCD about standards: like, “there's 14 standards. I'm going to do one that's going to unify them all.” And it's the same with databases. There's a lot… a [laugh] lot of databases. Have you ever been on a website called dbdb.io? Corey: Which one is it? I'm sorry. Laurent: Dbdb.io is the database of databases, and it's a very [laugh] interesting website for database nerds. And so, if you're into databases, dbdb.io. And you will find Couchbase and you will find a whole bunch of other databases, and you'll get to know which database is derived from which other database, you get the history, you get all those things. It's actually pretty interesting. Corey: I'm familiar with DB-Engines, which is sort of like the ranking of databases by popularity, and companies will bend over backwards to wind up hitting all of the various things that they want in that space. The counterpoint with all of it is that it's… it feels historically like there haven't exactly been an awful lot of, shall we say, huge innovations in databases for the past few years. I mean, sure, we hear about vectors all the time now because of the joy that's AI, but smarter people than I are talking about how, well, that's more of a feature than it is a core database. And the continual battle that we all hear about constantly—and deal with ourselves—of should we use a general-purpose database, or a task-specific database for this thing that I'm doing, remains largely unsolved. Laurent: Yeah, what's new? And when you look at it, it's like, we are going back to our roots and bringing back SQL again. So, is there anything new? I guess most of the new stuff, all the interesting stuff in the 2010s—well, basically with the cloud—was all about the distribution side of things and was all about distributed consensus, Zookeeper, etcd, all that stuff. Couchbase is using a Raft-like algorithm to keep every node happy and under the same cluster. I think that's one of the most interesting things we've had for the past… well, not for the past ten years, but between, basically, 20 or… between the start of AWS and well, let's say seven years ago. I think the end of the distribution game was brought to us by the people that have atomic clocks in every data center because that's what you use to synchronize things. So, those were interesting things. And then suddenly, there wasn't that much innovation in the distributed world, maybe because Aphyr disappeared from Twitter. That might be one of the reasons. He's not here to scare people enough to be better at that. Aphyr was the person behind the test [suite 00:27:12] called Jepsen. I think his blog series was called Call Me Maybe, and he was going through every distributed system and trying to break them. And that was super interesting. And it feels like we're not talking that much about this anymore. It really feels like databases have gone back to the status of infrastructure. In 2010, it was not about infrastructure. It was about developer empowerment. It was about serving JSON and developer experience and making sure that you can code faster without some constraint in a distributed world. And like, we fixed this for the most part. And the way we fixed this—and as you said, lack of innovation, maybe—has brought databases back to an infrastructure layer. Again, it wasn't the case 15 years a—well, 2023—13 years ago. And that's interesting. 
When you look at the new generation of databases, sometimes it's just a gateway on top of a well-known database and they call that a database, but it provides higher-level services, higher-level bricks, a better developer experience for developers to build stuff faster. We've been trying to do this with Couchbase App Service and our Sync Gateway, which is basically a gateway on top of a Couchbase cluster that allows you to manage authentication and authorization, and that allows you to manage synchronization with your mobile device or with websites. And yeah, I think the most interesting thing to me in this industry is how it's been relegated back to infrastructure, and all the cool stuff, the new stuff, happens on the layer above that.Corey: I really want to thank you for taking the time to speak with me. If people want to learn more, where's the best place for them to find you?Laurent: Thanks for having me and for entertaining this conversation. I can be found anywhere on the internet with these six letters: L-D-O-G-U-I-N. That's actually 7 letters. Ldoguin. That's my handle on pretty much any social network. Ldoguin. So X, [BlueSky 00:29:21], LinkedIn. I don't know where to be anymore.Corey: I hear you. We'll put links to all of it in the [show notes 00:29:27] and let people figure out where they want to go on that. Thank you so much for taking the time to speak with me today. I really do appreciate it.Laurent: Thanks for having me.Corey: Laurent Doguin, Director of Developer Relations and Strategy at Couchbase. I'm Cloud Economist Corey Quinn and this episode has been brought to us by our friends at Couchbase. If you enjoyed this episode, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with an angry comment that you're not going to be able to submit properly because that platform of choice did not pay enough attention to the experience of typing in a comment.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.
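Laurent's "Raft-like algorithm" remark is easy to gloss over. As a rough illustration of what consensus-by-election means, here is a minimal Python sketch of Raft-style voting. It ignores logs, heartbeats, timeouts, and networking, and is not tied to Couchbase's actual implementation; all names are invented for the example.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    node_id: int
    current_term: int = 0
    voted_for: Optional[int] = None  # at most one vote per term

def request_vote(voter: Node, term: int, candidate_id: int) -> bool:
    # Seeing a newer term resets the voter's state.
    if term > voter.current_term:
        voter.current_term = term
        voter.voted_for = None
    # Grant the vote only for the voter's current term, and only once per term.
    if term == voter.current_term and voter.voted_for in (None, candidate_id):
        voter.voted_for = candidate_id
        return True
    return False

def run_election(nodes, candidate: Node) -> bool:
    candidate.current_term += 1
    candidate.voted_for = candidate.node_id  # candidates vote for themselves
    votes = 1 + sum(
        request_vote(n, candidate.current_term, candidate.node_id)
        for n in nodes
        if n is not candidate
    )
    # Majority quorum: two candidates cannot both win the same term.
    return votes > len(nodes) // 2

cluster = [Node(i) for i in range(5)]
print(run_election(cluster, cluster[0]))  # True: wins 5 of 5 votes in term 1
```

The one-vote-per-term rule plus the majority quorum is what keeps a partitioned cluster from electing two leaders at once, which is the property Jepsen-style testing tries to break.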

Data Engineering Podcast
Keep Your Data Lake Fresh With Real Time Streams Using Estuary

Data Engineering Podcast

Play Episode Listen Later May 21, 2023 55:50


Summary
Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enables you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Estuary is and the story behind it?
Stream processing technologies have been around for around a decade. How would you characterize the current state of the ecosystem?
What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?
With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?
What is the comparative level of difficulty and support for these disparate paradigms?
What is the impact of continuous data flows on dags/orchestration of transforms?
What role do modern table formats have on the viability of real-time data lakes?
Can you describe the architecture of your Flow platform?
What are the core capabilities that you are optimizing for in its design?
What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems?
What does the workflow look like for a team using Estuary?
How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?
How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?
What are the most interesting, innovative, or unexpected ways that you have seen Estuary used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?
When is Estuary the wrong choice?
What do you have planned for the future of Estuary?
Contact Info
Dave Y (mailto:dave@estuary.dev)
Johnny G (mailto:johnny@estuary.dev)

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers

Links
Estuary (https://estuary.dev)
Try Flow Free (https://dashboard.estuary.dev/register)
Gazette (https://gazette.dev)
Samza (https://samza.apache.org/)
Flink (https://flink.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/)
Storm (https://storm.apache.org/)
Kafka Topic Partitioning (https://www.openlogic.com/blog/kafka-partitions)
Trino (https://trino.io/)
Avro (https://avro.apache.org/)
Parquet (https://parquet.apache.org/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Airbyte (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Snowflake (https://www.snowflake.com/en/)
BigQuery (https://cloud.google.com/bigquery)
Vector Database (https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/vectordb)
CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
Debezium (https://debezium.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114/)
MapReduce (https://en.wikipedia.org/wiki/MapReduce)
Netflix DBLog (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b)
JSON-Schema (http://json-schema.org/)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

Der Data Analytics Podcast
Apache Spark - Data Processing and Analysis of Big Data - vs. MapReduce

Der Data Analytics Podcast

Play Episode Listen Later Dec 13, 2022 5:04


An open-source platform that makes it possible to process large volumes of data. You can work with it in various programming languages such as Python, Java, R, etc. It uses an in-memory technology.
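As a concrete illustration of the episode's topic, here is a minimal PySpark sketch of the classic word count, the same map/reduce shape that MapReduce popularized, but with intermediate results held in memory. It assumes a local pyspark installation, and the input path is illustrative.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a cluster you would point this at a master URL.
spark = SparkSession.builder.appName("wordcount").master("local[*]").getOrCreate()

# Same map/reduce shape as classic MapReduce, but intermediate results stay
# in memory instead of being written to disk between stages.
counts = (
    spark.sparkContext.textFile("input.txt")  # illustrative input path
    .flatMap(lambda line: line.split())       # map: emit one token per word
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)          # reduce: sum the counts per word
)

for word, n in counts.take(10):
    print(word, n)

spark.stop()
```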

Screaming in the Cloud
Invisible Infrastructure and Data Solutions with Alex Rasmussen

Screaming in the Cloud

Play Episode Listen Later Aug 18, 2022 37:39


About AlexAlex holds a Ph.D. in Computer Science and Engineering from UC San Diego, and has spent over a decade building high-performance, robust data management and processing systems. As an early member of a couple fast-growing startups, he's had the opportunity to wear a lot of different hats, serving at various times as an individual contributor, tech lead, manager, and executive. He also had a brief stint as a Cloud Economist with the Duckbill Group, helping AWS customers save money on their AWS bills. He's currently a freelance data engineering consultant, helping his clients build, manage, and maintain their data infrastructure. He lives in Los Angeles, CA.Links Referenced: Company website: https://bitsondisk.com Twitter: https://twitter.com/alexras LinkedIn: https://www.linkedin.com/in/alexras/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: I come bearing ill tidings. Developers are responsible for more than ever these days. Not just the code that they write, but also the containers and the cloud infrastructure that their apps run on. Because serverless means it's still somebody's problem. And a big part of that responsibility is app security from code to cloud. And that's where our friend Snyk comes in. Snyk is a frictionless security platform that meets developers where they are - Finding and fixing vulnerabilities right from the CLI, IDEs, Repos, and Pipelines. Snyk integrates seamlessly with AWS offerings like code pipeline, EKS, ECR, and more! As well as things you're actually likely to be using. Deploy on AWS, secure with Snyk. Learn more at Snyk.co/scream That's S-N-Y-K.co/screamCorey: DoorDash had a problem. As their cloud-native environment scaled and developers delivered new features, their monitoring system kept breaking down. In an organization where data is used to make better decisions about technology and about the business, losing observability means the entire company loses their competitive edge. With Chronosphere, DoorDash is no longer losing visibility into their applications suite. The key? Chronosphere is an open-source compatible, scalable, and reliable observability solution that gives the observability lead at DoorDash business, confidence, and peace of mind. Read the full success story at snark.cloud/chronosphere. That's snark.cloud slash C-H-R-O-N-O-S-P-H-E-R-E.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I am joined this week by a returning guest, who… well, it's a little bit complicated and more than a little bittersweet. Alex Rasmussen was a principal cloud economist here at The Duckbill Group until he committed an unforgivable sin. That's right. He gave his notice. Alex, thank you for joining me here, and what have you been up to, traitor?Alex: [laugh]. Thank you for having me back, Corey.Corey: Of course.Alex: At time of recording, I am restarting my freelance data engineering business, which was dormant for the sadly brief time that I worked with you all at The Duckbill Group. And yeah, so that's really what I've been up to for the last few days. [laugh].Corey: I want to be very clear that I am being completely facetious when I say this. 
When someone is considering, “Well, am I doing what I really want to be doing?” And if the answer is no, too many days in a row, yeah, you should find something that aligns more with what you want to do. And anyone who's like, “Oh, you're leaving? Traitor, how could you do that?” Yeah, those people are trash. You don't want to work with trash.I feel I should clarify that this is entirely in jest and I could not be happier that you are finding things that are more aligned with aspects of what you want to be doing. I am serious when I say that, as a company, we are poorer for your loss. You have been transformative here across a number of different axes that we will be going into over the course of this episode.Alex: Well, thank you very much, I really appreciate that. And I came to a point where I realized, you know, the old saying, “You don't know what you got till it's gone?” I realized, after about six months of working with Duckbill Group that I missed building stuff, I missed building data systems, I missed being a full-time data person. And I'm really excited to get back to that work, even though I'll definitely miss working with everybody on the team. So yeah.Corey: There are a couple of things that I found really notable about your time working with us. One of them was that even when you wound up applying to work here, you were radically different than—well, let's be direct here—than me. We are almost polar opposites in a whole bunch of ways. I have an eighth-grade education; you have a PhD in computer science and engineering from UCSD. And you are super-deep into the world of data, start to finish, whereas I have spent my entire career on things that are stateless because I am accident prone, and when you accidentally have a problem with the database, you might not have a company anymore, but we can all laugh as we reprovision the web server fleet.We just went in very different directions as far as what we found interesting throughout our career, more or less. And we were not quite sure how it was going to manifest in the context of cloud economics. And I can say now that we have concluded the experiment, that from my perspective, it went phenomenally well. Because the exact areas that I am weak at are where you excel. And, on some level, I would say that you're not necessarily as weak in your weak areas as I am in mine, but we want to reinforce it and complementing each other rather than, “Well, we now have a roomful of four people who are all going to yell at you about the exact same thing.” We all went in different directions, which I thought was really neat.Alex: I did too. And honestly, I learned a tremendous, tremendous amount in my time at Duckbill Group. I think the window into just how complex and just how vast the ecosystem of services within AWS is, and kind of how they all ping off of each other in these very complicated ways was really fascinating, fascinating stuff. But also just an insight into just what it takes to get stuff done when you're talking with—you know, so most of my clientele to date have been small to medium-sized businesses, you know, small as two people; as big as a few hundred people. 
But I wasn't working with Fortune 1000 companies like Duckbill Group regularly does, and an insight into just, number one, what it takes to get things done inside of those organizations, but also what it takes to get things done with AWS when you're talking about, you know, for instance, contracts that are tens, or hundreds of millions of dollars in total contract value. And just what that involves was just completely eye-opening for me.Corey: From my perspective, what I found—I guess, in hindsight, it should have been more predictable than it was—but you talk about having a background and an abiding passion for the world of data, and I'm sitting here thinking, that's great. We have all this data in the form of the Cost and Usage Reports and the bills, and I forgot the old saw that yeah, if it fits in RAM, it's not a big data problem. And yeah, in most cases, what we have tends to fit in RAM. I guess you don't tend to find things interesting until Microsoft Excel gives up and calls uncle.Alex: I don't necessarily know that that's true. I think that there are plenty of problems to be had in the it fits in RAM space, precisely because so much of it fits in RAM. And I think that, you know, particularly now that, you know—I think there's it's a very different world that we live in from the world that we lived in ten years ago, where ten years ago—Corey: And right now I'm talking to you on a computer with 128 gigs of RAM, and it—Alex: Well, yeah.Corey: —that starts to look kind of big data-y.Alex: Well, not only that, but I think on the kind of big data side, right? When you had to provision your own Hadoop cluster, and after six months of weeping tears of blood, you managed to get it going, right, at the end of that process, you went, “Okay, I've got this big, expensive thing and I need this group of specialists to maintain it all. Now, what the hell do I do?” Right? In the intervening decade, largely due to the just crushing dominance of the public clouds, that problem—I wouldn't call that problem solved, but for all practical purposes, at all reasonable scales, there's a solution that you can just plug in a credit card and buy.And so, now the problem, I think, becomes much more high level, right, than it used to be. Used to be talking about how well you know, how do I make this MapReduce job as efficient as it possibly can be made? Nobody really cares about that anymore. You've got a query planner; it executes a query; it'll probably do better than you can. Now, I think the big challenges are starting to be more in the area of, again, “How do I know what I have? How do I know who's touched it recently? How do I fix it when it breaks? How do I even organize an organization that can work effectively with data at petabyte scale and say anything meaningful about it?”And so, you know, I think that the landscape is shifting. One of the reasons why I love this field so much is that the landscape is shifting very rapidly and as soon as we think, “Ah yes. We have solved all of the problems.” Then immediately, there are a hundred new problems to solve.Corey: For me, what I found, I guess, one of the most eye-opening things about having you here is your actual computer science background. Historically, we have biased for folks who have come up from the ops side of the world. And that lends itself to a certain understanding. And, yes, I've worked with developers before; believe it or not, I do understand how folks tend to think in that space. 
I am not a completely naive fool when it comes to these things.But what I wasn't prepared for was the nature of our internal, relatively casual conversations about a bunch of different things, where we'll be on a Zoom chat or something, and you will just very casually start sharing your screen, fire up a Jupyter Notebook and start writing code as you're talking to explain what it is you're talking about and watching it render in real time. And I'm sitting here going, “Huh, I can't figure out whether we should, like, wind up giving him a raise or try to burn him as a witch.” I could really see it going either way. Because it was magic and transformative from my perspective.Alex: Well, thank you. I mean, I think that part of what I am very grateful for is that I've had an opportunity to spend a considerable period of time in kind of both the academic and industrial spaces. I got a PhD, basically kept going to school until somebody told me that I had to stop, and then spent a lot of time at startups and had to do a lot of different kinds of work just to keep the wheels attached to the bus. And so, you know, when I arrived at Duckbill Group, I kind of looked around and said, “Okay, cool. There's all the stuff that's already here. That's awesome. What can I do to make that better?” And taking my lens, so to speak, and applying it to those problems, and trying to figure out, like, “Okay, well as a cloud economist, what do I need to do right now that sucks? And how do I make it not suck?”Corey: It probably involves a Managed NAT Gateway.Alex: Whoa, God. And honestly, like, I spent a lot of time developing a bunch of different tools that were really just there in the service of that. Like, take my job, make it easier. And I'm really glad that you liked what you saw there.Corey: It was interesting watching how we wound up working together on things. Like, there's a blog post that I believe is out by the time this winds up getting published—but if not, congratulations on listening to this, you get a sneak preview—where I was looking at the intelligent tiering changes in pricing, where any object below 128 kilobytes does not have a monitoring charge attached to it, and above it, it does. And it occurred to me on a baseline gut level that, well, wait a minute, it feels like there are some object sizes where, regardless of how long the object lives in storage and transitions to something cheaper, it will never quite offset that fee. So, instead of having intelligent tiering for everything, there's some cut-off point below which you should not enable intelligent tiering because it will always cost you more than it can possibly save you.And I mentioned that to you and I had to do a lot of articulating with my hands because it's all gut feelings stuff and this stuff is complicated at the best of times. And your response was, “Huh.” Then it felt like ten minutes later you came back with a multi-page blog post written—again—in a Python notebook that has a dynamic interactive graph that shows the breakeven and cut-off points, with deep-dive math showing exactly where in certain scenarios it is. And I believe the final takeaway was somewhere between 148 and 161 kilobytes, somewhere in that range is where you want to draw the cut-off. And I'm just looking at this and marveling, on some level.Alex: Oh, thanks. To be fair, it took a little bit more than ten minutes.
I think it was something where it kind of went through a couple of stages where at first I was like, “Well, I bet I could model that.” And then I'm like, “Well, wait a minute. There's actually, like—if you can kind of put the compute side of this all the way to the side and just remove all API calls, it's a closed form thing. Like, you can just—this is math. I can just describe this with math.”And cue the, like, Beautiful Mind montage where I'm, like, going onto the whiteboard and writing a bunch of stuff down trying to remember the point intercept form of a line from my high school algebra days. And at the end, we had that blog post. And the reason why I kind of dove into that headfirst was just this, I have this fascination for understanding how all this stuff fits together, right? I think so often, what you see is a bunch of little point things, and somebody says, “You should use this at this point, for this reason.” And there's not a lot in the way of synthesis, relatively speaking, right?Like, nobody's telling you what the kind of underlying thing is that makes it so that this thing is better in these circumstances than this other thing is. And without that, it's a bunch of, kind of, anecdotes and a bunch of kind of finger-in-the-air guesses. And there's a part of that, that just makes me sad, fundamentally, I guess, that humans built all of this stuff; we should know how all of it fits together. And—Corey: You would think, wouldn't you?Alex: Well, but the thing is, it's so enormously complicated and it's been developed over such an enormously long period of time, that—or at least, you know, relatively speaking—it's really, really hard to kind of get that and extract it out. But I think when you do, it's very satisfying when you can actually say like, “Oh no, no, we've actually done—we've done the analysis here. Like, this is exactly what you ought to be doing.” And being able to give that clear answer and backing it up with something substantial is, I think, really valuable from the customer's point of view, right, because they don't have to rely on us kind of just doing the finger-in-the-air guess. But also, like, it's valuable overall. It extends the kind of domain where you don't have to think about whether or not you've got the right answer there. Or at least you don't have to think about it as much.Corey: My philosophy has always been that when I have those hunches, they're useful, and it's an indication that there's something to look into here. Where I think it goes completely off the rails is when people, like, “Well, I have a hunch and I have this belief, and I'm not going to evaluate whether or not that belief is still one that is reasonable to hold, or there has been perhaps some new information that it would behoove me to figure out. Nope, I've just decided that I know—I have a hunch now and that's enough and I've done learning.” That is where people get into trouble.And I see aspects of it all the time when talking to clients, for example. People who believe things about their bill that at one point were absolutely true, but now no longer are. And that's one of those things that, to be clear, I see myself doing this. This is not something—Alex: Oh, everybody does, yeah.Corey: —I'm blaming other people for it all. 
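The closed-form break-even Alex describes a few exchanges back can be sketched in a handful of lines. The rates below are illustrative assumptions (the monitoring charge and tier prices vary by region and by which access tier objects settle into), so treat the output as the shape of the argument rather than a reproduction of the blog post's exact 148-161 KB figure.

```python
# Illustrative S3 Intelligent-Tiering rates; check your region's price list.
MONITORING_PER_OBJECT = 0.0025 / 1000   # $/object-month (charged per 1,000 objects)
FREQUENT_TIER_PER_GB = 0.023            # $/GB-month, frequent access tier
INFREQUENT_TIER_PER_GB = 0.0125         # $/GB-month, infrequent access tier

# Monthly savings per byte once an object settles into the cheaper tier.
savings_per_byte = (FREQUENT_TIER_PER_GB - INFREQUENT_TIER_PER_GB) / 1024**3

# Break-even: the size at which tiering savings exactly cover the monitoring
# fee. Anything smaller pays the fee forever without earning it back.
breakeven_bytes = MONITORING_PER_OBJECT / savings_per_byte
print(f"break-even object size = {breakeven_bytes / 1024:.0f} KB")
# With these assumed rates this prints roughly 250 KB; the 148-161 KB range
# in the blog post follows from different tier and transition assumptions.
```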
Every once in a while I have to go on a deep dive into our own AWS bill just to reacquaint myself with an understanding of what's going on over there.Alex: Right.Corey: And I will say that one thing that I was firmly convinced was going to happen during your tenure here was that you're a data person; hiring someone like you is the absolute most expensive thing you can ever do with respect to your AWS bill because hey, you're into the data space. During your tenure here, you cut the bill in half. And that surprises me significantly. I want to further be clear that did not get replaced by, “Oh, yeah. How do you cut your AWS bill by so much?” “We moved everything to Snowflake.” No, we did not wind up—Alex: [laugh].Corey: Just moving the data somewhere else. It's like, at some level, “Great. How do I cut the AWS bill by a hundred percent? We migrate it to GCP.” Technically correct; not what the customer is asking for.Alex: Right? Exactly, exactly. I think part of that, too—and this is something that happens in the data part of the space more than anywhere else—it's easy to succumb to shiny object syndrome, right? “Oh, we need a cloud data warehouse because cloud data warehouse, you know? Snowflake, most expensive IPO in the history of time. We got to get on that train.”And, you know, I think one of the things that I know you and I talked about was, you know, where should all this data that we're amassing go? And what should we be optimizing for? And I think one of the things that, you know, the kind of conclusions that we came to there was, well, we're doing some stuff here, that's kind of designed to accelerate queries that don't really need to be accelerated all that much, right? The difference between a query taking 500 milliseconds and 15 seconds, from our point of view, doesn't really matter all that much, right? And that realization alone, kind of collapsed a lot of technical complexity, and that, I will say we at Duckbill Group still espouse, right, is that cloud cost is an architectural problem, it's not a right-sizing your instances problem. And once we kind of got past that architectural problem, then the cost just sort of cratered. And honestly, that was a great feeling, to see the estimate in the billing console go down 47% from last month, and it's like, “Ah, still got it.” [laugh].Corey: It's neat to watch that happen, first off—Alex: For sure.Corey: But it also happened as well, with increasing amounts of utility. There was a new AWS billing page that came out, and I'm sure it meets someone's needs somewhere, somehow, but the things that I always wanted to look at when I want someone to pull up their last month's bill is great, hit the print button—on the old page—and it spits out an exploded pdf of every type of usage across their entire AWS estate. And I can skim through that thing and figure out what the hell's going on at a high level. And this new thing did not let me do that. And that's a concern, not just for the consulting story because with our clients, we have better access than printing a PDF and reading it by hand, but even talking to randos on the internet who were freaking out about an AWS bill, they shouldn't have to trust me enough to give me access into their account. 
They should be able to get a PDF and send it to me.Well, I was talking with you about this, and again, in what felt like ten minutes, you wound up with a command line tool, run it on an exported CSV of a monthly bill and it spits it out as an HTML page that automatically collapses in and allocates things based upon different groups and service type and usage. And congratulations, you spent ten minutes to create a better billing experience than AWS did. Which feels like it was probably, in fairness to AWS, about seven-and-a-half minutes more time than they spent on it.Alex: Well, I mean, I think that comes back to what we were saying about, you know, not all the interesting problems in data are in data that doesn't fit in RAM, right? I think, in this case, that came from two places. I looked at those PDFs for a number of clients, and there were a few things that just made my brain hurt. And you and Mike and the rest of the folks at Duckbill could stare at the PDF, like, reading the matrix because you've seen so many of them before and go, ah, yes, “Bill spikes here, here, here.” I'm looking at this and it's just a giant grid of numbers.And what I wanted was I wanted to be able to say, like, don't show me the services in alphabetical order; show me the service is organized in descending order by spend. And within that, don't show me the operations in alphabetical order; show me the operations in decreasing order by spend. And while you're at it, group them into a usage type group so that I know what usage type group is the biggest hitter, right? The second reason, frankly, was I had just learned that DuckDB was a thing that existed, and—Corey: Based on the name alone, I was interested.Alex: Oh, it was an incredible stroke of luck that it was named that. And I went, “This thing lets me run SQL queries against CSV files. I bet I can write something really fast that does this without having to bash my head against the syntactic wall that is Pandas.” And at the end of the day, we had something that I was pretty pleased with. But it's one of those examples of, like, again, just orienting the problem toward, “Well, this is awful.”Because I remember when we first heard about the new billing experience, you kind of had pinged me and went, “We might need something to fix this because this is a problem.” And I went, “Oh, yeah, I can build that.” Which is kind of how a lot of what I've done over the last 15 years has been. It's like, “Oh. Yeah, I bet I could build that.” So, that's kind of how that went.Corey: This episode is sponsored in part by our friend EnterpriseDB. EnterpriseDB has been powering enterprise applications with PostgreSQL for 15 years. And now EnterpriseDB has you covered wherever you deploy PostgreSQL on-premises, private cloud, and they just announced a fully-managed service on AWS and Azure called BigAnimal, all one word. Don't leave managing your database to your cloud vendor because they're too busy launching another half-dozen managed databases to focus on any one of them that they didn't build themselves. Instead, work with the experts over at EnterpriseDB. They can save you time and money, they can even help you migrate legacy applications—including Oracle—to the cloud. To learn more, try BigAnimal for free. Go to biganimal.com/snark, and tell them Corey sent you.Corey: The problem that I keep seeing with all this stuff is I think of it in terms of having to work with the tools I'm given. 
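For the curious, the DuckDB trick Alex describes needs surprisingly little code. A minimal sketch follows; the file name and column names are assumptions in the style of the AWS Cost and Usage Report, so adjust them to whatever headers your billing export actually has.

```python
import duckdb

con = duckdb.connect()

# Query the exported billing CSV in place; DuckDB infers the schema.
# The column names are assumed CUR-style headers, not guaranteed to match
# every export format.
rollup = con.execute("""
    SELECT
        product_servicecode           AS service,
        line_item_usage_type          AS usage_type,
        line_item_operation           AS operation,
        SUM(line_item_unblended_cost) AS spend
    FROM read_csv_auto('bill.csv')
    GROUP BY 1, 2, 3
    ORDER BY spend DESC  -- biggest hitters first, not alphabetical order
""").fetchdf()

print(rollup.head(20))
```

Sorting every level by descending spend instead of alphabetically is the whole design point: the reader's eye lands on the expensive line items first.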
And yeah, I can spin up infrastructure super easily, but the idea of, I'm going to build something that manipulates data and recombines it in a bunch of different ways, that's not something that I have a lot of experience with, so it's not my instinctive, “Oh, I bet there's an easier way to spit this thing out.” And you think in that mode. You effectively wind up automatically just doing those things, almost casually. Which does make a fair bit of sense, when you understand the context behind it, but for those of us who don't live in that space, it's magic.Alex: I've worked in infrastructure in one form or another my entire career, data infrastructure mostly. And one of the things—I heard this from someone and I can't remember who it was, but they said, “When infrastructure works, it's invisible.” When you walk in the room and flip the light switch, the lights come on. And the fact that the lights come on is a minor miracle. I mean, the electrical grid is one of the most sophisticated, globally-distributed engineering systems ever devised, but we don't think about it that way, right?And the flip side of that, unfortunately, is that people really pay attention to infrastructure most when it breaks. But they are two edges of the same proverbial sword. It's like, I know, when I've done a good job, if the thing got built and it stayed built and it silently runs in the background and people forget it exists. That's how I know that I've done a good job. And that's what I aim to do really, everywhere, including with Duckbill Group, and I'm hoping that the stuff that I built hasn't caught on fire quite yet.Corey: The smoke is just the arising of the piles of money it wound up spinning up.Alex: [laugh].Corey: It's like, “Oh yeah, turns out that maybe we shouldn't have built a database out of pure Managed NAT Gateways. Yeah, who knew?”Alex: Right, right. Maybe I shouldn't have filled my S3 bucket with pure unobtainium. That was a bad idea.Corey: One other thing that we do here that I admit I don't talk about very often because people get the wrong idea, but we do analyst projects for vendors from time to time. And the reason I don't say that is, when people hear about analysts, they think about something radically different, and I do not self-identify as an analyst. It's, “Oh, I'm not an analyst.” “Really? Because we have analyst budget.” “Oh, you said analyst. I thought you said something completely different. Yes, insert coin to continue.”And that was fine, but unlike the vast majority of analysts out there, we don't form our opinions based upon talking to clients and doing deeper dive explorations as our primary focus. We're a team of engineers. All right, you have a product. Let's instrument something with it, or use your product for something and we'll see how it goes along the way. And that is something that's hard for folks to contextualize.What was really fun was bringing you into a few of those engagements just because it was interesting; at the start of those calls. “It was all great, Corey is here and—oh, someone else's here. Is this a security problem?” “It's no, no, Alex is with me.” And you start off those calls doing what everyone should do on those calls is, “How can we help?” And then we shut up and listen. Step one, be a good consultant.And then you ask some probing questions and it goes a little bit deeper and a little bit deeper, and by the end of that call, it's like, “Wow, Alex is amazing. 
I don't know what that Corey clown is doing here, but yeah, having Alex was amazing.” And every single time, it was phenomenal to watch as you, more or less, got right to the heart of their generally data-oriented problems. It was really fun to be able to think about what customers are trying to achieve through the lens that you see the world through.Alex: Well, that's very flattering, first of all. Thank you. I had a lot of fun on those engagements, honestly because it's really interesting to talk to folks who are building these systems that are targeting mass audiences of very deep-pocketed organizations, right? Because a lot of those organizations, the companies doing the building are themselves massive. And they can talk to their customers, but it's not quite the same as it would be if you or I were talking to the customers because, you know, you don't want to tell someone that their baby is ugly.And note, now, to be fair, we under no circumstances were telling people that their baby was ugly, but I think that the thing that is really fun for me is to kind of be able to wear the academic database nerd hat and the practitioner hat simultaneously, and say, like, “I see why you think this thing is really impressive because of this whiz-bang, technical thing that it does, but I don't know that your customers actually care about that. But what they do care about is this other thing that you've done as an ancillary side effect that actually turns out is a much more compelling thing for someone who has to deal with this stuff every day. So like, you should probably be focusing attention on that.” And the thing that I think was really gratifying was when you know that you're meeting someone on their level and you're giving them honest feedback and you're not just telling them, you know, “The Gartner Magic Quadrant says that in order to move up and to the right, you must do the following five features.” But instead saying, like, “I've built these things before, I've deployed them before, I've managed them before. Here's what sucks that you're solving.” And seeing the kind of gears turn in their head is a very gratifying thing for me.Corey: My favorite part of consulting—and I consider analyst style engagements to be a form of consulting as well—is watching someone get it, watching that light go on, and they suddenly see the answer to a problem that's been vexing them I love that.Alex: Absolutely. I mean, especially when you can tell that this is a thing that has been keeping them up at night and you can say, “Okay. I see your problem. I think I understand it. I think I might know how to help you solve it. Let's go solve it together. I think I have a way out.”And you know, that relief, the sense of like, “Oh, thank God somebody knows what they're doing and can help me with this, and I don't have to think about this anymore.” That's the most gratifying part of the job, in my opinion.Corey: For me, it has always been twofold. One, you've got people figuring out how to solve their problem and you've made their situation better for it. But selfishly, the thing I like the most personally has been the thrill you get from solving a puzzle that you've been toying with and finally it clicks. That is the endorphin hit that keeps me going.Alex: Absolutely.Corey: And I didn't expect when I started this place is that every client engagement is different enough that it isn't boring. It's not the same thing 15 times. Which it would be if it were, “Hi, thanks for having us. You haven't bought some RIs. 
You should buy some RIs. And I'm off.” It… yeah, software can do that. That's not interesting.Alex: Right. Right. But I think that's the other thing about both cloud economics and data engineering, they kind of both fit into that same mold. You know, what is it? “All happy families are alike, but each unhappy family is unhappy in its own way.” I'm butchering Chekhov, I'm sure. But like—if it's even Chekhov.But the general kind of shape of it is this: everybody's infrastructure is different. Everybody's organization is different. Everybody's optimizing for a different point in the space. And being able to come in and say, “I know that you could just buy a thing that tells you to buy some RIs, but it's not going to know who you are; it's not going to know what your business is; it's not going to know what your challenges are; it's not going to know what your roadmap is. Tell me all those things and then I'll tell you what you shouldn't pay attention to and what you should.”And that's incredibly, incredibly valuable. It's why, you know, it's why they pay us. And that's something that you can never really automate away. I mean, you hear this in data all the time, right? “Oh, well, once all the infrastructure is managed, then we won't need data infrastructure people anymore.”Well, it turns out all the infrastructure is managed now, and we need them more than we ever did. And it's not because this managed stuff is harder to run; it's that the capabilities have increased to the point that they're getting used more. And the more that they're getting used, the more complicated that use becomes, and the more you need somebody who can think at the level of what does the business need, but also, what the heck is this thing doing when I hit the run key? You know? And that I think, is something, particularly in AWS where I mean, my God, the amount and variety and complexity of stuff that can be deployed in service of an organization's use case is—it can't be contained in a single brain.And being able to make sense of that, being able to untangle that and figure out, as you say, the kind of the aha moment, the, “Oh, we can take all of this and just reduce it down to nothing,” is hugely, hugely gratifying and valuable to the customer, I'd like to think.Corey: I think you're right. And again, having been doing this in varying capacities for over five years—almost six now; my God—the one thing has been constant throughout all of that is, our number one source for new business has always been word of mouth. And there have been things that obviously contribute to that, and there are other vectors we have as well, but by and large, when someone winds up asking a colleague or a friend or an acquaintance about the problem of their AWS bill, and the response almost universally, is, “Yeah, you should go talk to The Duckbill Group,” that says something that validates that we aren't going too far wrong with what we're approaching. Now that you're back on the freelance data side, I'm looking forward to continuing to work with you, if through no other means and being your customer, just because you solve very interesting and occasionally very specific problems that we periodically see. 
There's no reason that we can't bring specialists in—and we do from time to time—to look at very specific aspects of a customer problem or a customer constraint, or, in your case for example, a customer data set, which, “Hmm, I have some thoughts on here, but just optimizing what storage class that three petabytes of data lives within seems like it's maybe step two, after figuring what the heck is in it.” Baseline stuff. You know, the place that you live in that I hand-wave over because I'm scared of the complexity.Alex: I am very much looking forward to continuing to work with you on this. There's a whole bunch of really, really exciting opportunities there. And in terms of word of mouth, right, same here. Most of my inbound clientele came to me through word of mouth, especially in the first couple years. And I feel like that's how you know that you're doing it right.If someone hires you, that's one thing, and if someone refers you, to their friends, that's validation that they feel comfortable enough with you and with the work that you can do that they're not going to—you know, they're not going to pass their friends off to someone who's a chump, right? And that makes me feel good. Every time I go, “Oh, I heard from such and such that you're good at this. You want to help me with this?” Like, “Yes, absolutely.”Corey: I've really appreciated the opportunity to work with you and I'm super glad I got the chance to get to know you, including as a person, not just as the person who knows the data, but there's a human being there, too, believe it or not.Alex: Weird. [laugh].Corey: And that's the important part. If people want to learn more about what you're up to, how you think about these things, potentially have you looked at a gnarly data problem they've got, where's the best place to find you now?Alex: So, my business is called Bits on Disk. The website is bitsondisk.com. I do write occasionally there. I'm also on Twitter at @alexras. That's Alex-R-A-S, and I'm on LinkedIn as well. So, if your lovely listeners would like to reach me through any of those means, please don't hesitate to reach out. I would love to talk to them more about the challenges that they're facing in data and how I might be able to help them solve them.Corey: Wonderful. And we will of course, put links to that in the show notes. Thank you again for taking the time to speak with me, spending as much time working here as you did, and honestly, for a lot of the things that you've taught me along the way.Alex: My absolute pleasure. Thank you very much for having me.Corey: Alex Rasmussen, data engineering consultant at Bits on Disk. I'm Cloud Economist Corey Quinn. This is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry comment that is so large it no longer fits in RAM.Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.Announcer: This has been a HumblePod production. Stay humble.

Screaming in the Cloud
Data Analytics in Real Time with Venkat Venkataramani

Screaming in the Cloud

Play Episode Listen Later Apr 27, 2022 38:41


About VenkatVenkat Venkataramani is CEO and co-founder of Rockset. In his role, Venkat helps organizations build, grow and compete with data by making real-time analytics accessible to developers and data teams everywhere. Prior to founding Rockset in 2016, he was an Engineering Director for the Facebook infrastructure team that managed online data services for 1.5 billion users. These systems scaled 1000x during Venkat's eight years at Facebook, serving five billion queries per second at single-digit millisecond latency and five 9's of reliability. Venkat and his team also created and contributed to many noted data technologies and open-source projects, including Facebook's TAO distributed data store, RocksDB, Memcached, MySQL, MongoRocks, and others. Prior to Facebook, Venkat worked on tools to make the Oracle database easier to manage. He has a master's in computer science from the University of Wisconsin-Madison, and a bachelor's in computer science from the National Institute of Technology, Tiruchirappalli.Links Referenced: Company website: https://rockset.com Company blog: https://rockset.com/blog TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and it's spelled R-E-V-E-L-O. It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn't limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up screening all of their talent on English ability, as well as, you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday as your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming.Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I'm going to just guess that it's awful because it's always awful. No one loves their deployment process. What if launching new features didn't require you to do a full-on code and possibly infrastructure deploy? 
What if you could test on a small subset of users and then roll it back immediately if results aren't what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest episode is one of those questions I really like to ask because it can often come across as incredibly, well, direct, which is one of the things I love doing. In this case, the question that I am asking is, when you look around at the list of colossal blunders that people make in the course of careers in technology and the rest, one of the most common is, “Oh, yeah. I don't like the way that this thing works, so I'm going to build my own database.” That is the siren call to engineers, and it is often the prelude to horrifying disasters. Today, my guest is Venkat Venkataramani, co-founder and CEO at Rockset. Venkat, thank you for joining me.Venkat: Thanks for having me, Corey. It's a pleasure to be here.Corey: So, it is easy for me to sit here in my beautiful ivory tower that is crumbling down around me and use my favorite slash the best database imaginable, which is TXT records shoved into Route 53. Now, there are certainly better databases than that for most use cases. Almost anything really, to be honest with you, because that is a terrifying pattern; good joke, terrible practice. What is Rockset as we look at the broad landscape of things that store data?Venkat: Rockset is a real-time analytics platform built for the cloud. Let me break that down a little bit, right? I think it's a very good question when you say does the world really need another database? Don't we have enough already? SQL databases, NoSQL databases, warehouses, and lake houses now.So, if you really break it down, the first digital transformation that happened in the '80s was when people actually retired pen and paper records and started using a relational database to actually manage their business records and what have you instead of ledgers and books and what have you. And that was the first digital transformation. That was—and Oracle called the rows in a table ‘records' for a reason. They're called records to this date. And then, you know, 20 years later, when all businesses were doing system of record and transactions and transactional databases, then analytics was born, right?This was, like, the whole reason: people wanted to make better data-driven business decisions, and BI was born, and warehouses and data lakes started becoming more and more mainstream. And there was really a second category of database management systems, because the first category was very good at being a system of record, but not really good at the complex analytics that businesses were asking for to guide their decisions. Fast-forward 20 years from then, the nature of applications is changing. The world is going from batch to real-time, your data never stops coming, advent of Apache Kafka and technologies like that, 5G, IoT, data is coming from all sorts of nooks and corners within an enterprise, and now customers in enterprises are acquiring the data in real-time at a scale that the world has never seen before.Now, how do you get analytics out of that? And then if you look at the database market—entire market—there are still only two large categories of databases: OLTP databases for transaction processing, and warehouses and data lakes for batch analytics.
Now suddenly, you need the speed of OLTP at the scale of batch, right, in terms of, like, complexity of compute, complexity of storage. So, that is really why we thought the data management space needs that third leg, and we call it real-time analytics platform or real-time analytics processing. And this is where the data never stops coming; the queries never stop coming.You need the speed and the scale, and it's about time we innovate and solve the problem well because in 2015, 2016, when I was researching for this, every company that was looking to build real-time applications was building a custom Rube Goldberg machine of sorts. And it was insanely complex, it was insanely expensive. Fast-forward now, you can build a real-time application in a matter of hours with the simplicity of the cloud using Rockset.Corey: There's a lot to be said about the way we used to do things after the first transformation, when we got into the world of batch processing, where—in the days of punch cards, which was a bit before my time and I believe yours as well—where they would drop them off and then the next day, or two days later, they would come back after the run, and they would get the results only to find a syntax error because you put the wrong card first or something like that. And it was maddening. In time, that got better, but still, nightly runs have become a thing to the point where even now, by default, if you wind up looking at the typical timing of a default Linux install, for example, you see that the middle of the night is when a bunch of things will rotate and various cleanup jobs get done, et cetera, et cetera. And that seemed like a weird direction to go in. One of the most famous Google April Fools Day jokes was when they put out their white paper on MapReduce.And then Yahoo fell for it hook, line, and sinker, built out Hadoop, and we've been stuck with this idea of performing these big query jobs on top of existing giant piles of data, where ideally, you can measure it with a wall clock; in practice, you often measure it with a calendar in some cases. And as the world continues to evolve, being able to do streaming processing and understand in real-time what is going on is unlocking different approaches, at least by all accounts. Do you have an example you can give me of a problem that real-time analytics solves for a customer? Because I can sit here and talk all day about how things might theoretically work, but I have to get out of my Route 53-based ivory tower over here, what are customers seeing?Venkat: That's a great question. And I one hundred percent agree. I think Google did build MapReduce, and I think it's a very nice continuation of what happened there and what is happening in the world now. They built MapReduce and quickly realized re-indexing the whole world [laugh] every night, as the size of the internet is exploding, is a bad idea. And you know how Google indexes now? They do real-time indexing.That is how they index the wor—you know, web. And they look for the changes that are happening in the internet, and they only index the changes. And that is exactly the same principle behind—one of the core principles behind Rockset's real-time analytics platform. So, what is the customer story?
So, let me give you one of my favorite ones.So, the world's number one or number two buy now, pay later company, they have hundreds of millions of users, they have 300,000-plus merchants, they operate in, like, maybe 100-plus countries, so many different payment methods, you can imagine the complexity. At any given point in time, some part of the product is broken, well, Apple Pay stopped working in Switzerland for this e-commerce merchant. Oh God, like, we got to first detect that. Forget even debugging and figuring out what happened and having an incident response team. So, what did they do as they scaled the number of payments processed in the system across the world—it's, like, in millions; first it was millions in a day, and then it was millions in an hour—so like everybody else, they built a batch-based system.So, they would accumulate all these payment records, and every six hours—so initially, it was a day, and then afterwards, you know, you try to see how far you can push it, and they couldn't push it beyond every six hours. Every six hours, some batch job would come and process through all the payments that happened, have some statistical models to detect, hey, here are some of the things that you might want to double-click and follow up on. And as they were scaling, the batch job that they would kick off every six hours was starting to take more than six hours. So, you can see how the story goes. Now, fast-forward, they came to us and said—it's almost like Rockset has, like, a big red button that says, “Real-time this.”And then they're kind of like, “Can you make this real-time? Because not only are we losing millions of potential revenue dollars in a year because something stops working and we're not processing payments, and we don't find out about that up to, like, three hours later, five hours later, six hours later, but our merchants are also very unhappy. We are also not able to protect our customers' business because that is all we are about.” And so fast-forward, they use Rockset, and simply using SQL, all the metrics and statistical computation that they want to do now happen in real-time and are accurate up to the second. All of their anomaly detectors run every minute and the anomaly detectors take, like, hundreds of milliseconds to run.And so, now they've cut down the business observability lag, I would say. It's not metrics and machine observability; it's actually the—you know, they now have business observability in real-time. And that not only actually saves them a lot of potential revenue loss from downtimes, it's also allowing them to build a better product and give their customers a better experience because they are now telling their merchants and their customers that something is not working in some part of your e-commerce footprint before even the customers notice that something is wrong. And that allows them to build a better product and a better customer experience than their competitors. So, this is a very real-world example of why companies and enterprises are moving from batch to real-time.Corey: The stories that you, and frankly a lot of other data analytics companies, tend to fall back on all the time are ones like the one you're telling, where you're talking about the largest buy now, pay later lender, for example. These are companies operating at massive scale who have tremendous existing transaction volume, and they're built out already. That's great, but then I wanted to try to cut to the truth of some of these things.
And when I visit your pricing page at Rockset, it doesn't have what I would expect if that were the only use case. And what that would be is, "Great. Call here to conta—open up a sales quote, and we'll talk to you et cetera, et cetera, et cetera." And the answer then is, "Okay, I know it's going to have at least two commas in it, ideally, not three, but okay, great." Instead, you have a free tier where it's, "Hey, we'll give you a pile of credits, here's some limits on our free account, et cetera, et cetera." Great. That is awesome. So, it tells me that there is a use case here for folks who have not already, on some level, made a good show of starting the process of conquering the world. Rather, someone with an idea some evening at two in the morning can wind up diving in and getting started. What is the Twitter for Pets, in my garage, spare-time side project story for using something like Rockset? What problem will I have as I wind up building those things out, when I don't have any user traffic or data yet, but I want to, you know, for once in my life, do the smart thing in advance rather than building an impressive tower of technical debt? Venkat: That is the first thing we built, by the way. When we finished our product, the first thing we built was self-service. The first thing we built was a free forever tier, which has certain limits because somebody has to pay the bill, right? And then we also have compute instances that are very, very affordable that cost you, like, approximately $1 a day. And so, we built all of that because real-time analytics is not a need that only, like, the large-scale companies have. And I'll give you a very, very simple example. Let's say you're building a game; it's a mobile game. You can use Amazon DynamoDB and use AWS Lambdas and have a serverless stack, and, like, you're really only paying… you're kind of keeping your footprint very, very small, and you're able to build a very lively game and see if it gets [wider 00:12:16], and it's growing. And once it grows, you can have all the big company scaling problems. But in the early days, you're just getting started. Now, if you think about DynamoDB and Lambdas and whatnot, you can build almost every part of the game except probably the leaderboard. So, how do I build a leaderboard when thousands of people are playing and all of their individual gameplays and scores and everything is just another simple record in DynamoDB? It's all serverless. But DynamoDB doesn't give me a SQL SELECT *, order by score, limit 100, distinct by the same player. No, this is an analytical question, and it has to be updated in real-time; otherwise, you really don't have this thing where I just finished playing, I go to the leaderboard, and within a second or two, if it doesn't update, you kind of lose people along the way.
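For illustration only, the leaderboard query Venkat sketches in words might look something like the snippet below. The table name, columns, endpoint, and request shape are all hypothetical stand-ins; this is not a real schema and not Rockset's documented API.

```typescript
// A hypothetical leaderboard query in the spirit of the example above.
// "game_events" and its columns are made up for illustration.
const leaderboardSql = `
  SELECT player_id, MAX(score) AS best_score
  FROM game_events
  GROUP BY player_id         -- one row per player ("distinct by the same player")
  ORDER BY best_score DESC   -- "order by score"
  LIMIT 100                  -- "limit 100"
`;

// Sketch of issuing that query over HTTP against an analytics service.
// The URL and auth header are placeholders, not a real endpoint.
async function fetchLeaderboard(): Promise<unknown> {
  const response = await fetch("https://analytics.example.com/v1/query", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: "ApiKey YOUR_KEY_HERE",
    },
    body: JSON.stringify({ sql: { query: leaderboardSql } }),
  });
  return response.json(); // rows for the top 100 players
}
```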
So, this is actually one of the very popular use cases when the scale is much smaller, which is, like, Rockset augments a NoSQL database like a Dynamo or a Mongo—where you can continue to use that for—or even a Postgres or MySQL in that case—where you can use that as your system of record and keep it small, but cover all of your compute-heavy and analytical parts of your application with Rockset. So, it's almost like kind of a CQRS pattern where you use your OLTP database as your system of record, you connect Rockset to it, and so—Rockset comes in with built-in connectors, by the way, so you don't have to write a single line of code for your inserts and updates and deletes in your transactional database to get reflected in Rockset within one to two seconds. And so now, all of a sudden you have a fully indexed, fast SQL replica of your transactional database on which you can do all sorts of analytical queries, and that's fully isolated from your transactional database. So, this is the pattern that I'm talking about. The mobile leaderboard is an example of that pattern where it comes in very handy. But you can imagine almost everybody building some kind of an application has certain parts of it that are very analytical in nature. And by augmenting your transactional database with Rockset, you can have your cake and eat it too. Corey: One of the challenges I think that at least I've run into when it comes to working with data—and let's be clear, I tend to deal with data in relatively small volumes, mostly. The stuff that's significantly large, like, oh, I don't know, AWS bills from large organizations, the format of those is mostly predefined. When I'm building something out, we're using, I don't know, DynamoDB or being dangerous with SQLite or whatnot, invariably I find that even at small-scale, I paint myself into a corner by data model design or how I wind up structuring access or the rest, and the thing that I'm doing that makes perfect sense today winds up being incredibly challenging to change later. And I still, in production, have a DynamoDB table that has the word 'test' in its name because of course I do. It's not a great place to find yourself in some cases. And I'm curious as to what you've seen, as you've been building this out and watching customers, especially ones who already had significant datasets as they move to you. Do you have any guidance around how to avoid falling down that particular well? Venkat: I will say a lot of the complexity in this world comes from solving the right problems using the wrong tool, or from solving the right problem on the wrong part of the stack. I'll unpack this a little bit, right? So, when your patterns change, your application is getting more complex, it is demanding more things, that doesn't necessarily mean the first part of the application you built—and let's say DynamoDB was your solution for that—was the wrong choice. That is the right choice, but now you've expanded the scope of your application and the demand that you have on your backend transactional database. And now you have to ask the question: now, in the expanded scope, which ones are still more of the same category of things for which I chose Dynamo, and which ones are actually not at all? And so, instead of going and abusing the GSIs and other really complex and expensive indexing options and whatnot that Dynamo, you know, has built, and that have all sorts of limitations—instead of that, what do I really need and what is the best tool for the job, right? What is the best system for that?
And how do I augment? And how do I manage these things? And this goes to the first thing I said, which is, like, this tremendous complexity when you start to build a Rube Goldberg machine of sorts. Okay, now, I'm going to start making changes to Dynamo. Oh, God, like, how do I pick up all of those things and not miss a single record? Now, replicate that to a second system that is going to be search-centric or reporting-centric, and do I have to rethink this once in a while? Do I have to build and manage these pipelines? And suddenly, instead of going from one system to two systems, you actually end up going from one system to, like, four different things with all the pipes and tubes going in the middle. And so, this is what we really observed. And so, when you come to Rockset and you point us at your DynamoDB table, you don't write a single line of code, and Rockset will automatically scan your Dynamo tables, move that into Rockset, and in real-time, your changes, inserts, updates, deletes to Dynamo will be reflected in Rockset. And this is all using the Dynamo Streams API, the Dynamo Scan API, and whatnot, behind the scenes. And this just gives you an example: if you use the right tool for the job here, when suddenly your application is demanding analytical queries on Dynamo, and you do the right research and find the right tool, your complexity doesn't explode at all, and you can still, again, continue to use Dynamo for what it is very, very good at while augmenting that with a system built for analytics with full-featured SQL and other capabilities that I can talk about, for the parts of your application for which Dynamo is not a good fit. And so, if you use the right tool for the job, you should be in a very good place. The other thing is the part about the wrong part of the stack. I'll give a very kind of naive example, and then maybe you can extrapolate that to, like, other patterns on how people could—you know, accidental complexity is the worst. So, let's just say you need to implement access control on your data. Let's say the best place to implement access control is at the database level; it just happens to be that that is the right thing. But this database that I picked doesn't really have role-based access control or what have you; it doesn't really give me all the security features to be able to protect the data the way I want. So, then what I'm going to do is, I'm going to go look at all the places that actually have business logic and query the database, and I'm going to put a whole bunch of permission management and roles and privileges there, and you can just see how that will be so error-prone, so hard to maintain, and it will be impossible to scale. And this is the worst form of accidental complexity because you had just looked at it for that one week or two weeks, like, how do I get something out when the database I picked doesn't have it, and then two weeks in, you feel like you made some progress by, kind of like, putting some duct-tape if conditions on all the access paths. But now, [laugh] you've just painted yourself into a really, really bad corner. And so, this is another variation of the same problem where you end up solving the right problems in the wrong part of the stack, and that just introduces a tremendous amount of accidental complexity. And so, I think, yeah, both of these are common pitfalls that I think people fall into. I think it's easy to avoid them.
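As a rough illustration of the change-capture pattern Venkat alludes to, the sketch below tails a DynamoDB stream with the AWS SDK for JavaScript v3. It shows the general shape of the technique only; it is not Rockset's connector code, and the single-shard loop and the applyChange stub are simplifying assumptions.

```typescript
import {
  DynamoDBStreamsClient,
  GetShardIteratorCommand,
  GetRecordsCommand,
} from "@aws-sdk/client-dynamodb-streams";

const client = new DynamoDBStreamsClient({ region: "us-east-1" });

// Stand-in for "reflect this change in the analytics system."
function applyChange(eventName?: string, newImage?: unknown): void {
  console.log(eventName, newImage);
}

// Tail a single shard of a table's stream and forward every change.
// A real connector would discover shards via DescribeStream, checkpoint
// its position, and run one loop per shard.
async function tailShard(streamArn: string, shardId: string): Promise<void> {
  const { ShardIterator } = await client.send(
    new GetShardIteratorCommand({
      StreamArn: streamArn,
      ShardId: shardId,
      ShardIteratorType: "TRIM_HORIZON", // start from the oldest buffered change
    })
  );

  let iterator = ShardIterator;
  while (iterator) {
    const { Records = [], NextShardIterator } = await client.send(
      new GetRecordsCommand({ ShardIterator: iterator })
    );
    for (const record of Records) {
      // eventName is INSERT, MODIFY, or REMOVE; NewImage is the row's new state.
      applyChange(record.eventName, record.dynamodb?.NewImage);
    }
    iterator = NextShardIterator;
  }
}
```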
I would say there's so much research, there's so much content, and if you know how to search for these things, they're available on the internet. It's a beautiful place. [laugh]. But I guess you have to know how to search for these things. But in my experience, these are the two common pitfalls a lot of people fall into and paint themselves into a corner. Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured, and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing-fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing. Corey: A question I have, though, that is an extension is this—and I want to give some flavor to it—but why is there a market for real-time analytics? And what I mean by that is, early on in my tenure of fixing horrifying AWS bills, I saw a giant pile of money being hurled over at effectively a MapReduce cluster for Elastic MapReduce. Great. Okay, well, stream-processing is kind of a thing; what about migrating to that? Well, that was a complete non-starter because it wasn't just the job running on those things; there were downstream jobs with their own downstream jobs. There were thousands of business processes tied to that thing. And similarly, the idea of real-time analytics, we don't have any use for that because, oh, I don't know, I only wind up pulling these reports on a once-a-week basis, and that's fine, so what do I need that updated for in real-time if I'm looking at them once a week? In practice, the answer is often something aligned with the, "Well, yeah, but if you had a real-time updating dashboard, you would find that more useful than those reports." But people's expectations and business processes have shaped themselves around constraints that now can be removed, but how do you get them to see that? How do you get them to buy in on that? And then how do you untangle that enormous pile of previous constraints into something that leverages the technology that's now available for a brighter future? Venkat: I think [unintelligible 00:21:40] a really good question: who are the people moving to real-time analytics? What do they see? And why can't they do it with other tech? Like, you know, as you say… EMR, you know, it's just MapReduce; can't I just run it sort of every twenty-four hours, every six hours, every hour? How about every five minutes? It doesn't work that way. Corey: How about I spin up a whole bunch of parallel clusters on different timescales so I constantly—Venkat: [laugh]. Corey: Have a new report coming in. It's real-time, except—Venkat: Exactly. Corey: You're constantly putting out new ones, but they're just six hours delayed every time. Venkat: Exactly. So, you don't really want to do this. And so, let me unpack it one at a time, right? I mean, we talked about a very good example of a business team which is building business observability at the buy now, pay later company.
That's a very clear value-prop on why they want to go from batch to real-time, because it saves their company tremendous losses—potential losses—and also allows them to build a better product. So, it could be a marketing operations team looking to get more real-time observability to see what campaigns are working well today and how do I double down and make sure my ad budget for the day is put to good use? I don't have to mention security operations, you know, needing real-time. Don't tell me I got owned three days ago. Tell me—[laugh] somebody is, you know, breaking glass and might be, you know, entering into your house right now. And tell me then and not three days later, you know—Corey: "Yeah, what alert system do you have for security intrusion?" "So, I read the front page of The New York Times every morning, waiting to see my company's name." Yeah, there probably are better ways to reduce that cycle time. Venkat: Exactly, right. And so, that is really the need, right? Like, I think more and more business teams are saying, "I need operational intelligence and not business intelligence." Don't make me play Monday morning quarterback. My favorite analogy is: it's the middle of the third quarter. I'm six points down. A couple of people, star players on my team and my opponent's team, are injured, some on offense, some on defense. What plays do I call, and how do I play the game slightly differently to change the outcome of the game and win this game as opposed to losing by six points? So, that I think is kind of really what is driving businesses. You know, I want to be more agile, I want to be more nimble, and take, kind of, data-driven decision-making to another level. So that, I think, is the real force in play. So, now the real question is, why can't they do it already? Because if you go ask a hundred people, "Do you want fast analytics on real-time data or slow analytics on stale data?" how many people are going to say give me slow and stale? Zero, right? Exactly zero people. So, but then why hasn't it happened yet? I think it goes back to the fact that the world has only seen two kinds of databases: transaction processing systems, built for system-of-record, don't-lose-my-data kinds of systems; and then batch analytics, you know, all these warehouses and data lakes. And so, in real-time analytics use cases, the data never stops coming, so you actually need a system that is running 24/7. And then what happens is, as soon as you build a real-time dashboard, like this example that you gave, which is, like, I just want all of these dashboards to automatically update all the time, immediately people respond and say, "But I'm not going to be like Clockwork Orange, you know, toothpicks in my eyelids, staring at this 24/7. Can you do something to alert or detect some anomalies and tap on my shoulder when something off is going on?" And so, now what happens is somebody—a program more than a person—is actually actively monitoring all of these metrics and graphs and doing some analysis, and only bringing it to your attention when you really need it because something is off, right? So, then suddenly what happens is you went from accumulate all the data and run a batch report to [unintelligible 00:25:16], like, the data never stops coming, the queries never stop coming, I never stop asking questions; it's just a programmatic way of asking those things. And at that point, you have a data app. This is not an analytics dashboard report anymore.
You have a full-fledged application. In fact, that application is harder to build and scale than any application you've ever built before [laugh] because in those situations, again, you don't have this torrent of data coming in all the time and complex analytical questions being asked of the data 24/7, you know? And so, that I think is really why a real-time analytics platform has to be built as almost a third leg. So, this is what we call data apps, which is when your data never stops coming and your queries never stop coming. So, this is really, I think, what is pushing all the expensive EMR clusters, or the misuse of your warehouse, the misuse of your data lakes. At the end of the day, that is, I think, what is blowing up your Snowflake bills, what is blowing up your warehouse bills: you somehow accidentally used the wrong tool for the job, [laugh] going back to the one that we just talked about. You accidentally say, "Oh, God, like, I just need some real-time." With enough thrust, pigs can fly. Is that a good idea? Probably not, right? And so, I don't want to be building a data app on my warehouse just because I can. You should probably use the best tool for the job, and really use something that was built from the ground up for it. And I'll give you one technical insight about how real-time analytics platforms are different than warehouses. Corey: Please. I'm here for this. Venkat: Yes. So really, if you think about warehouses and data lakes, I call them storage-optimized systems. I've been building databases all my life, so if I have to really build a database that is for batch analytics, you just break down all of your expenses in terms of, let's say, compute and storage. What I'm burning 24/7 is storage. Compute comes and goes when I'm doing a batch data load, or when an analyst logs in and tries to run some queries. But what I'm actually burning 24/7 is storage, so I want to compress the heck out of the data, and I want to store it in very cheap media. I want to store it—and I want to make the storage as cheap as possible, so I want to optimize the heck out of the storage use. And I want to make computation on that possible but not efficient. I can shuffle things around and make the analysis possible, but I'm not trying to be compute-efficient. And we just talked about how, as soon as you get into real-time analytics, you very quickly get into the data app business. You're not building a real-time dashboard anymore; you're actually building your application. So, as soon as you get into that, what happens is you start burning both storage and compute 24/7. And we all know, relatively, [laugh] compute and RAM is about a hundred to a thousand times more expensive than storage in the grand scheme of things. And so, if you actually go and look at your Snowflake bill, if you go look at your warehouse bill—BigQuery, no matter what—I bet the computational part of it is about 90 to 95% of the bill and not the storage. And then, if you, again, break down, okay, who's spending all the compute, you'll very quickly narrow down all these real-time-y and data app-y use cases where you can never turn off the compute on your warehouse or your BigQuery, and those are the ones that are blowing up your costs and complexity. And on the Rockset side, we are actually not storage-optimized; we're compute-optimized. So, we index all the data as it comes in.
And so, the storage actually goes slightly higher because, you know, we store the data and also the indexes of that data automatically, but we usually fold the computational cost to a quarter of what a typical warehouse needs. So, the TCO for our customers goes down two to four fold, you know? It goes down by half or even to a quarter of what they used to spend. Even though their storage cost goes up, in net that is a very, very small fraction of their spend. And so really, I think, good real-time analytics platforms are all compute-optimized and not storage-optimized, and that is what allows them to be a lot more efficient at being the backend for these data applications. Corey: As someone who spends a lot of time staring into the depths of AWS bills, I think that people also lose sight of the reality that it doesn't matter what you're spending on AWS; it invariably pales in comparison to what you're spending on people to work with these things. The reason to go to cloud is not because it is the cheapest possible way to get computers to do things; it's because it's a capability story. It's about unlocking capacity and capabilities you do not have otherwise. And that dramatically increases your feature velocity and it lets you achieve things faster, sooner, with better results. And unlocking a capability is always going to be more interesting to a company than saving money on it. When a company cares first, last, and always about just save money, make the bill lower, the end, it's usually a company in decline. Or alternately, something very strange is going on over there. Venkat: I agree with that. One of our favorite customers told us that Rockset took their six-month roadmap and shrunk it to a single afternoon. They're a supply chain SaaS backend for heavy construction; 80% of the concrete being delivered and tracked in North America flows through their platform, and Rockset powers all of their real-time analytics and reporting. And before Rockset, what did they have? They had built a beautiful serverless stack using DynamoDB, even with AWS Lambdas and what-have-you. And why did they have to do all serverless? Because the entire team was two people. [laugh]. And maybe a third person they'll get once in a while, so 2.5. Brilliant people, like, you know, really pioneers of building an entire data stack on AWS in a serverless fashion; no pipes, no ETL. And then they were like, oh God, finally, I have to do something because my business demands and my customers are demanding real-time reporting on all of these concrete trucks and aggregate trucks delivering stuff. And real-time reporting is the name of the game for them, and so how do I power this? So, I have to build a whole bunch of pipes, deliver it to, like, some Elasticsearch or some kind of a cluster that I have to keep up in real-time. And this will take me a couple of months, that will take me a couple of months. They came into Rockset on a Thursday, built their MVP over the weekend, and they had the first working version of their product the following Tuesday. So—and then, you know, there was no turning back at that point; not a single line of code was written. You know, you just go and create an account with Rockset, point us at your Dynamo, and then off you go. You know, you can start using SQL and go start building your real-time application. So again, I think the tremendous value, I think a lot of customers like us, and a lot of customers love us.
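To put rough numbers on the storage-versus-compute trade-off described a few paragraphs up, here is a back-of-the-envelope sketch. Every figure is an illustrative assumption taken from the conversation (compute as roughly 90% of a warehouse bill, storage doubling, compute folded to a quarter), not a measured result.

```typescript
// Back-of-the-envelope TCO math; all numbers are illustrative assumptions.
const warehouseBill = { storage: 10, compute: 90 }; // compute ~90% of spend

const realTimeBill = {
  storage: warehouseBill.storage * 2, // indexes roughly double storage
  compute: warehouseBill.compute / 4, // compute folded to a quarter
};

const before = warehouseBill.storage + warehouseBill.compute; // 100
const after = realTimeBill.storage + realTimeBill.compute;    // 20 + 22.5 = 42.5

console.log(`TCO reduction: ${(before / after).toFixed(1)}x`); // ~2.4x
```

With compute at 95% of the bill instead, the same arithmetic gives roughly a 3x reduction, consistent with the two-to-four-fold range quoted above.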
And if you really ask them what is the one thing about Rockset that you really like, I think it'll come back to the same thing, which is, you gave me a lot of time back. What I thought would take six months is now a week. What I thought would be three weeks, we got that in a day. And that allows me to focus on my business. I want to spend more time with my stakeholders, you know, my CPO, my sales teams, and see what they need to grow our business and succeed, and not build yet another data pipeline and have data pipelines and other things coming out of my nose, you know? So, at the end of the day, the simplicity aspect of it is very, very important for real-time analytics because, you know, we can't really realize our vision for real-time being the new default in every enterprise wherever analytics is concerned without making it very, very simple and accessible to everybody. And so, that continues to be one of our core things. And I think you're absolutely right when you say the biggest expense is actually the people and the time and the energy they have to spend. And not having to stand up a huge data ops team that is building and managing all of these things is probably the number one reason why customers really, really like working with our product. Corey: I want to thank you for taking so much time to talk me through what you're working on these days. If people want to learn more, where's the best place to find you? Venkat: We are Rockset, I'll spell it out for your listeners: ROCKSET—rock set—rockset.com. You can go there, you can start a free trial. There is a blog; rockset.com/blog is a prolific blog that is very active. We have all sorts of stories there, you know, from engineers talking about how they implemented certain things to customer case studies. So, if you're really interested in this space, that's one space to follow and watch. If you're interested in giving this a spin, you know, you can go to rockset.com and start a free trial. If you want to talk to someone, there is, like, a 'Request Demo' button there; you click it, and one of our solutions people or somebody that is more familiar with Rockset will get in touch with you, and you can have a conversation with them. Corey: Excellent. And links to that will of course go in the [show notes 00:34:20]. Thank you so much for your time today. I appreciate it. Venkat: Thanks, Corey. It was great. Corey: Venkat Venkataramani, co-founder and CEO at Rockset. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting, crappy comment that I will immediately see show up on my real-time dashboard. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. Announcer: This has been a HumblePod production. Stay humble.

The Bike Shed
333: Tapas

The Bike Shed

Play Episode Listen Later Apr 12, 2022 41:53


Being pregnant is hard, but this tapas episode is good! Steph discovered and used a #yelling Slack channel and attended a remote magic show. Chris touches on TypeScript design decisions and edge cases. Then they answer a question captured from a client Slack channel regarding a debate about whether I18n should be used in tests and whether tests should break when localized text changes. This episode is brought to you by ScoutAPM (https://scoutapm.com/bikeshed). Give Scout a try for free today and Scout will donate $5 to the open source project of your choice when you deploy. Emma Bostian (https://twitter.com/EmmaBostian) Ladybug Podcast (https://www.ladybug.dev/) Gerrit (https://www.gerritcodereview.com/) Gregg Tobo the Magician (https://astonishingproductions.com/) Sean Wang - swyx - better twitter search (https://twitter.com/swyx/status/1328086859356913664) Twemex (https://twemex.app/) GitHub Pull Request File Tree Beta (https://github.blog/changelog/2022-03-16-pull-request-file-tree-beta/) Sam Zimmerman - CEO of Sagewell Financial on Giant Robots (https://www.giantrobots.fm/414) TypeScript 4.1 feature (https://devblogs.microsoft.com/typescript/announcing-typescript-4-1/) The Bike Shed: 269: Things are Knowable (Gary Bernhardt) (https://www.bikeshed.fm/269) TSConfig Reference - Docs on every TSConfig option (https://www.typescriptlang.org/tsconfig#noUncheckedIndexedAccess) Rails I18n (https://guides.rubyonrails.org/i18n.html) This episode is brought to you by Studio 3T (https://studio3t.com/free). Try Studio 3T's full suite of features for 30 days, no payment details needed. Become a Sponsor (https://thoughtbot.com/sponsorship) of The Bike Shed! Transcript: CHRIS: Hello and welcome to another episode of The Bike Shed, a weekly podcast from your friends at thoughtbot about developing great software. I'm Chris Toomey. STEPH: And I'm Steph Viccari. CHRIS: And together, we're here to share a bit of what we've learned along the way. So, Steph, what's new in your world? STEPH: Hey, Chris. There are a couple of new things in my world, so one of them that I wanted to talk about is the fact that being pregnant is hard. I feel like this is probably a known thing, but I feel like I don't hear it talked about as much as I'd really like, especially in sort of like a professional context. And so I just wanted to share for anyone else that may be listening, if you're also pregnant, this is hard. And I also really appreciate my team. Going through the first trimester is typically where you experience a lot of morning sickness and fatigue, and I had all of that. And so I was at the point that most of my days, I didn't even start till about noon, and even some days, starting at noon was a struggle. And thankfully, with the thoughtbot client that I'm working with, most of the teams are on West Coast hours, so that worked out pretty well. But I even shared a post internally and was like, "Hey, I'm not doing great in the mornings. And so I really can't facilitate any morning meetings. I can't be part of some of the hiring intros that we do," because we like to have a team lead provide a welcome and then a closing for anyone that's coming in for interview day. I couldn't do those, and those normally happen around 9:00 a.m. Eastern Time. And everybody was super supportive of it. So I really appreciate all of thoughtbot and my managers and team being so great about this. Also, the client team, they're wonderful. It turns out growing a little human is hard; I'm learning how hard it is while working full time.
It's an interesting challenge. Oh, and as part of that appreciation because…so there's just not a lot of women that I've worked with. This may be one of those symptoms of being in tech where, one, I haven't worked with tons of women, and then, two, I haven't worked with a woman who is also pregnant and going through that as well. So it's been a little bit isolating in that experience. But there is someone that I follow on Twitter, @EmmaBostian. She's also one of the co-hosts for the Ladybug Podcast. And she has been just sharing some of her, like, I am two months sleep deprived. She's had her baby now, and she is sharing some of that journey. And I really appreciate people who just share that journey and what they're going through because then it helps normalize it for me in terms of what I'm feeling. I hope this helps normalize it for anybody else that might be listening too. CHRIS: I certainly can't speak to the specifics of being pregnant. But I do think it's wonderful for you to use this space that we have here to try and forward that along and say what your experience is like and share that with folks and hopefully make it a little bit better for everyone else out there. Also, you snuck in a sneaky pro-tip there, which is work on the East Coast and have a West Coast team. That just sounds like the obvious correct way to go about this. STEPH: That has worked out really well and been very helpful for me. I'm already not a great morning person; I've tried. I've really strived at times to be a morning person because I just have this idea in my head that morning people get more stuff done. I don't think that's true, but I just have that idea. And I'm not the world's best morning person, so it has worked out for many reasons, but yeah, especially in helping me get through that first trimester and also just supporting family and other things that are going on. Oh, I also learned a pro-tip about Twitter. This is going to seem totally random, but it was relevant when I was searching for stuff on Twitter [laughs] that was related to tech and pregnancy. But I learned...because I wanted to be able to search for something that someone I follow had said, but I couldn't remember who said it. And so I found that in the search bar, I can add filter:follows. So you can have your search term, like if you're looking for cake or pregnancy, or sleep-deprived, and then add filter:follows, and that will filter the search results to everybody that you follow. I imagine that probably works for followers too, but I haven't tried it. CHRIS: I like the left turn you took us on there but still keeping it connected. On the topic of Twitter search, they apparently have a very powerful search, but it's also hidden, and you've got to know the specific syntax and whatnot. But there is a wonderful project by Shawn Wang, AKA Swyx, on the internet; bettertwitter.netlify.com is the URL for it. I will share a link to his tweet introducing it. But it's a really wonderful tool that just provides a UI for all of these different filters and configurations. And it both makes discoverability that much better and also makes it easy to just compose one of these searches and use that. The other thing that I'll recommend is, I think, a Chrome plugin; I'm guessing that's what I'm working with here, like a browser extension. It's called Twemex, T-W-E-M-E-X. And there's a sidebar in Twitter now, which just seems wonderful and useful.
So as I'm looking at a Swyx post here, or a tweet as they're called on Twitter because I know that vernacular, there's a sidebar which is specific to Shawn Wang. And there's a search at the top so I can search within it. But it's just finding their most popular tweets and putting that in a sidebar. It's a very useful contextual addition to Twitter that I found just awesome. So that combination of things has made my Twitter experience much better. So yeah, we'll have show notes for both of those as well. STEPH: Nice. I did not know about those. This may cause someone to laugh at me because maybe it's easier than I think. But I can never remember that advanced search that Twitter does offer; I have to search for it every time. I just go to Google, and I'm like, advanced Twitter search, and then it brings up a site for me, and then I use that as the one that Twitter does provide. But yeah, from the normal UI, I don't know how to get there. Maybe I haven't tried hard enough. Maybe it's hidden. CHRIS: It's like they're hiding it. STEPH: Yeah, one of those. [laughs] CHRIS: It's very costly. They have to, like, MapReduce the entire internet in order to make that search work. So they're like, well, what if we hide it because it's like 50 cents per query? And so maybe we shouldn't promote this too much. STEPH: [laughs] CHRIS: And let's just live in the moment, everybody. Let's just swim in the Twitter stream rather than look back at the history. I make guesses about the universe now. STEPH: [laughs] On a different note, I also discovered at thoughtbot, in our variety of Slack channels, that we have a yelling channel, and I had not used it before. I had not hung out there before. It's a delightful channel. It's a place that you just go, and you type in all caps. You can yell about anything that you would like to. And I specifically needed to yell about Gerrit, which is the replacement or the alternative that we're using for GitHub or GitLab, or Bitbucket, or any of those services. So we're using Gerrit, and I've been working to feel comfortable with the UI and then be able to review CRs and things like that. My vernacular is also changing because my team refers to them as change requests instead of pull requests. So I'm floating back and forth between CRs and PRs. And because I'm in Gerrit world, I missed some of the updates that GitHub made to their pull request review screen. And so then I happened to hop into GitHub one day, and I saw it, and I was like, what is this? So that was novel. But going back to yelling, I needed to yell about Gerrit because I have not found a way to collaborate with someone who has already pushed up changes. I have found ways that I can pull their changes, which took a little while. I found it in a sneaky little tab called Download. I didn't expect it to be there. But then the actual snippet is like, run this in your terminal, and this is then how you pull down the changes. And I'm like, okay, so I did that. But I can't push to their existing changes because then I get, like, well, you're not the owner, so we're going to block you, which is like, cool, cool, cool. Okay, I kind of get that because you don't want me messing up somebody else's content or something that they've done. But I really, really, really want to collaborate with this person, and we're trying to do something together, and you're blocking me. And so I had to go to the yelling channel, and I felt better. And I'm yelling again.
[laughs] Maybe I don't feel that great because I'm getting angry again talking about it. CHRIS: You vented a little into the yelling channel; maybe not everything, though. STEPH: Yeah, I still have more to vent because it's made life hard. Every time I want to push up a change or pull down someone else's changes, there are now all these CRs that I just have to go and abandon, which is the terminology for essentially closing it and ignoring it, so I'm constantly going through. And if I do want to pull in changes or collaborate, then there's a flow where either I abandon mine, or I pull in their changes, but then I have to squash everything because if you push up multiple commits to Gerrit, it's going to split those commits into different CRs; don't like that. So there are a couple of things that have been pain points. And yeah, so plus-one for yelling channels; let people get it out. CHRIS: Okay, so definitely some feelings that you are working through here. I'm happy to work together as a team to get through some of them. One thing that I want to touch on is you very quickly hinted that GitHub has got a bunch of new things that are cool. I want to talk about those. But I want to touch [laughs] on an anecdote. You talked about pushing something up to someone else's branch. You're like, oh, you know, I made some changes locally, and I'm going to push them up. I had an interesting experience once where I was interacting with another developer. I had done some code review. They weren't quite understanding where I was. They had a lot of questions. And finally, I said, you know what? This will just be easier. Here, I pushed up a commit to your branch, so now you can see what I'm talking about. And I thought of this as a very innocuous act, but it was not interpreted that way. That individual interpreted it in a very aggressive sort of way; it was not taken well. And I think part of that was related to the fact that I think of Git commits as just these little ephemeral things where you're like, throw it out, feel free. This is just the easiest way for me to communicate this change in the context of the work that you're doing. I thought I was doing a nice favor thing here. That was not how it went. We had a good conversation after I got to the heart of where we both were emotionally on this thing. It was interesting. The interaction of emotion and tech is always interesting. But as a result, I'm very, very careful with that now. I do think it's a great way as long as I've gotten buy-in from the person beforehand. But I will always spot check and be like, "Hey, just to confirm, I can just push up a commit to your branch, but are you okay with that? Is that fine with you?" So I've become very cautious with that. STEPH: Yeah, that feels like one of those painful moments where it highlights that with the people you work with, you're accustomed to having a certain level of trust, or default trust, with those individuals, and then you're working with someone else who doesn't have that yet, where the cup is only half-full in terms of that trust, or that this-person-means-well kind of feeling towards a colleague or towards someone that they're working with. So it totally makes sense that it's always good to check and just be like, "Hey, I'd love to push up some changes to your branch. Is that cool?" And then once you've established that, then that just makes it easier. But I do remember that happening, and yeah, that was a bit painful and shocking because we didn't see that coming and then learned from it.
CHRIS: I do think it's an important thing to learn, though, because for me, in that moment, this was this throwaway operation that I thought almost nothing of, but then another individual interpreted it in a very different way. And that can happen, that can happen across tons of different things. And I don't even want to live in the idealized world where it's just tech; we're just pushing around zeros and ones; there's no human to this. But no, I actually believe it's a deeply human thing that we're doing here. It's our job to teach the computers to be a little closer to us humans or something like that. And so it was a really pointed clarification of that for me, where it was this thing that I didn't even think once about, no less twice, and yet someone else interpreted it in such a different way. So it was a useful learning situation for me. STEPH: Yeah, I totally agree. I think that's a really wise default to have to check in with people before assuming that they'll be comfortable with something that we're comfortable with. CHRIS: Indeed. But shifting back to what you mentioned of GitHub, a bunch of new stuff came in GitHub, and you were super excited about it. And then you went on to say other things about another system. [laughs] But let's talk about the great things in GitHub. What are the particular ones that have caught your eye? I've seen some, but I'm intrigued. Let's compare notes. STEPH: So this is one of those where I hadn't seen GitHub in quite a while, and then I hopped in, and I was like, this is different. But one of the things that did stand out to me right away is that on the left-hand side, I can see all of the files that have been changed, and so that's a really nice tree where I just immediately know. Because that was one of the things that I often did going to a PR: I would see what files are involved in this change because it was just a nice overview of what part of the application am I walking through? Are there tests for this? Have they altered or added tests? And so I really like that about it. I'm sure there's other stuff. But that is the main thing that stood out to me. How about you? CHRIS: Yeah, that sidebar file tree is very, very nice, which I find surprising because I don't use a file tree in my editor. I only do fuzzy finding to jump to files. But I think there's something about whenever GitHub had the file list; these are all the files that are changed. I'm like, this is just noise. I can't look at this and get anything out of it. But the file tree is so much more...there's a shape to it that my brain can sort of pattern match on. And it's just a much more discoverable way to observe that information. So I've really loved that. That was a wonderful one. The other one that I was surprised by is GitHub's semantic code analysis; that stuff has gotten much, much better over time, subtly. I didn't even notice this happening. But I was discussing something with someone today, and we were looking at it on GitHub, and I just happened to click on an identifier, and it popped up a little thing that says, "Oh, do you want to hop to the references or the definition of this?" I was like, that is what I want to do. And so I hopped to the definition, hopped to the definition of another thing, and was just jumping around in the code in a way that I didn't know was available. So that was really neat.
But then also, I was in a pull request at one point, and someone was writing a spec, and they had introduced a helper, just, like, stub something at the bottom of a given spec file. And it's like, I feel like we have this one already. And I just clicked on the identifier. I think it might have actually been a matcher in RSpec, so it was like, have alert. And I was like, oh, I feel like we have this one, a matcher specific to flash message alerts on the page. And I clicked on it, and GitHub provided me a nice little inline dialog that showed me all of the definitions of have alert, which I think we were up to like four of them at that point. So it had been copied and pasted across a couple of different files, which I think is totally fine and a great way to start, but they were very similar implementations. I was like, oh, looks like we actually already have this in a couple of places; maybe we clean it up and extract it to a common spec support thing, and ta-da, I was able to do all of that from the GitHub pull request UI. And I was like, this is awesome. So kudos to the GitHub team for doing some nifty stuff. Also, can I get into the merge queue? Thank you. ... STEPH: [laughs] There it is. That is very cool. I didn't know I could do that from the pull request screen. I've seen it where if I'm browsing code, then I can see a snippet of where everything's defined and then go there, but I hadn't seen that from the pull request. I did find the changelog from GitHub that talks about the introduction of the file tree, so we'll be sure to include a link in the show notes for that too. But yes, thank you for letting me use our podcast as a yelling channel. It's been delightful. [laughs] Mid-roll Ad Hi, friends, and now a quick break to hear from today's sponsor, Scout APM. Scout APM is an application performance monitoring tool that's designed to help developers find and fix performance issues quickly. With an intuitive user interface, Scout will tie bottlenecks to source code so you can quickly pinpoint and resolve performance abnormalities like N+1 queries, slow database queries, and memory bloat. Scout also recently implemented external service monitoring, adding even more granularity when it comes to HTTP requests and API calls. So give Scout a try today with a free 14-day trial and experience first-hand why developers worldwide call Scout their best friend. And as an added bonus for Bike Shed listeners, Scout will donate $5 to the open-source project of your choice when you deploy. To learn more, visit scoutapm.com/bikeshed. That's scoutapm.com/bikeshed. CHRIS: Well, speaking of podcasts, actually, there was an interesting thing that happened where the CEO of Sagewell Financial, the company of which I am the CTO, Sam Zimmerman is his name, went on the Giant Robots Podcast with Chad a couple of weeks ago. So that is now available. We'll link to that in the show notes. I'll be honest; it was a very interesting experience for me. I listened to portions of it. If we're being honest, I searched for my name in the transcript, and it showed up, and I was like, okay, that's cool. And it was interesting to hear two different individuals that I've worked with either in the past or currently talking about it.
But then also, for anyone that's been interested in what I'm building over at Sagewell Financial and wants to hear it from someone who can probably do a much better job of pitching and describing the problem space that we're working in, and all of the fun challenges that we have, and that we're hopefully living up to and building something very interesting, I think Sam does a really fantastic job of that. That's the reason I'm at the company, frankly. So yeah, if anyone wants to hear a little bit more about that, that is a very interesting episode. It was a little weird for me to listen to personally, but I think everybody else will probably have a normal experience listening to it because they're not the CTO of the company. So that's one thing. But moving on, I feel like today's going to be a grab bag episode or tapas episode, lots of small plates, as we were discussing as we were prepping for this episode. But to share one little thing that happened, I've been a little more removed from the code of late, something that we've talked about on and off in previous episodes. Thankfully, I have a wonderful team that's doing an absolutely fantastic job moving very rapidly through features and bug fixes and all those sorts of things. But also, I'm just not as involved even in code review at this point. And so I saw one that snuck through today that, I'm going to be honest, I had an emotional reaction to. I've talked myself down; we're fine now. But the team collectively made the decision to move from a line length of 80 characters to a line length of 120 characters, and I had some feelings. STEPH: Did you fire everybody? [laughs] CHRIS: No. I immediately said, doesn't really matter. This is the whole conversation around auto-formatting tools is like we're just taking the decision away. I personally am a fan of the smaller line length because I like to have multiple files open left to right. That is my reason for it, but that's my reason. A collective of the developers that are frankly working more in the code than I am at this point decided this was meaningful. It was a thing that we could automate. I think that we can, you know, it's not a thing that we have to manage. So I was like, cool. There we go. The one thing that I did follow up on I was like, okay; y'all snuck this one in, it's fine, I'm fine with it. I feel fine; everything's fine. But let's add that to the git-blame-ignore-revs file, which is a useful thing to know about. Because otherwise, we have a handful of different changes like this where we upgrade Prettier, and suddenly, the manner in which it formats the files changes, so we have to reformat everything at once. And this magical file that exists in Git to say, "Hey, ignore this revision because it is not relevant to the semantic history of the app," and so it also takes that decision out of the consideration like yeah, should we reformat or not? Because then it'll be noisy. That magical file takes that decision away, and so I love that. STEPH: I so love the idea because you took vacation recently twice. So I love the idea of there was a little coup and people are applauding, and they're like, while Chris is on vacation, we're going to merge this change [laughs] that changes the character line. And yeah, that brings me joy. Well, I'm glad you're working through it. Sounds like we're both working through some hard emotional stuff. [laughs] CHRIS: Life's tricky, is all I'm going to say. STEPH: I am curious, what prompted the 80 characters versus 120? 
This is one of those areas that's like, yeah, I have my default preference, like you said. But I'm more intrigued just when people are interested in changing it and what goes with it. So do you remember one of the reasons that 120 just suited their preferences better? CHRIS: Frankly, again, I was not super involved in the discussion or what led them to it. STEPH: [laughs] CHRIS: My guess is 120 is used...I think 80 is a pretty common one. I think 120 is another of the common ones. So I think it's just a thing that exists out there in the mindshare. But also, my guess is they made the switch to 120 and then reformatted a few files that had, like, ah, this is like 85 characters, and that's annoying. What does it look like if we bump it up? And so 120 provided a meaningful change of, like, this is a thing that splits to four lines if we have an 80-character limit, or it's one line if it's 120 characters, which is a surprising thing to say, but that's actually the way it plays out in certain cases because the way Prettier will break lines isn't just put stuff on the next line always. It's got to break across multiple lines, actually. All right, now that we're back in the opinion space, I have a strong one. STEPH: This is The Bike Shed. We can live up to that name. [laughs] CHRIS: So I do want an additional configuration in Prettier Ruby. This is the thing I'll say. Maybe I can chase down Kevin Newton and see if he's open to this. But when Prettier does break a method call with arguments going into it but no parens on that method call, and it breaks out to multiple lines, it does the dangling indent thing, which I do not like. I find it distasteful; I find it noisy, the shape of the code. I'm a big fan of the squint test. I know that from Sandi Metz, I believe, or maybe it's Avdi Grimm. I associate it with both of them in my mind. But it's just a way to look at the code and kind of squint, and you see the shape of it, and it tells you something. And when the lines break in that weird way, and you have these arbitrary dangling indents, the shape of the code is broken up. And I don't feel so strongly. I actually regularly stop myself from commenting on pull requests on this because it's very easy. All you need to do is add explicit parens, and then Prettier will wrap the line in what I believe is a much more aesthetically pleasing, concise, consistent, lots of other good adjectives here that are definitely just my preferences and not facts about the world. But so what I want is: Prettier, hey, if you're going to break this line across multiple lines, insert the parens. Parens are no longer optional for breaking across multiple lines; parens are only optional within a given line. So if we're not breaking across lines, I want that configuration, because this is now one of those things where I could comment on this. And if they added the optional parens, then Prettier would reformat it in a different way. And I want my auto-formatter to not give me ways to do stuff. Like, constrain me more, but also within the constraints of the preferences that I have, please, thank you. STEPH: I love all the varying levels there [laughter] of you want a thing, but you know it's also very personal to you, and how you're walking that line and hopping back and forth on each side. I also love the idea. We have the idea of clean code. I really want something that's called distasteful code now [laughs] where you just give examples of distasteful code, yes.
Well, I wish you good luck in your journey [laughs] and how this goes and how you continue to battle. I also appreciate that you mentioned, when you're reviewing code, how you know it's something that you really want, but you will refrain from commenting on that. I just appreciate when people have that filter to recognize, like, is this valuable? Is it important? Or, like you said, how can we just make this more of the default so then we don't even have to talk about it? And then lean into whatever default the team goes with. CHRIS: Well, thank you. I very much appreciate that because, frankly, it's been very difficult. STEPH: I do have something I want to yell about but in a very positive way, or pranting, as we determined, or, you know, raving, the actual real term that wonderful listeners pointed out to us. CHRIS: Prant for life. That's my stance. STEPH: We had a magic show at thoughtbot. It was all remote, but the wonderful Gregg Tobo, the magician, performed a magic show for us where we all showed up on Zoom. And it was interactive, and it was delightful, and it was so much fun. And so if you need something fun for your team where you just want to bring folks together, highly recommend. I had no idea I was going to enjoy a magic show this much, but it was a lot of fun. So I'll be sure to include some links in the show notes in case that interests anyone. But yeah, magic. I'm doing jazz hands. People can't see it, but magic. I like how you referred earlier, saying that today is more of like a tapas episode. And I'm realizing that all of my tapas are related to being pregnant, yelling, and magic shows, and I'm okay with that. [laughs] But on that note, what else is on your tapas plate? CHRIS: Actually, a nice positive one that came into the world...I always like when we get those. So this is interesting because I was actually looking back at the history, and I had Gary Bernhardt on The Bike Shed back in Episode 269. We'll include a link in the show notes. But we talked a bunch about various things, including TypeScript. And I was lamenting what I saw as a pretty big edge case in TypeScript. So the goal of TypeScript is like, all right, JavaScript exists, this is true. What can we do on top of that? Let's not fundamentally change it, but let's build a type system on top of it and try and make it so that we can enforce correctness but understand that JavaScript is a highly dynamic language and that we don't want to overconstrain and that we've got to meet it where it is. And so one of the design decisions early on with TypeScript is if you have an array and you say, like, it's an array of integers, so you have typed that array to be, this is an array of int, or it will be an array of number in JavaScript because JavaScript doesn't have integers; they only have numbers. Cool. [laughs] Setting aside other JavaScript vagaries here, you have an array of numbers. And so if you use element access to say, like, say the name of the array is arrayOfNums, and then use brackets and you say zero, so get me the first element of that array, TypeScript will infer the type of that to be a number. Of course, it's a number, right? You got an array of numbers; you take a number out of it; of course, you're going to have a number. Except you know what's also an array of numbers? An empty array. Well, of course. So there's no way for TypeScript to know, because that's a runtime thing, whether or not the array is full of things. Or imagine you get the third element from the array.
Well, JavaScript will either return you the third element, which indeed is a number, or undefined because there's no third element in this array. So that is an unfortunate but very understandable edge case that TypeScript was like, listen, this is how JavaScript works. So we're not going to…frankly, we don't think the people embracing TypeScript and bringing it into their world would accept this amount of noise because this is everywhere. Anytime you interact with an array, you are going to run into this, this sort of uncertainty of did I actually get the thing? And it's like, yeah, no, I know how many things are in the array that I'm working with. Spoiler, you maybe don't is the answer. And so, we ran into this edge case in our codebase. We were accessing an element, but TypeScript was telling us, "Yes, definitively, you have an object of that type because you just got it out of an array, which is an array of that type." But we did not; we had undefined. And so we had, you know blah is not a method on undefined or whatever that classic JavaScript runtime error is. And I was like, well, that's very sad. But now we get to the fun part of the story, TypeScript, as of version 4.1, which came out like the week that I recorded with Gary Bernhardt, which was interesting to look at the timeline here. TypeScript has added a new configuration. So a new strictness dial that you can configure in your tsconfig called noUncheckedIndexedAccess. So if you have an array and you are getting an element out of it by index, TypeScript will say, "Hey, you got to check if that's undefined," because to be clear, very much could be undefined. And I was so happy to find this. We turned it on in our codebase. It found the error in the place that we actually had an error and then found a few others that I think probably had errored at some times. But it was just one of those for me very nice things to be able to dial up the strictness and enforce correctness within our codebase, and so I was very happy about it. Other folks may say that seems like too much work. And, you know, I get that, I get that take. I'm definitely on the side of I'm willing to go through the effort to have enforced correctness, but you know, that's a choice. STEPH: Yeah, that's thoughtful. I like that, how you said you can dial up the strictness so then as you are introducing TypeScript, then people have that option. There is an argument there in the back of my head that's like, well, if you're introducing types, then you want to start more strict because then you're just creating problems for yourself down the road. But I also understand that that can make things very difficult to then introduce it to teams in existing codebases. So that seems like a really nice addition where then people can say, "Yeah, no, I really want the strictness. This is why I'm here," and then they can turn that on. CHRIS: So TypeScript in the configuration has strict mode, so you say strict true. And that is a moving target with each new version of TypeScript. But it's their sort of [inaudible 28:14] set of things that are part of strict, but apparently, this one's not in it. So now I'm like, wait, can I have a stricter? Can I have a strictest option? Can I have dial it to 11, please? [laughs] Really rough me up and make sure my code is correct. But it is the sort of thing like when we turn any of these on; it will find things in our codebase. Some of them, we have to appease the compiler even though we know the code to be correct. 
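A minimal sketch of the edge case and the flag being described here; noUncheckedIndexedAccess is a real compiler option as of TypeScript 4.1, while the array and variable names are invented for illustration:

```typescript
// tsconfig.json excerpt: { "compilerOptions": { "noUncheckedIndexedAccess": true } }

const nums: number[] = []; // an empty array is still a number[]

// Without the flag, `first` is inferred as `number`, even though at
// runtime it is undefined -- the empty-array edge case described above.
const first = nums[0];

// With noUncheckedIndexedAccess on, `first` is `number | undefined`,
// so TypeScript forces a check before you can treat it as a number:
if (first !== undefined) {
  console.log(first.toFixed(2)); // narrowed to `number`, safe to call
}
```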
But the code is not provably correct as it sits in our file. So I am, again, happy to make that exchange. And I like that TypeScript as a project gives us configurability. But again, I am on team where's the strictest button? I would like to push that as hard as I can and live that life. STEPH: Yeah, I like that phrasing that you just said about provably correct. That's nice. CHRIS: That's the world I want to live in, everything you own in the box to the left, which is probably correct. STEPH: [laughs] That's how that song goes. CHRIS: Yeah. This is a reference to move errors to the left, which I think I've referenced before. But now that I'm just referencing Beyoncé and not the actual article, it's probably worth referencing the article, but the idea of, like, if a user hits an error, that's not great. So let's move it back to QA, that's a little further to the left in sort of the timeline. But what if we could move it to an automated test in CI? But what if we could move it into your editor? What if we could move it even further to the left? And so, a type system tends to be sort of very far ratcheted up to the left. It's as early as possible that you can catch these. So again, to reference Beyoncé, everything you own in a box to the left. STEPH: [singing] Everything you own in the box to the left. CHRIS: Thank you for doing the needful work there. STEPH: [laughs] Mid-roll Ad And now a quick break to hear from today's sponsor, Studio 3T. When you're developing applications, it can often be a chore to work with your underlying data. Studio 3T equips you with a complete set of tools to work with MongoDB data. From building queries with drag and drop, to creating complex aggregation pipelines; Studio 3T makes it easy. And now, there's Studio 3T Free, a free edition of Studio 3T, which delivers an essential core of tools. This means you can get started, for free, with Studio 3T Free, and when you're ready, you can upgrade and enjoy even more features through Studio 3T Pro and Studio 3T Ultimate. The different editions unlock more tools and additional integrations with MongoDB, SQL, Oracle, and Sybase. You can start today by downloading Studio 3T Free, which also includes a 30-day free trial of all the features of Studio 3T Ultimate, so you can try out some of the enterprise features as well. No credit card required. To start your trial, head to studio3t.com/free that's studio3t.com/free. STEPH: I have a question for you that I'd really love to get your opinion on because I myself I'm waffling back and forth where someone brought up some really great points about a concern or just a question they had brought up around testing and i18n specifically. And I agree with the things that they're saying, but yet, there's also a part of me that doesn't, and so I'm Stephanie divided. And so, I'm trying to figure out where I stand on this. So let me dive in and give you some context; I'm going to share the statement/question that they had asked. So here we go. "One of my priorities has been I should be able to review a test without having to reference any other code. References to i18n means that I have to go over to YAML and make sure the right keys have the right values, and that seems error-prone. In some cases, a lack of a hit in the YAML defers to defaults. If the intent is to override the name of model attribute and error messages and it is coded incorrectly, the code fails silently without translating and uses the humanized attribute name, and that would go undetected. 
If libraries change structure, it might also fail silently as well, so to me, the only failsafe way is to be fully explicit in test." So this goes with the idea that if you're writing tests and then you're testing text, but it's on the screen or perhaps an email, that you're actually going to assert against that string that is shown to the user instead of referencing the i18n keys. And then that also backs up this person's idea that you really want to not have to jump around. If you're reading a test, everything you really need to know about that test should live very close by. And I really agree with that initial statement; I want everything that's very close to the test, especially if it's anywhere in that expectation line, I really want it close, so I can understand what's the expectation, what's under test, what are the inputs, what's the expected outcome. So I wholeheartedly support that idea. But yet, I am in the camp that I then will use YAML keys instead of providing that exact string because I do look at i18n as a helpful abstraction, and I want to trust that i18n is doing its job. And so that way, I don't have to provide that string that's there because then we're also choosing, okay, well, which language are we going to always use for our test? So this is the part where I feel divided. So I'm going to walk you through some of the reasons that I really support this idea and other reasons that I still use the i18n keys and then get your take on it. So there is a part of me that when I'm using the i18n YAML keys, it does make me sad because it reduces the readability in tests. Sometimes the keys are really well named where maybe it's a mailer.welcomemessage. And I'm like, okay, I understand the gist. I don't need to go see the actual string. I also think they highlighted a really good use case where if you're overriding behavior and it could default to something else, your test is still going to pass, and you don't actually know. So I could see the use case there where if you are overriding, then you want to be explicit about the string that you expect back. I also think there are some i18n messages that are fairly complex, and where then I really would like to see the string. So if you are formatting a date or a time or you're passing in just a lot of variables, then there's a chance that I do want to see how did that actually get generated for the person who's going to be reading it versus just maybe it's garbage text that came out? And I want to validate that the message that we think we're crafting is actually the one that the user is going to see. The case against actually being explicit, my biggest one is because then I do see i18n as a helpful abstraction. And I want to trust this abstraction that it's doing its job and it's doing it well. Because then if I do use explicit strings, it makes me sad if I change text from like hello to welcome, and now I have a failing test. I don't like that idea either. So I'm torn between these two worlds of it is very nice to have everything that you need in a test to be able to understand what is the expectation, but then I also lean into this abstraction and reference the i18n keys. So, Chris, with all of that, that was a bit of a whirlwind, [laughs] what are your thoughts? How do you test this stuff? CHRIS: Honestly, I'm surprised that you've got that much division in your own answer because for me, this is very obvious there's one...no, I'm kidding. This is obviously complicated. 
Similar to you, I think I'm going to have to give a grab bag of answers because I don't have a singular thought of like it is concretely this or that. I tend to go for explicit strings in tests all the way to...so like the readability of a test, and the conciseness of a test is interesting. I will often see developers extract. Say they're creating a user with a specific email, and then they log in with that email later, and then they expect something else. And so the email is referenced a few times, and they'll extract that into a variable called email. And I personally will tend to not do that. I will inline the literal string like user@example.com, and I'll do it in a few places. And I'm fine with that duplication because I like the readability of any given line that you're reading. So I will make that trade-off within tests. This is the thing I think we've talked about before, but the idea of DRY in tests is like I want to be careful applying that idea, Don't Repeat Yourself, to break apart the acronym. Those abstractions, I will use them less in tests. And so I want the explicitness, I want the readability, I want to tell a little story, all of that feels true. That said, to flip it around, one of the things that I'm hearing...so I think I'm hearing a part of this that is around well, we can fail silently because we fail symmetrically in both the implementation and our test. Then an assertion may actually match even though it's matching on a fallback. I think that's a configurable thing. I would actually want my test to raise if I'm referencing an i18n key that is not defined. Now, granted, that's different for languages. And maybe this becomes a more complex story of like in production; in a different locale, it will fail because we don't have 100% parity across all our locale files. But fundamentally, I want to make sure that it at least exists in our base, which I think typically would be en-US as the locale. I want to make sure all keys are looked up and found, and it's an error otherwise in our test. So that's a feeling. But am I misunderstanding that part of the story or how that configuration typically works?
And as I was describing the wonderful experience I had on GitHub the other day, viewing code as just static characters in a file is both true and also, I think increasingly, a limited view of it. We have editors, and we have code hosting tools that can understand semantically our code a little bit better. There's got to be like 20 different VS Code plugins that, when you hover on an i18n reference, it will do the lookup for you. That feels like a thing that exists, and if it doesn't, well, now I've nerd-sniped myself, and I got a weekend project. JK, I'm definitely not building that this weekend. But that feels like can we use that to solve this? Maybe not. But that's just another thought of where we have these limitations where it's static, like those abstractions can be useful. But if we can very quickly dereference them, then the cost of the abstraction or that separation becomes smaller, and so the pain is reduced. And I wonder if that's a way to sort of offset it. STEPH: If I can poke at that a little bit more, because I think you're touching on something that I haven't expressed or thought through explicitly, but it's the idea of, like, why do I like the abstraction? What is it that's drawing me towards using these keys? And I think it's because most of the cases, I don't care. I don't care what the string is, and so that feels nice. Like, I understand that, yes, we're referencing something. If that key didn't exist, I'm going to see a failure. So I know that there's text there, and that's why I do lean into referencing the keys instead of the text because it feels good to not have to care about that stuff. And if we do make changes to the text, then it suddenly doesn't fail, and then I have to go update a test because we added a period or added a comma. I think that's the path of more sadness for me. And my goal is always a path of least sadness. So I think that's why I lean into it [laughs], I'm guessing. Is that why you lean into it as well? Or what do you like about referencing the keys over the explicit text? CHRIS: No, I think I share your inclination there, and the reason that you're in favor of it, and I think the consistency like if we're going to use i18n, then we should lean in because it's a non-trivial thing to do like porting to i18n projects, and they're tricky. Getting it right from the first step is also tricky. If you're going to do it, then let's lean in, and thus let's use that abstraction overall. But yeah, same ideas as you. STEPH: Cool. I think that helps validate where I'm at in terms of how I rationalize about this where ultimately, I do like leaning into that abstraction. And as you'd mentioned, some of those porting projects, I haven't been on one specifically, but I've seen that they are a lot of work. And so, if we have that in our system, then we want to continue to use it. It does reduce some of the readability. Like you said, maybe there's a VS Code plugin or some way that then we can help people be able to see if they want that full context in the test and not have to jump over to YAML. But yeah, otherwise, unless it's overriding default behavior or complex, then that's what I'm going to go with is with the keys. But I really appreciate this person's very thoughtful question and approach to testing because, normally or typically, I fully agree with I want full context in the test. And this one was one of those outliers that came up for me, and I had to really think through all the feelings and the reasons that I have for those feelings.
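To make the trade-off concrete: the conversation is about Rails i18n, but the same tension appears with any translation lookup. Here is a toy, dependency-free stand-in sketched in TypeScript rather than the Ruby being discussed, with hypothetical keys and copy:

```typescript
import { strict as assert } from "node:assert";

// A toy lookup table standing in for the real locale YAML; the key and
// copy here are made up.
const translations: Record<string, string> = {
  "mailer.welcome_message": "Welcome to the app!",
};

function t(key: string): string {
  const value = translations[key];
  // Raise loudly on a missing key, rather than silently falling back --
  // the behavior Chris says he wants from his test suite.
  if (value === undefined) throw new Error(`Missing i18n key: ${key}`);
  return value;
}

const renderedEmail = `<p>${t("mailer.welcome_message")}</p>`;

// Approach 1: assert via the key. Survives copy edits, but a
// wrong-but-existing key (say, an un-overridden default) can still pass.
assert.ok(renderedEmail.includes(t("mailer.welcome_message")));

// Approach 2: assert the literal string. Readable in isolation, but it
// fails on any copy change ("hello" becomes "welcome").
assert.ok(renderedEmail.includes("Welcome to the app!"));
```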
On that note, shall we wrap up? CHRIS: Let's wrap up. The show notes for this episode can be found at bikeshed.fm. STEPH: This show is produced and edited by Mandy Moore. CHRIS: If you enjoyed listening, one really easy way to support the show is to leave us a quick rating or even a review on iTunes, as it really helps other folks find the show. STEPH: If you have any feedback for this or any of our other episodes, you can reach us at @_bikeshed or reach me on Twitter @SViccari. CHRIS: And I'm @christoomey. STEPH: Or you can reach us at hosts@bikeshed.fm via email. CHRIS: Thanks so much for listening to The Bike Shed, and we'll see you next week. ALL: Byeeeeeee!!!!!! ANNOUNCER: This podcast was brought to you by thoughtbot. thoughtbot is your expert design and development partner. Let's make your product and team a success.

Screaming in the Cloud
Diving Duckbill First into the Depths of Data with Alex Rasmussen

Screaming in the Cloud

Play Episode Listen Later Mar 17, 2022 39:59


About Alex

Alex holds a Ph.D. in Computer Science and Engineering from UC San Diego, and has spent over a decade building high-performance, robust data management and processing systems. As an early member of a couple of fast-growing startups, he's had the opportunity to wear a lot of different hats, serving at various times as an individual contributor, tech lead, manager, and executive. Prior to joining the Duckbill Group, Alex spent a few years as a freelance data engineering consultant, helping his clients build, manage and maintain their data infrastructure. He lives in Los Angeles, CA.

Links: Twitter: https://twitter.com/alexras/ Personal page: https://alexras.info Old consulting website with blog: https://bitsondisk.com

Transcript

Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.

Corey: The company 0x4447 builds products to increase standardization and security in AWS organizations. They do this with automated pipelines that use well-structured projects to create secure, easy-to-maintain and fail-tolerant solutions, one of which is their VPN product built on top of the popular OpenVPN project which has no license restrictions; you are only limited by the network card in the instance. To learn more visit: snark.cloud/deployandgo

Corey: Today's episode is brought to you in part by our friends at MinIO, the high-performance Kubernetes-native object store that's built for the multi-cloud, creating a consistent data storage layer for your public cloud instances, your private cloud instances, and even your edge instances, depending upon what the heck you're defining those as, which depends probably on where you work. Getting that unified is one of the greatest challenges facing developers and architects today. It requires S3 compatibility, enterprise-grade security and resiliency, the speed to run any workload, and the footprint to run anywhere, and that's exactly what MinIO offers. With superb read speeds in excess of 360 gigs and a 100 megabyte binary that doesn't eat all the data you've got on the system, it's exactly what you've been looking for. Check it out today at min.io/download, and see for yourself. That's min.io/download, and be sure to tell them that I sent you.

Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm the chief cloud economist at The Duckbill Group, which people are generally aware of. Today, I'm joined by our most recent principal cloud economist, Alex Rasmussen. Alex, thank you for joining me today, it is a pleasure to talk to you, as if we aren't talking to each other constantly, now that you work here. Alex: Thanks, Corey. It's great being here. Corey: So, I followed a more, I'd say traditional path for a cloud economist, but given that I basically had to invent the job myself, the more common path because imagine that you start building a role from scratch and the people you wind up looking for initially look a lot like you. And that is grumpy sysadmin, historically, turned into something, kind of begrudgingly, that looks like an SRE, which I still maintain are the same thing, but it is imperative people not email me about that. Yes, I know, you work at Google.
But instead, what I found during my tenure as a sysadmin, is that I was working with certain things an awful lot, like web servers, and other things almost never, like databases and data warehouses. Because if you screw up a web server, we all have a good laugh, the site's down for a couple of minutes, life goes on, you have a shame trophy on your desk if that's your corporate culture, things continue. Mess up the data severely enough, and you don't have a company anymore. So, I was always told to keep my aura away from the expensive spendy things that power a company. You are sort of the first of a cloud economist subtype that doesn't resemble that. Before you worked here, you were effectively an independent consultant working on data engineering. Before that, you had a couple of jobs, but you had gotten a PhD in computer science, which means, first, you are probably one of the people in this world most qualified to pass some crappy job interview of solving a sorting algorithm on a whiteboard, but how did you get here from where you were? Alex: Great question. So, I like to joke that I kind of went to school until somebody told me that I had to stop. And I took that and went and started—or didn't start, but I was an early engineer at a startup and then was an executive at another early-stage one, and did a little bit of everything. And went freelance, did that for a couple of years, and worked with all kinds of different companies—vast majority of those being startups—helping them with data infrastructure problems. I've done a little bit of everything throughout my career. I've been, you know, IC, manager, manager, manager, IT guy, everything in between. I think on the data side of things, it just sort of happened, to be honest with you, it kind of started with the stuff that I did for my dissertation and parlayed that into a job back when the big data wave was starting to kind of truly crest. And I've been working on data infrastructure, basically my entire career. So, it wasn't necessarily something that was intentional. I've just been kind of taking the opportunity that makes the most sense for me at kind of every juncture. And my career path has been a little bit strange, both by academic and industrial standards. But I like where I'm at and I gained something really valuable from each of those experiences. So. Corey: It's been an interesting area of I won't say weakness here, but it's definitely been a bit of a challenge when we look at an AWS environment and even talking about a typical AWS customer without thinking of any of them in particular, I can already tell you a few things are likely to be true. For example, the number one most expensive line item in their bill is going to be EC2, and compute is the thing that powers it. Now, maybe that is they're running a bunch of instances the old-fashioned way. Maybe they're running Kubernetes but that's how it shows up. There's a lot of things that could be, and we look at what rounds that out. Now, the next item down should almost certainly not be data transfer and if so we should have a conversation, but data in one form or another is very often going to be number two. And that can mean a bunch of different things, historically. It could mean, “Oh, you have a whole bunch of stuff in S3. Let's talk about access patterns. Let's talk about lifecycle policies. Let's talk about making sure the really important stuff is backed up somewhere.
Maybe you want to spend more on that particular aspect of it.” If it's on EBS volumes, that's interesting and definitely worth looking into and trying to understand the context of what's going on. Periodically we'll see a whole bunch of additional charges that speak to some of that EC2 charge in the form of EMR, AWS's Elastic MapReduce, which charges a per-hour instance charge, but also charges you for the instances that are running under the hood and under the EC2 line item. So, there's a lot of data lifecycle stuff, there's a lot of data ecosystem stories, that historically we've consulted out with experts in that particular space. And that's great, but we were starting to have to drag those people in on more and more engagements as we saw them. And we realized that was really something we had to build out as a core competency for ourselves. And we started out not intending to hire for someone with that specialty, but the more we talked to you, the more it became clear that this was a very real and very growing need that we and our customers have. How closely does what you're doing now, as far as AWS bill analysis and data pattern deep-dives, align with what you were doing as a freelance consultant in the space? Alex: A lot more than you might expect. You know, I think that increasingly, what you're seeing now is that a company's core differentiator is its data, right, how much of it they have, what they do with it. And so, you know, to your point, I think when you look at any company's cloud spend, it's going to be pretty heavy on the data side in terms of, like, where have you put it? What are you doing to process it? Where is it going once it's been processed? And then how is that—Corey: And data transfer is a very important first word in that two-word sequence. Alex: Oh, sure is. And so I think that, like, in a lot of ways, the way that a customer's cloud architecture looks and the way that their bill looks kind of as a consequence of that is kind of a reification in a way of the way that the data flows from one place to another and what's done with it at each step along the way. I think what complicates this is that companies that have been around for a little while have lived through this kind of very amorphous, kind of, polyglot way that we're approaching data. You know, back when I was first getting started in the big data days, it was MapReduce, MapReduce, MapReduce, right? And we quickly [crosstalk 00:07:29]—Corey: Oh, yes. The MapReduce white paper out of Google, a beautiful April Fool's Day prank that the folks at Yahoo fell for hook, line, and sinker. They wrote Hadoop, and now we're all stuck with that pattern. Great gag, they really should have clarified they were kidding. Here we are. Alex: Exactly. So—Corey: I mostly kid. Alex: No, for sure. But I think especially when it comes to data, we tend to over-index on what the large companies do and then quickly realize that we've made a mistake and correct backwards, right? So, there was this big push toward MapReduce for everything until people realized that it was just a pain in the neck to operate and to build. And so then we moved into Spark, so kind of up-leveled a little bit. And then there was this kind of explosion of NoSQL and NewSQL databases that hit the market. And MongoDB inexplicably won that war and now we're kind of in this world where everything is cloud data warehouse, right?
And now we're trying to wrestle with, like, is it actually a good idea to put everything in one warehouse and have SQL be the lingua franca on top of it? But it's all changing so rapidly. And when you come into a customer that's been around for 10 or 15 years, and has, you know, been in the cloud for a substantial—Corey: Yeah, one of those ancient customers. That is—Alex: I know, right? Corey: —basically old enough to almost get a driver's license? Oh, yeah. Alex: Right. It's one of those things where it's like, “Ah, yes, in startup years, you're, like, a hundred years old,” right? But still, you know, I think you see this, kind of—I wouldn't call it a graveyard of failed experiments, right, but it's a collection of, like, “Well, we tried this, and it kind of worked and we're keeping it around because the cost of moving this stuff around—the kind of data gravity, so to speak—is high enough that we're not going to bother transitioning it over.” But then you get into this situation where you have to bend over backwards to integrate anything with anything else. And we're still kind of in the early days of fixing that. Corey: And the AWS bill pattern that we see all the time across the board is those experiments that were not successful and do not need to exist, but there's no context into that. The person that set them up left five years ago, the jobs are still running on time. What's happening with them? Well, we could stop them and see who screams, but very often, that's not the right answer either. Alex: And I think there's also something to note there, too, which is like, getting rid of data is very scary, right? I mean, if you resize a Kubernetes cluster from 15 nodes to 10, nobody's going to look at you sideways. But if you go, “Hey, we're just going to drop these tables,” the immediate reaction that you get, particularly from your data science team more often than not is, “Oh, God, what if we need that?” And so the conversation never really happens, and that causes this kind of snowball of data debt that persists in some cases for many, many years. Corey: Yeah, in some cases, what I found has been successful on those big unknown questions is don't delete the data, but restrict access to it for a few weeks and see what happens. Look into it a bit and make sure that it's not like, “Oh, cool. We just did without it for a month, and now we don't need that data. Let's get rid of it.” And then another month goes by and it's like, “So, time to report quarterly earnings. Where's the data?” Oh, dear, that's not going to go well, for anyone. And understanding what's happening, the idea of cloning a petabyte of data so you can run an experiment on it. And okay, turns out the experiment wasn't needed. Do we still need to keep all of that? Alex: Yeah. Corey: The underlying platform advancements have been helpful toward this as well, a petabyte of data now in Glacier Deep Archive costs the princely sum of a thousand bucks a month, which is pretty close to the idea of why would I ever delete data ever again? I can get it back within a day if I need it, so let's just put it there instead. Alex: Right. You know, funny story. When I was in graduate school, we were dealing with, you know, 100 terabyte datasets on the regular that we had to generate every time because we only had 200 terabytes of raw storage. [laugh]. And this was before cloud was yet mature enough that we could get the kind of performance numbers that we wanted off of it. And we would end up having to delete the input data to make room for the output data. [laugh].
And thankfully, we don't need to do that anymore. But there are a lot of, kind of, anti-patterns that arise from that too, right? If data is easy to keep around forever, it stays around forever. And if it's easy to, let's say, run a SQL command against your Snowflake instance that scans 20 terabytes of data, you're just going to do it, and the exposure of that to you is so minimal that you can end up causing a whole bunch of problems for yourself by the fact that you don't have to deal with stuff at that low-level of abstraction anymore. Corey: It's always fun watching how this stuff manifests—because I'm dipping a toe into it from time to time—the easy, naive answer that we could give every customer but we don't is, “Huh. So, you have a whole bunch of EMR stuff? Well, you know, if you migrate that into something else, you'll save a whole bunch of money on that.” With no regard for the 500 jobs that run against that EMR cluster on a consistent basis that form a key part of business process. “Yeah, if you could just redo the entire flow of how data is operated with throughout your entire business, that would be swell because you can save tens of thousands of dollars a month on that.” Yeah, how about we don't suggest things that are just absolute buffoonery. Alex: Well, and it's like, you know, you hit on a good point. Like, one of my least favorite words in the English language is the word ‘just.' And you know, I spent a few years as a freelance data consultant, and you know, a lot of what I would hear sometimes from customers is, “Well, why don't we ‘just' deprecate X?” Corey: “Why don't we just—” “I'm going to stop you there because there is no ‘just.'” Alex: Exactly. Corey: There's always context that we cannot have as outsiders. Alex: Precisely. Precisely. And digging into that really is—it's the fun part of the job, but it's also the hard part of the job. Corey: Before we created The Duckbill Group, which was really when I took Mike Julian on as business partner and CEO and formed the entity, I had something in common with you; I was freelancing for a couple of years beforehand. Now, I know why I wound up deciding, all right, we're going to turn this into a company, but what was it that I guess made you decide to, you know, freelancing is all well and good, but it's time to get something that looks a lot more like a quote-unquote, “Traditional job.” Alex: So, I think, on one level, I went freelance because I wasn't exactly sure what I wanted to do next. And I knew what I was good at. I knew what I had a lot of experience at, and I thought, “Well, I can just go out and kind of find a bunch of people that are willing to hire me to do what I'm good at doing, and then maybe eventually I'll find one of them that I like enough that I'll go and work for them. Or maybe I'll come up with some kind of a business model that I can repeat enough times that I don't have to worry that I wake up tomorrow and all of my clients are gone and then I have to go live in a van down by the river.” And I think when I heard about the opening at The Duckbill Group, I had been thinking for a little while about well, this has been going fine for a long time, but effectively what I've been doing is I've been you know, a staff-level data engineer for hire. And do I want to do something more than that, you know? Do I want to do something more comp—perhaps more sophisticated or more complex than that?
And I rapidly came to the conclusion that in order to do that, I would have to have sales and marketing, and I would have to, you know, spend a lot of my time bringing in business. And that's just not something that I have really any experience in or I'm any good at. And, you know, I also recognize that, you know, I'm a relatively small fish in a relatively large pond, and if I wanted to get the kind of like, large scale people, the like the big, you know, Fortune 1000 company kind of customers, they may not pay attention to somebody like me. And so I think that ultimately, what I saw with The Duckbill Group was, number one, a group of people that were strongly aligned to the way that I wanted to keep doing this sort of work, right? Cultural alignment was really strong, good people, but also, you know, you folks have a thing that you figured out, and that puts you 10 to 15 steps ahead of where I was. And I was kind of staring down the barrel that, I'm like, am I going to have to take six months not doing client work so that I can figure out how to make this business sustain? And, you know, I think that ultimately, like, I just looked at it, and I said, this just makes sense to me, like, as a next step. And so here we all are. Corey: This episode is sponsored by our friends at Oracle Cloud. Counting the pennies, but still dreaming of deploying apps instead of “Hello, World” demos? Allow me to introduce you to Oracle's Always Free tier. It provides over 20 free services and infrastructure, networking, databases, observability, management, and security. And—let me be clear here—it's actually free. There's no surprise billing until you intentionally and proactively upgrade your account. This means you can provision a virtual machine instance or spin up an autonomous database that manages itself, all while gaining the networking, load balancing, and storage resources that somehow never quite make it into most free tiers needed to support the application that you want to build. With Always Free, you can do things like run small-scale applications or do proof-of-concept testing without spending a dime. You know that I always like to put asterisks next to the word free? This is actually free, no asterisk. Start now. Visit snark.cloud/oci-free that's snark.cloud/oci-free. Corey: It's always fun seeing how people perceive what we've done from the outside. Like, “Oh, yeah, you just stumbled right onto the thing that works, and you've just been going, like, gangbusters ever since.” Then you come aboard, it's like, “Here, look at this pile of things that didn't pan out over here.” And it's, you get to see how the sausage is made in a way that we talk about from time to time externally, but surprisingly, most of our marketing efforts aren't really focused on, “And here's this other time we screwed up as well.” And we're honest about it, but it's not sort of the thing that we promote as the core message of what we do and who we are. A question I like to ask people during job interviews, and I definitely asked you this, and I'll ask you now, which is going to probably throw some folks for a loop because who talks to their current employees like this? But what's next for you? When it comes time for you to leave the Duckbill Group, what do you want to do after this job? Alex: That's a great question. So, I mean, as we've mentioned before, you know, my career trajectory has been very weird and circuitous.
And, you know, I would be lying to you if I said that I had absolute certainty about what the rest of that looks like. I've learned a few things about myself in the course of my career, such as it is. In my kind of warm, gooey center, I build stuff. Like, that is what gives me joy, it is what makes me excited to wake up in the morning. I love looking at big, complicated things, breaking them down into pieces, and figuring out how to make the pieces work in a way that makes sense. And, you know, I've spent a long time in the data ecosystem. I don't know, necessarily, if that's something that I'm going to do forever. I'm not necessarily pigeonholing myself into that part of the space just yet, but as long as I get to kind of wake up in the morning, and say, “I'm going to go and build things and it's not going to actively make the world any worse,” I'm happy with that. And so that's really—you know, might go back to freelancing, might go and join another group, another company, big, small, who knows. I'm kind of leaving that up to the winds of destiny, so to speak. Corey: One thing that I have found incredi—sorry. Let me just address that first. Like that—Alex: Sure. Corey: —is the right way to think about it. My belief has always been that you don't necessarily have, like, the ten-year plan, or the five-year plan or whatever it is because that's where you're going to go so much as it gives you direction and forces you to keep moving so you don't wind up sitting in the same place for five years with one year of experience repeated five times. It helps you remember the bigger picture. Because I've always despised this fiction that we see in job interviews where average tenure in our industry is 18 to 36 months, give or take, but somehow during the interviews, we all talk like this is now your forever job, and after 25 years, you'll retire. And yeah, let's be a little more realistic than that. My question is always what is next and how can we align in a way that helps you get to what's coming? That's the purpose behind the question, and that's—the only way to make that not just a drippingly insincere question is to mean it and to continue to focus on it from time to time of, great. What are you learning? What's next? Now, at the time of this recording, you've been here, I believe three weeks if I'm not mistaken? Alex: I've—this is week two for me at time of recording. Corey: Excellent. Yes, my grasp of time is sort of hazy at the best of times. I have a—I do a lot of things. Alex: For sure. Corey: But yeah, it has been an eye-opening experience for me, not because, “Oh, wow, we have an employee.” Yeah, we've done that a few times before. But rather because of your background, you are asking different questions than we typically get during onboarding. I had a blog post go out recently—or will be by the time this airs—about a question that you asked about, “Wow, onboarding into our internal account structure for AWS is way more polished than I've ever seen it before. Is that something you built in-house? What is that?” And great. Oh, terrific, I'd forgotten that this is kind of a novel thing. No. What we're using is AWS's SSO offering, which is such a well-built, polished product that I can only assume that it's under NDA because Amazonians don't talk about it ever.
But it's great. It has a couple of annoyances, but beyond that, it's something that I'm a big fan of, but I'd forgotten how transformative that is, compared to the usual approach of all right, here's your username, here's a password you're going to have to change, here are your IAM credentials to store on disk forever. The ability to look at what we're doing through the eyes of someone who is clearly deep into the technical weeds, but not as exposed to all of the minutiae of the 300-some-odd AWS services, is really a refreshing thing for all of us, just because it helps us realize what it's like to see some of this stuff for the first time, as well as gives me content ideas because if it's new to you, I promise you are not the only person who's seeing it that way. And if you don't really understand something well enough to explain it, I would argue you don't really understand the thing, so it forces me to get more awareness around exactly how different facets work. It's been an absolutely fantastic experience so far, from my perspective. Alex: Thank you. Right back at you. I mean, spending so many years working with startups, my kind of level of expected sophistication is, “I'm going to write your password on the back of a napkin. I have fifteen other things to do. Go figure it out.” And so you know, it's always nice to see—particularly players like AWS that are such 800-pound gorillas—going in and trying to uplevel that experience in a way that feels like—because I mean, like, look, AWS could keep us with the, “Here's a CSV with your username and password. Good luck, have fun.” And you know, they would still make—Corey: And they're going to have to because so much automation is built around that—Alex: Oh yeah—Corey: In so many places. Alex: —so much. Corey: It's always net-additive, they never turn anything off, which is increasingly an operational burden. Alex: Yeah, absolutely. Absolutely. But yeah, it's nice to see them up-level this in a way that feels like they're paying attention to their customers' pain. And that's always nice to see. Corey: So, we met a few years ago—in the before times—at a mixer that we wound up throwing—slash meetup. It was in Southern California for some AWS event or another. You've been aware of who we are and what we do for a while now, so I'm very curious to know—and the joy of having these conversations is that I don't actually know what the answer is going to be, so this may never see the light of day if it goes too weird—Alex: [laugh]. Corey: —in the wrong direction, but—no I'm kidding. What has been, I guess, the biggest points of dissonance or surprises based upon your perception of who we are and what we do externally, versus joining and seeing how the sausage is made? Alex: You know, I think the first thing is—um, well, how to put this. I think that a lot of what I was expecting, given how much work you all do and how big—well, ‘you all;' we do—and how big the list of clients is and how it gets bigger every day, I was expecting this to be, like, this very hyper put together, like, every little detail has been figured out kind of engagement where I would have to figure out how you all do this. And coming in and realizing that a lot of it is just having a lot of in-depth knowledge born from experience of a bunch of stuff inside of this ecosystem, and then the rest of it is kind of free jazz, is kind of encouraging. Because as someone that was you know, as a freelancer, right, who do you see, right?
You see people who have big public presences or people who are giant firms, right? On the GCP side, SADA Systems is a great example. They're another local company for me here in Los Angeles, and—Corey: Oh, yes. [unintelligible 00:24:48] Miles has been a recurring guest on the show. Alex: Yeah. And he's great. And, like, they have this enormous company that's got, like, all these different specializations and they're basically kind of like the middleman for GCP on a lot of things. And, like, you see that, and then you kind of see the individual people that are like, “Yeah, you know, I'm not really going to tell you that I only have two clients and that if both of them go away, I'm screwed, but, like, I only have two clients, and if both of them go away, I'm screwed.” And so, you know, I think honestly seeing that, like, what you've built so far and what I hope to help you continue to build is, you know, you've got just enough structure around the thing so that it makes sense, and the rest of it, you're kind of admitting that no plan ever survives contact with the client, right, and that everybody's going to be different and that everybody's problems are going to be different. And that you can't just go in and say, “Here's a dashboard, here's a calculator, have fun, give me my money,” right? Because that feels like—in optimization spaces of any kind, be that cloud, or data or whatever, there's this, kind of, push toward, how do I automate myself out of a job, and the realization that you can't for something like this, and that ultimately, like, you're just going to have to go with what you know, is something that I kind of had a suspicion was the case, but this really made it clear to me that, like, oh, this is actually a reasonable way of going about this. Corey: We thought otherwise at one point. We thought that this was something that could be easily addressed through software. We launched our DuckTools SaaS platform in beta and two months later, did the—our incredible journey has come to an end, and took it off of a public offering. Because it doesn't lend itself to solving these problems in software in any reasonable way. I am ever more convinced over time that the idea of being able to solve cloud cost optimization with software at VC-scale is a red herring.
Maybe it's a small proof of concept script or something like that. Maybe it's, I don't know, an interpretive dance in front of the company's board.Alex: [laugh]. Right.Corey: I'm open to exploring opportunities. But it comes down to what is right for the customer. There's a reason we only ever charge a fixed fee for these things, and it's because at that point, great, we're giving you the advice that we'd implement ourselves. We have no partnerships with any vendor in the space just to avoid bias or the perception of same. It's important that we are the authoritative source around these things.Honestly, the thing that surprised me the most about all this is how true to that vision we've stayed as we've as we flushed out what works, what doesn't. And we can distantly fail to go out of business every month. I am ecstatic about that. I expected this to wind up cratering into a mountain four months after I went freelance. Not yet.Alex: Well, I mean, I think there's another aspect of this too, right? Because I've spent a lot of my career working inside of venture capital-backed companies. And there's a lot of positive things to be said about having ready access to that kind of cash, but it does something to your business the second you take it. And I've been in a couple of situations where, like, once you actually have that big bucket of money, the incentive is grow, right? Hire more people get more customers, go, go, go, go, go.And sometimes what you'll find is that you'll spend the time and the money on an initiative and it's clearly not working. And you just kind of have to keep doubling down because now you've got customers that are using this thing and now you have to maintain it, and before you know it, you've got this albatross hanging around your neck. And like one of the things that I really respect about the way that Duckbill Group is is handling this by not taking outside cash is, like, it frees you up to make these kinds of bets, and then two months later say, “Well, that didn't work,” and try something else. And you know, that's very difficult to do once you have to go and convince someone with, you know, money flowing out of their ears, that that's the right thing to do.Corey: We have to be intentional about what we're doing. One of the benefits of bringing you aboard is that one, it does improve our capacity for handling more engagements at the same time, but it also improves the quality of the engagements that we are delivering. Instead of basically doing a round-robin assignment policy we can—Alex: Right.Corey: —we consult with each other; we talk about specific areas in which we have specific expertise. You get dragged into a lot of data portions of existing engagements, and the rest of us get pulled into other areas in which you might not be as strong. For example, “What are all of these ridiculous services? I can't make heads or tails have the ridiculous naming side of it.” Surprise, that's not a you problem.It comes down to being able to work collaboratively and let each other shine in a way that doesn't mean we load people up with work. We're very strict about having a 40-hour or less work week, just because we're not rushing for an exit. We want to enjoy our time working, we want to enjoy what we're doing, and then we want to go home and don't think about work until it's time to come back and think about these things. 
Like, it's a lifestyle company, but that lifestyle doesn't need to be run, run, run, run, run all the time, and it doesn't need to be something that people barely tolerate. Alex: Yeah. And I think that, you know, especially coming from being an army of one in a lot of engagements, it is really refreshing to be able to—see because, you know, I'm fortunate enough, I have friends in the industry that I can go and say like, “I have no idea how to make heads or tails of X.” And you know, I can get help that way, but ultimately, like, the only other outlet that I have here is the customer and they're not bringing me in if they have those answers readily to hand. And so being able to bounce stuff off of other people inside of an organization like this has been really refreshing. Corey: One of the things I've appreciated about your tenure here so far is the questions that you ask are pitched at the perfect level, by which I mean, it is never something you could answer with a three-second visit to Google, but it's also not something that you've spent three days spinning your wheels on trying to understand. You do a bit of digging; it's a little unclear, especially since there are multiple paths to go down, and then you flag it for clarification. And there's really so much to be said for that. Really, when we're looking for markers of seniority in the interview process, it's admitting you don't know something, but then also talking about how you would go about getting the answer. And it's—because no one has all this stuff in their head. I spend a disturbing amount of time looking at search engines and trying to reformulate queries and to get answers that make sense. I don't have the entirety of AWS shoved into my head. Yet. I'm sure there's something at re:Invent that's going to be scary and horrifying that will claim to do it and basically have a poor user interface, but all right. When that comes, we'll reevaluate then because this industry is always changing. Alex: For sure. For sure. And I think it's, it's worth pointing out that, like, one of the things that having done this for a long time gives you is this kind of scaffolding in your head that you can hang things over. We're like, you don't need to have every single AWS service memorized, but if you've got that scaffold in your head going, “Oh, like, this thing sounds like it hangs over this part of the mental scaffold, and I've seen other things that do that, so I wonder if it does this and this and this,” right? And that's a lot of it, honestly. Because especially, like, when I was solely in the data space, there's a new data wareho—or a new, like, data catalog system coming out every other week. You know, there are a thousand different things that claim to do MLOps, right? And whenever, like, someone comes to me and says, “Do you have experience with such and such?” And the answer was usually, “Well, if you hum a few bars, I can fake it.” And, you know, that tends to help a great deal.
That doesn't necessarily mean that you are the all-seeing, all-knowing oracle of knowledge but, like, if you say a thing, people are just going to believe you. And so, you know, it's beholden on you—Corey: If not, we have a different problem. Alex: Well, yeah, exactly. Hopefully, right? But yeah, I mean, it's beholden on you to be honest with your customer at a certain point, I think. Corey: I really want to thank you for taking the time out of your day to chat with me about this. And I would love to have you back on in a couple of months once you're fully up to speed and spinning at the proper RPMs and see what's happened then. I—Alex: Thank you. I'd—Corey: —really appreciate—Alex: —love to. Corey: —your time. Where's the best place for people to learn more about you if they haven't heard your name before? Alex: Well, let's see. I am @alexras on Twitter, A-L-E-X-R-A-S. My personal website is alexras.info. I've done some writing on data stuff, including a pretty big collection of blog posts on the data side of the AWS ecosystem that are still on my consulting page, bitsondisk.com. Other than that—I mean, yeah, Twitter is probably the best place to find me, so if you want to talk more about any weird, nerd data stuff, then please feel free to reach out there. Corey: And links to that will, of course, be in the [show notes 00:35:57]. Thanks again for your time. I really appreciate it. Alex: Thank you. It's been a pleasure. Corey: Alex Rasmussen, principal cloud economist here at The Duckbill Group. I am Corey Quinn, cloud economist to the stars, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an angry, insulting comment that you then submit to three other podcast platforms just to make sure you have a backup copy of that particular piece of data. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. Announcer: This has been a HumblePod production. Stay humble.

Podlodka Podcast
Podlodka #258 – Distributed Computing

Podlodka Podcast

Play Episode Listen Later Mar 7, 2022 68:12


We have already discussed how distributed systems work, but we deliberately avoided the problems that get solved on top of them. Time to fix that! We talked about what MapReduce is, why it is needed, and what other models of distributed computation exist (a minimal sketch of the model follows this entry). Egor Khairullin from Yandex helped us with this. How Zolotoe Yabloko ("Golden Apple") pivoted to developing e-commerce and won industry leadership within a year: https://incrussia.ru/specials/goldapple Open positions on the IT team: https://career.habr.com/companies/goldapple/vacancies Support the best podcast about IT: www.patreon.com/podlodka We also look forward to you, your likes, reposts, and comments in messengers and social networks!
Telegram chat: https://t.me/podlodka Telegram channel: https://t.me/podlodkanews Facebook page: www.facebook.com/podlodkacast/ Twitter account: https://twitter.com/PodlodkaPodcast Hosts of this episode: Evgeny Katella, Katya Petrova
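[Editor's note: for readers who want a concrete picture of the model discussed in this episode, here is a minimal, self-contained sketch of the MapReduce idea - map each record to key/value pairs, group by key, then reduce each group. It illustrates only the programming model, rendered in TypeScript; real frameworks distribute the map and reduce tasks across many machines.]

// Minimal in-process MapReduce sketch: word count.
type Pair<K, V> = [K, V];

function mapPhase(doc: string): Pair<string, number>[] {
  // Map: emit (word, 1) for every word in the document.
  return doc.toLowerCase().split(/\W+/).filter(Boolean).map(w => [w, 1] as Pair<string, number>);
}

function shuffle<K, V>(pairs: Pair<K, V>[]): Map<K, V[]> {
  // Shuffle: group all values by key, as the framework would between phases.
  const groups = new Map<K, V[]>();
  for (const [k, v] of pairs) {
    const bucket = groups.get(k) ?? [];
    bucket.push(v);
    groups.set(k, bucket);
  }
  return groups;
}

function reducePhase(key: string, counts: number[]): Pair<string, number> {
  // Reduce: sum the per-word counts.
  return [key, counts.reduce((a, b) => a + b, 0)];
}

const docs = ["the cat sat", "the cat ran"];
const grouped = shuffle(docs.flatMap(mapPhase));
const result = [...grouped.entries()].map(([k, vs]) => reducePhase(k, vs));
console.log(result); // [["the", 2], ["cat", 2], ["sat", 1], ["ran", 1]]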

Elixir Outlaws
Episode 107: Nineteen-Something Cats

Elixir Outlaws

Play Episode Listen Later Jan 20, 2022 41:17


The Elixir Outlaws now have a Patreon (https://www.patreon.com/user?u=5332239). If you're enjoying the show then please consider throwing a few bucks our way to help us pay for the costs for the show. Elixir Outlaws, 01/19/2021 On today's episode of the Elixir Outlaws, Sean Cribbs and Amos will talk about WASM (WebAssembly) to implement some core parts of the app, and discuss the server side too. Rust's for loop syntax is sugar for iterators, says Sean. So, you can also sort of do a method-chaining type thing in Rust. There is an interesting proposal on the Elixir forum for loops. Episode Highlights FOR loops are not loops; they are a special form in the compiler, basically a macro with special privileges that generates some code, says Sean. As per Sean, if you have a let, then you have a for loop variable, and you have to return a two-tuple that has the accumulator as the second element, or if you don't, then it is just the accumulator, which becomes quite confusing. Amos says that when one uses map_reduce and has a FOR loop and you want to step through something and maybe at the same time get a count and a sum, and you want to adjust the current values, then we are getting three things out in adjustment and then two other data points. It is hard to step away from an imperative mindset when you have done it forever, and adding imperative things back into the language is going to make it even harder for people to step outside of that imperative mindset, says Amos. People use monads to write things that look more imperative because it is easier for us to think that way sometimes, and it's going to create less maintainable code. OCaml is very much in the same syntactical flavor as Haskell, but it doesn't have that whole lazy-evaluation thing that gets so confusing at times, and it also has a much simpler type system. Sean had tried hard to encourage his coworkers to use things in the lists module, or use list comprehensions, or use fold wherever they could. But some people liked writing recursive functions that had a bunch of arguments to them. There is a trailing reduce option you can put on for loops: it is like a reduce given an initial value of the accumulator, and then you match the accumulator coming in. Using Enum.reduce, there are some syntactical forms that represent something, and the special form will generate code, but they are not things in themselves. The things that are browser-based we can't rewrite completely in Rust. There is always going to be JavaScript at some level. Sean has seen attempts where people want to have JavaScript running the same code on the front end as in the back end. In graphics production, there are many things about memory allocators, but these are all sorts of things that we might have to think about with Rust when we are trying to implement. 3 Key Points With map_reduce in the Enum module, you can do something on each element of the collection, but you are also collecting something about the entire collection as you flow through, says Sean (see the TypeScript sketch after the quotes below). The flip side of list comprehension is that you can only do so many things in the right side of the comprehension. It is explicit what you are returning from the expression because the entire expression is inside the list brackets. The biggest thing that Amos has ever had to deal with when working on stuff on the front end, or on edge computing, is that if you don't control the resource at the endpoint, it may be the slowest thing ever, and it may not work that well. 
Tweetable Quotes “I am not against pipes, and you can write non-imperative code with pipes, but it looks imperative.” – Amos “In a FOR loop, FOR is an expression that returns a value. You can choose to ignore that value that's returned, but it returns a value, usually a list.” – Sean “You can have only one let, which is the other thing that's a little bit surprising. It works in if statements.” – Sean “Being able to have code on the back end, you control the hardware and the performance.” – Amos “The book Kill It with Fire I wish I had picked up a year ago, because the author worked for the US Digital Service, updating mainframe applications, and there's a lot of sage advice in that book.” – Sean Resources Mentioned: Elixir Outlaws: Website
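[Editor's note: the map_reduce pattern discussed above - transforming each element while threading an accumulator, such as a running count and sum, through the same pass - is Elixir's Enum.map_reduce/3. Below is a rough sketch of the same idea, rendered in TypeScript to keep all code on this page in one language; it mimics the concept, not Elixir's actual API.]

// Sketch of the Enum.map_reduce idea: map every element AND thread an
// accumulator through in one pass.
function mapReduce<T, U, A>(
  items: T[],
  acc: A,
  fn: (item: T, acc: A) => [U, A], // returns [mapped element, new accumulator]
): [U[], A] {
  const mapped: U[] = [];
  for (const item of items) {
    const [out, next] = fn(item, acc);
    mapped.push(out);
    acc = next;
  }
  return [mapped, acc];
}

// Double each value while simultaneously collecting a count and a sum,
// the "count and a sum at the same time" case Amos describes.
const [doubled, stats] = mapReduce(
  [3, 5, 8],
  { count: 0, sum: 0 },
  (n, { count, sum }) => [n * 2, { count: count + 1, sum: sum + n }],
);
console.log(doubled); // [6, 10, 16]
console.log(stats);   // { count: 3, sum: 16 }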

Reversim Podcast
424 Melio's payment processor

Reversim Podcast

Play Episode Listen Later Oct 25, 2021


[Link to the mp3 file] Hello and welcome to episode 424 of Reversim Podcast - a palindromic number, how cool!... (Ori) 424?!... (Ran) 424... Today's date is October 17th, it's almost 21:30, and the year is 2021. (Ori) And the temperatures started dropping today; there was even rain... (Ran) It rained today, right, finally... Today we have the honor of hosting Ilan and Or from Melio - is it Mi-lio or Me-lio?... (Ilan) That's actually a very good question, because when we founded the company we really did call it Me-lio, but when we started talking to people in the US, they told us there's such a thing as a long e and a short e, which we weren't familiar with... so some pronounce it "Mi-lio" and some "Me-lio"... as far as we're concerned it's "Mi-lio". (Ori) That was probably whichever domain was free... (Ilan) Actually the domain that was free was melio(payments).com, but as we managed to grow and decided we were completely at peace with the name - there was a whole process there too, but that's a story for another podcast - we bought melio.com, which was a bit expensive but we managed to get it: four letters and a .com, quite an event... (Ori) OK, so that was Ilan... (Ran) Yes, we're doing a reversed podcast today - starting from the end... So - we're happy and honored to host Ilan and Or from Melio - we'll talk about Melio and about the payments platform and the technology you built to actually make all of this happen. But before that - let's get to know you. Ilan, please: (Ilan) Thank you very much for having us - a great honor. I've known Ori and Ran for a few years; it's an honor for us to come on the podcast. I'm Ilan, one of the co-founders and the CTO of Melio - we officially founded the company about three and a half years ago; a bit before that we were still working in a garage, figuring out what we wanted to do. I'll tell a bit more about that shortly. Fairly intense years, these recent ones. Before that I was the VP Engineering at a company called Windward, and before that I worked at Outbrain for two-plus years... With me here is Or... (Ran) Or - welcome! (Or) Thanks a lot - today I'm the Principal Engineer at Melio; I joined pretty much at the very beginning. Before that I was a co-founder at another startup, and before that I was a consultant at a small, let's call it "DevOps boutique", called FewBytes. After I left my startup as a co-founder - I was a really not-good co-founder - someone matched me up with Ilan, and honestly, right from the first conversations with Melio's founders it was... I can say personally that it was a kind of "love at first sight"; I was completely taken with them and said "I want to work here", and the rest is pretty much history... (Ran) So Melio - I assume some have already heard the name, but for those who haven't yet: what does Melio do? (Ilan) We developed, and are developing, a platform for small businesses for moving payments. However astonishing it may or may not be to some of the listeners, payments - mainly in the US - are still mostly carried on pieces of paper, namely checks... On the order of 18 trillion dollars move between small businesses in the US every year; on the order of five billion checks are written between businesses. That's... the order of magnitude we saw four years ago, and we said "wait - this makes no sense". In a world where digital payments happen between friends - that is, moving money between friends & family today happens very simply; there are lots of apps through which you can move money super easily. If you're now a consumer who wants to check out online, the process is very advanced - the whole eCommerce world. You can check out with Stripe, with a credit card, with Affirm or with Klarna or with any other checkout method. And still, payments to suppliers are mostly moved over pieces of paper - checks, bank transfers - through tools that are outside... essentially system tools, mainly the banks' tools. (Ran) Even at venture-capital funds they say "I'll write you a check"... (Ilan) "I'll write you a check", yes... and we too, at Melio - we write checks today... we actually have suppliers who only want to receive checks.
And this problem looked quite interesting and very challenging to us - we said, "how can it be that, in a world where payments are shifting to digital so massively, the world of supplier payments still sits on pieces of paper"... (Ran) What kinds of businesses are we talking about? Hair salons, veterinarians...? (Ilan) We're talking about almost every kind of business - as you said: hair salons, veterinarians, restaurants, doctor offices, professional services, photographers... (Ori) Goldman Sachs? (Ilan) Goldman Sachs... them too, the big ones... and indeed, companies like Nike or the Fortune 500s have tools today for both procurement and payments. But when you go to the small business - what's called an owner-operated business - the business owner today has no suitable tool for managing his supplier payments. And what characterizes those businesses is that they have no bookkeeper or accounts-payable expert who does the payments for them. At Melio today, or for you at Outbrain, there's a finance department that handles accounts payable. But if I'm a small business, if I own a restaurant with fifteen employees - whoever handles this is usually either me or some trusted employee. And an average small business today - the businesses we're targeting, 5-10 employees, on the order of 1-2 million dollars in revenue a year - sends out on the order of 50-60 payments a month. And this "event" is usually heavy... usually done manually... (Ran) We'll dive into the technology story there in a moment, but still, to understand the background a bit - does a person just wake up one morning and feel he has to build a payments system? I mean - how does it happen that a nice Jerusalem boy, Ilan, one of the company's founders, decides he feels like standing up a payments system for businesses in the US? (Ilan) What mainly drew us was the size of the opportunity - we said: wait, small businesses - sorry for the cliché, but they're the backbone of the economy... At the end of the day, the way they operate - both at the level of the operations of sending the payments out, and the operations also drag... inefficient operations lead to very poor cash-flow management for the business. Small businesses - if you look at the reasons businesses usually close - some give poor service or a poor product, but very often it stems from them not knowing how to manage the "cash-flow event" properly. And often that happens because of a lack of operational capability and of understanding what actually needs to go out today, what can go out tomorrow, and what can be managed more cleverly. Really, what fired us up, what made us say it, was: how can we help small businesses? - by taking their payments world into the digital world, and above all managing the cash flow. (Ran) OK, so I'm no great expert on the finance world, but I do know there are a few big companies and big payment providers - you mentioned Stripe, I think, and there are all sorts of other big ones... (Ori) ...between the CRM and the financial management - CRM is more toward the customer side... (Ran) ...right; let's set the business story aside for a moment - I assume there's a reason Stripe doesn't fit them, but you also decided to create an internal payments system, that is, to manage everything yourselves.
Why do that, and why not use some third party - some bank, Stripe, or anything else like that? (Ilan) Before I answer that question, let me take a step back for a moment - the reason businesses today mostly operate with checks for supplier payments is, usually, the basic disagreement between how one side wants to pay and how the other side wants to receive the money. Today, when I go and check out online, and there's some checkout there with Stripe - I can pay by credit card, and the other side will receive it into their bank account. There's a piece of "real estate", the point of sale, that can process my credit card - and the other side gets the money. In the B2B world, most of the transactions - the lion's share of the transactions - happen OTC, over the counter. There's essentially no point of sale today - neither for the purchase nor for the payment - and the point of sale that does exist is the invoice. When I order, for example, ten kilos of salmon from my fish supplier for the restaurant - together with the fish I get an invoice, and that's where I'm supposed to pay for the fish on some net terms. Now - I'm a fish supplier who's already been in the market 20 years, and I now receive money from hundreds of customers - and at some level I don't necessarily want to support yet another payment method, because the whole finance process I built, or the whole reconciliation process I built, is built on top of checks that come to me. I know how the money arrives and how to tie it to the matching invoice. But that restaurant that just opened, a new restaurant, doesn't necessarily want to pay with checks - it wants to pay by credit card, it wants to pay by bank transfer... The two sides don't agree on the means of payment. (Ori) ...and then you go down to the lowest common denominator - checks... (Ilan) Exactly... and that's how you arrive precisely at the lowest common denominator, which is checks. A check - clearly it's accepted everywhere, clearly it's a "joker card"; you can just hand it over, and it's essentially a kind of standard... (Ori) It's paper - you can wrap fish with it... (Ran) There's lower still - there's cash... but we haven't gone down that far. There are gold coins... (Ori) Paper... (Ran) So you essentially decided you're building a kind of transpiler - something that translates digital to paper, paper to digital, or whatever other translations exist... (Ilan) Exactly. And to your question of why we built a payments infrastructure - to get to where we can come and serve those businesses, we had to create a sort of "non-dependence" between the sides - a decoupling between the payer and the receiver. So we built a new payments infrastructure, today already on top of three banks - Evolve Bank & Trust, Silicon Valley Bank, and JPMorgan Chase. We built the ability to come and collect money from the payer any way we want - it can be a credit card, it can be a debit card, it can be a bank, it can be PayPal, it can be Apple Pay... we can pull money in any possible way - and send it out on the other side in any way the other side desires. We essentially created a disconnect between the two sides - which today gives us a lot of power to come to a business - to a restaurant or some photographer or a hair salon or anywhere else - and say: "OK, it doesn't matter now; you don't need to convince the other side how to receive the money - give it to them in whatever way they wish, and you pay however you want".
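[Editor's sketch: to make the decoupling Ilan describes concrete, here is a minimal way a payment record might keep the payer's funding method and the receiver's delivery method as independent dimensions. All the type and field names here are our own illustration, not Melio's actual data model.]

// Illustration of payer/receiver decoupling: the funding side and the
// delivery side are independent dimensions of one payment.
type FundingSource =
  | { kind: "ach"; routingNumber: string; accountNumber: string }
  | { kind: "card"; network: "visa" | "mastercard"; last4: string };

type DeliveryMethod =
  | { kind: "ach"; routingNumber: string; accountNumber: string }
  | { kind: "paper-check"; payee: string; mailingAddress: string };

interface Payment {
  id: string;
  amountCents: number;
  scheduledFor: Date;       // payments are scheduled, then batched at cut-off
  funding: FundingSource;   // how the payer chooses to pay
  delivery: DeliveryMethod; // how the receiver chooses to get paid
}

// A restaurant pays its fish supplier by credit card;
// the supplier still receives a paper check.
const payment: Payment = {
  id: "pmt_001",
  amountCents: 42_000,
  scheduledFor: new Date("2021-10-18"),
  funding: { kind: "card", network: "visa", last4: "4242" },
  delivery: { kind: "paper-check", payee: "Acme Fish Co.", mailingAddress: "1 Dock St, Portland, ME" },
};
console.log(payment.funding.kind, "->", payment.delivery.kind); // card -> paper-check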
(Ori) Is there also, like, "the Melio way" - the... I don't know, a card or some bit-style thing [bit being a popular Israeli payments app]... an app that's a clearing app, so that if it suits both sides, great, but you can also receive and... through it? (Ilan) So the way we today... in the example I gave, say I'm a restaurant - I can process; I can now choose to pay by credit card, I can choose to pay via the bank, a bank transfer - and you'll receive it however you wish, say checks or a bank transfer or any other way. We do create... if I understand your question correctly, we will essentially create a kind of wallet, so that once both sides are in the network, we can move money - a wallet everyone can use within the network itself. Now, one of the additional things we created in this payments infrastructure is that, unlike systems like bit or Pepper, or in the US Venmo or PayPal, where both sides need to be in the network for one side to be able to pay the other - we created the capability of what we call an open network: only one side needs to be on the network for the other side to receive the money. By doing so, we took the burden off whoever currently uses us of convincing the other side to join. (Ran) Right. So you decided, and understood, that you want to offer a terribly flexible payments system that's open, where you can pay however you want and you can receive the money however you want - and you come to Or, "the poor programmer", and tell him: "Or, come build me one of these!"... How do you start? What are the challenges here? How do you even begin building such a payments system from scratch? (Ori) So Or issues him an invoice... (Or) So really, Melio is a bit more than a payments system, obviously - a very large part of the system is based on a really convenient interface - because it's small businesses, because they don't have all that much time now to mess around with some complex business-grade system of the kind that usually targets enterprises. But in essence, what I proposed to Ilan when we started was - I said, "Ilan, listen - we are now on the verge of Serverless; we have an opportunity not to maintain servers! We have an opportunity to enjoy the benefits..." (Ran) ...and this from someone who has already maintained plenty of servers, as you said in the intro... (Or) Exactly - what was in my head was that I never, ever want to configure NTP again in my life... so I told Ilan: "let's do Serverless! Let's go for it and see if it works for us." [one of ours!] And Ilan went along with me... at first he gave me a face of "you think? are you sure?", but we said: "come on, let's go for it"... (Ran) "It's not hype? Not like GraphQL, which will pass soon?..." (Or) Exactly... he really had a few doubts at the beginning, and I told him: "look - count on me! Whatever doesn't work, we'll fix." And then we really built the system - our payments processing actually runs serverless. Honestly, a very large part of Melio's infrastructure runs only on Serverless, only on Lambda - specifically on Lambda. (Ran) And the motivation really is "I don't want this NTP headache", or are there also other, architectural reasons? (Or) It's very... (Ori) ...it's very stream-oriented, right? It's processing of streams of data, and that sounds fitting... (Ilan) So that's a very important point, what you said just now - at the end of the day, payments mostly go out either as bank transfers on one side, or as checks... we still send out checks - Melio today sends out on the order of hundreds of thousands of checks every month, because suppliers still want to receive checks... The payment-collection process really is stream-oriented; that is, I can enter the system and schedule a payment. I can schedule it for now, I can schedule it for today or for tomorrow or for a month out - but at the end of the day, all or most of the payments converge on a single point in time. At the end of the day, to move money by bank transfer or by check is actually batch-oriented; that is, everything concentrates at one point, because the banks ultimately work with cut-offs... That is, when I want to move money from point A to B, there's the bank's cut-off. The bank's cut-off is at 23:00 or 24:00 Central Time in the US, and at that point we take all the payments that were scheduled for today, or scheduled for whatever date we want - and send them out. Which means the system receives events, receives instructions, as a stream - but at the end of the day it all funnels down to one point, where we have to upload that file, that ledger, to the bank, in order to execute the various payments - or to send out checks, or... (Ori) I understand that the nature of these transactions is one that requires almost no state... (Or) Right - so I have to admit that at first the motivation was very much "manage as little as possible", and slowly, over time - actually pretty quickly - we saw this plays in our favor in other cases too, because we have the side... You have to understand that this market is terribly convenient, because...
in certain respects it's very convenient and, let's call it, "privileged" for us as a business - because these are businesses, so they work 09:00-17:00; that's most of the load we have on the system. They don't work on Saturday, they don't work on Sunday. The banks don't work on Saturday and don't work on Sunday - so we don't do processing on those days. We have the very great privilege of operating the system only at certain times, and even within those days - only at certain hours. (Ran) If only it were on Israel time, it would be ideal... (Or) Yes, that would be perfect... So in that sense, Serverless helped us a lot, because if we take for a moment just the payments processing - then 90% of the day it's 0; nothing happens... maybe there are all kinds of management and logistics tasks running in the background, but apart from that - nothing. And then, on a certain trigger during the day, the system starts working, does all the processing it needs to do - and goes back to sleep. (Ori) It reminds me a bit of the sitcom "Ramzor", where he teaches a dance at an accountant's office - "What happens all month? Nothing-nothing-nothing... the 15th of the month?! Ohh...". (Ran) So you're saying that Lambda functions' nice ability to scale up immediately and afterwards shut down to nearly zero is one architectural advantage... By the way, regarding the state you mentioned here - at least the way I imagine it, in payments I'd actually imagine there's a whole lot of state, only it always has to be persistent; that is, it's never in-memory, because you must not lose it... So maybe it's not right to say "no state exists", but the state always has to be persistent... (Or) Right - the state, in our case, let's say... we're not "classic Serverless", let's call it. Our state sits on a transactional database; our transactions are inside the Lambda - obviously also part of the processing of what the Lambda does. Part of what we do against the banks is actually part of the transaction that happens against the database; that is - we treat the Lambda as completely volatile - if it dies, nothing happens, in the sense that "no money will move anywhere". And the state itself really stays in the database. (Ran) What do the APIs to those banks look like? I remember the last time I did some payment integration, it was some horrific CORBA with Perl and things like that... what's the situation today? (Or) So really, payments against banks is an entire story of its own that could be told... Let me spell out the acronym ACH - it's Automated Clearing House, which is automation for some antique thing called a clearing house... (Ran) ...where nothing is automatic... (Or) ...and the automation... I'll give two anecdotes, but basically it's a file with a great many records inside. You put it on some server on the other side of the world - and it's a "black hole"... There's nothing - no request/response model... there is a certain response, but it doesn't exactly tell you "ah yes - we just transferred everything!" - you only know after a few days which of the records failed. Whatever didn't fail - succeeded... that's roughly the model of this thing's protocol. (Ran) Presumably... (Or) "Presumably"... exactly. Now, as we were working, you tell yourself: OK, this is a model where... there's some computer there going over the records one by one; the processing passes them onward and returns to us what failed and what didn't fail. And then we discovered that one of the transactions that came back as failed - we saw some metadata inside, which we keep in order to map it afterwards to our internal transactions and so on - and on our side it started with, say, a lowercase "t" and a very long number. And a transaction came back that we didn't recognize - we identified it, because when we looked at it with the naked eye it was an uppercase "T" and a very long number... And then we understood that somewhere in the chain of banks there's simply some human being who simply typed "T"... a computer doesn't confuse "t" with "T"; they're two entirely different things. But a human typing a T into some Excel or email or whatever - apparently it flipped once to an uppercase T, and from there it stayed uppercase and came back to us with a capital letter... (Ran) That was the day the coffee spilled on their Shift key... (Or) Something like that, exactly...
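[Editor's sketch: the cut-off pattern described above - scheduled payments stream in all day, then get flushed as one ledger file at the bank's cut-off, with all durable state in a transactional database and the Lambda treated as volatile - might look roughly like this. The db/uploadToBank interfaces and the table operations are hypothetical stand-ins, not Melio's code.]

// Sketch: stateless cut-off job. All durable state lives in the database;
// the function itself can die and be retried without double-sending money.
interface DueRow { id: string; amountCents: number; deliveryKind: string }

interface Tx {
  selectDuePayments(cutoff: Date): Promise<DueRow[]>;
  markSubmitted(ids: string[], batchId: string): Promise<void>;
}
interface Db {
  transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T>;
}

async function runCutoffBatch(db: Db, uploadToBank: (lines: string[]) => Promise<void>) {
  const cutoff = new Date(); // in practice: 23:00/24:00 Central, banking days only
  await db.transaction(async (tx) => {
    const due = await tx.selectDuePayments(cutoff);
    if (due.length === 0) return;

    // Build one ledger file (ACH-style: one record per payment).
    const batchId = `batch-${cutoff.toISOString()}`;
    const lines = due.map((p) => `${p.id}|${p.amountCents}|${p.deliveryKind}`);

    // The upload is part of the same DB transaction, per Or's description:
    // the "black hole" gives no per-record ack, so we commit the
    // submitted-state only after the upload succeeds.
    await uploadToBank(lines);
    await tx.markSubmitted(due.map((p) => p.id), batchId);
  });
}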
So this system is sort of semi-automatic, because things are triggered automatically - but there's a great deal of human work in there. And the errors that come back are, many times, human work too - all kinds of mappings where they look at the owner of the bank account; many times it's by name... they literally map it by the name, and many times they don't find the number... There are really a lot of processes, and I'd say maybe - even if the system is sort of... the inputs it gets from the machine - we treat them as human inputs as well, to make sure we really didn't slip up in that case either. (Ori) But wait - earlier, Ilan talked about you sending out tons of checks. That's, like... are there actually printers printing paper? Serverless until further notice, but printers... (Ilan) And how... (Ran) Surely there's an Amazon service that prints checks, no?... (Ilan) (a) Correct - there is an Amazon service that prints checks [?], but we use the banks' services that print checks. Or talked about the ACH file, which is an encoded file you send in order to execute bank transfers. There's a file with a different, slightly more modern format, in JSON, that you upload to the banks and they send out checks. We essentially give the bank an instruction - you need to get this through by a certain time, the file itself - telling them "here are the details" - and on the other side there are printers, and checks actually get printed... Then they move on, go into envelopes, pass to USPS - and reach their destination... (Ori) I have to say that, like... say at Outbrain, when we make payments to suppliers - and it's a great many suppliers, twice a month - we work with some horrendous system called Masav, know it? (Ilan) [presumably nodding in despair] (Ran) Israeli? (Ori) Yes, the "Banks' Clearing Center" or something like that... (Or) It's quite reminiscent of the structure of ACH, in a way - from anecdotes I've heard... (Ori) I personally would rather go back to checks, after the experience with this Masav... like, you upload some Excel file there; it's apparently the same thing... horrible. (Ran) Safer than paying employees' insurance, which usually doesn't arrive either, but never mind... (Ori) True... but there's also... sometimes they get swapped... the names get swapped between the sides because everything is in Hebrew and you have to read it like Greek, and it's... (Ran) So are you a payments expert these days, Or? Do you spend your days and nights deciphering files like these? (Or) So I - let's say, in quotes, "luckily" - there's a much larger team dealing with it now. I did it for a fairly long stretch; I... it's like in The Matrix, where he sees the code and knows what's behind it without looking at the pictures? It's the same - I look at an ACH file and I know - this is the number of that thing, this structure is a name... (Ran) If it's an uppercase "T" then Rachel was working today; if it's "t" then... (Or) Yes... with checks too, by the way, it's very... again, checks are a human process - they're sent by mail, so they get lost. There are also things like... for example, when we started, we sent envelopes in the wrong color... we sent purple check envelopes, Melio's purple... and we discovered there are people who simply set the check aside and do nothing with it, because they think it's advertising... So we changed it to white - and suddenly people did deposit the checks... There are all sorts of things... it really is this mix of a human process and a process where we generate things automatically, put a JSON or not-JSON somewhere over an API - and in the end, people on the other side have to take an action, and that makes everything far more complex. (Ori) There's something that interests me - you talked about someone coming and making a payment, receiving on one side and paying out on the other - and you're used to the transaction closing, right? Now, "closing" means "the money moved", I know, but it's not exactly like that - you... the money didn't move; you only passed the file to someone else's processing, or the check is in the mail... and there's no feedback loop.
(Ilan) There's no feedback loop, that's true. And in certain cases, as Or said - in bank transfers, in ACH, the protocol works by saying "as long as I haven't come back to you, everything is fine, and if I came back to you with an error, here are the things that failed". But we did build a system - we built a system because, at the end of the day, we're moving money at a pace of tens of billions of dollars a year, we have tens of thousands of customers, and everything must balance. The SLA is very, very important - at the end of the day we must... we can't afford a dollar not reaching one side, or a payment or two getting dropped, because at the end of the day these are businesses whose money didn't reach their suppliers, and there's a whole lot of relationship between the business and the supplier there. So we built systems that sit outside the payment processing, which verify that the books are balanced. They go into the banks, take files that we... files that are outside the transactions, files that come to us - in order to balance the books, in order to see that everything we sent out... we then query the bank, and then we understand... everything we sent to the bank as an instruction - when we afterwards query the bank, we verify the bank really sent it out. All the FinOps systems that we... (Ran) I'd guess, by the way, that this is significant added value, beyond the technical ability to move a payment - verifying it balances, verifying things went through; I'd guess that's added value... I can say, again - going back to the insurance-company anecdote - I remember a meeting with an insurance agent who promised me "here, there's a computer that checks!". So I asked him, "what, sometimes there's no computer checking?", and he told me, "no... at the other companies it's people; with me it's a computer!". So welcome to the twentieth century... (Ori) But you're saying "I execute the transactions, and afterwards I have a sort of sweeper that goes over them and checks that all the transactions really" - "the bank really paid this" - and that's what actually closes the transaction. (Ilan) It comes out as a report in our system. We run checks on our side already during the upload of the files - there too we can have failures of our own. There are many Lambdas running, there are many files; we do a MapReduce-style process that goes line by line over the files and opens them in the Lambdas we have. At the end of the day we need to confirm that everything we read from the database goes up into the banks - even before we can close the transaction at all. So there, too, we developed a capability where we came and said we don't wait - since there are no errors and no feedback... (Ori) It's not that there are no errors - there are no error messages... (Ilan) There are no error messages, exactly - so during the file-upload process, we constantly check what we uploaded against what was written in the database - because the process is external and separate - to make sure things are balanced. Only at the stage after that is there a process that queries the banks and checks what we actually uploaded, and then sees that everything balances. (Ori) Down to a lowercase "t" and an uppercase "T"... (Ilan) ...which only Or catches, yes... (Or) There's really... you could say we do reconciliation at several different levels, from several different checkpoints within the process - immediately after we move the money, a few days later, when the bank informs us after the fact what succeeded and what didn't, and later as well, against the bank's final balance that we see... We try to really get the complete picture, because again, as Ilan said - we can't allow a situation where, because of us, our customer doesn't pay some other bill, because then he's again in trouble with his supplier. He now has cash-flow problems... for him it's 100% - one payment, for him... for us one payment is a fraction of a fraction of a percent - for him it's 100% of the things he's dealing with. (Ori) Sometimes it also hurts his credit ratings, or like... credit score. (Or) It can... (Ilan) It's the relationship with the supplier... it's the relationship with the supplier, where he tells him "the check is on its way" - and it isn't really on its way, and then the relationship between them may be damaged.
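[Editor's sketch: the multi-checkpoint reconciliation idea above - diffing what our database says was submitted against what the bank's statement says actually moved, since the rails return no per-record acknowledgements. The record shapes and the case-insensitive normalization, a nod to the t/T anecdote, are illustrative assumptions, not Melio's implementation.]

// Reconciliation sketch: the bank returns no per-record acks, so we
// diff our own ledger against the bank's statement after the fact.
interface LedgerEntry { ref: string; amountCents: number }

function reconcile(ours: LedgerEntry[], bank: LedgerEntry[]) {
  // Normalize refs case-insensitively: a human in the chain once turned
  // a lowercase "t" prefix into "T", as in the anecdote above.
  const norm = (ref: string) => ref.toLowerCase();
  const bankByRef = new Map(bank.map((e) => [norm(e.ref), e]));

  const missingAtBank: LedgerEntry[] = []; // we sent it, bank never moved it
  const amountMismatch: LedgerEntry[] = [];

  for (const entry of ours) {
    const match = bankByRef.get(norm(entry.ref));
    if (!match) missingAtBank.push(entry);
    else if (match.amountCents !== entry.amountCents) amountMismatch.push(entry);
    bankByRef.delete(norm(entry.ref));
  }
  const unknownAtBank = [...bankByRef.values()]; // bank moved something we don't know

  return { missingAtBank, amountMismatch, unknownAtBank };
}

const report = reconcile(
  [{ ref: "t0001", amountCents: 42_000 }, { ref: "t0002", amountCents: 9_900 }],
  [{ ref: "T0001", amountCents: 42_000 }],
);
console.log(report.missingAtBank); // [{ ref: "t0002", amountCents: 9900 }]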
(Ran) What else does the technology story look like? I mean - does the very fact that you're dealing with this domain, finance, have technological implications - for instance, what languages do you write in? What does security mean for you? What other technological implications are there? (Or) In terms of languages, we're pretty "standard", let's call it that, at least in today's industry. We work in JavaScript, also a bit of Python in various places within the system - but by and large, most of the system is written in Node. That works for us simply because Lambda and Node are very... let's call it "native" in the runtime. We didn't try to get too clever there - we test ourselves as much as possible. In terms of security, again - Lambda plays relatively well into our hands in this case: there's no server... there isn't even a port to break into. It doesn't exist, as a concept... For compliance too, by the way - everything related to Serverless helped us a lot. When we went through compliance - we've already been through two processes - and simply, there are companies... (Ori) Who is the body that asks you for compliance? Compliance with whom? (Or) ISO 27001 compliance... (Ori) Which is more financial, or...? (Or) That one's more Europe's, if I'm not mistaken... (Ilan) The European one is mainly security, and now we're in the process - finishing it - of SOC 2 Type 2. Who demands these regulations of us is, first, our partners - the banks we work with, the rails over which we move the money - and partners. Melio, in the end... we haven't touched on this yet, but let's come back to the technology part in a moment - Melio has two main product lines. The first line is a stand-alone experience. The second line is the platform - "the ability to embed the experience in someone else's real estate". Our biggest partner today is Intuit, with QuickBooks - they actually put our capabilities inside QuickBooks. And a partner receiving a financial service wants to know that we're well-secured. (Ran) So you said... why does it shrink the attack surface?... (Or) We skipped over that... even within compliance there are entire sections on port management and all sorts of things like that, at the level of the machines and the servers, which we get to simply skip... It shortens the compliance time a lot, surprisingly... it surprises the other side, the one doing our review - how much we cut out. It was very convenient in that context. (Ran) Although, you know - I assume this compliance will move with the times and adjust, and discover that with Serverless, for that matter, you simply need to check other things... No more open ports, OK... no more file descriptors, but there are other things... (Or) There are dependencies, there's static code analysis... there are still plenty of APIs exposed out to the world, obviously... (Ran) I gather compliance hasn't gotten there yet... (Or) We try to take care of as much as possible ourselves, because again - compliance is important to us because it's important to our partners; it's important for how we're seen. It's important to us that nothing ever happens to us, to protect all of our customers - so there's the aspect of whether we feel enough responsibility to do this. Yes... So in this context, the use of Lambda and of Serverless in general - I want to say a word for a moment about Serverless - I always hear "Serverless, Serverless"... When we started dealing with it, I was less interested in it being serverless; I even often called it "management-less"... There is a server; it exists - there's a Lambda, it's a server; there's an instance; we have a connection to the database that we reuse; there's RAM where we hold all sorts of things; there's CPU... there's everything. For us it essentially behaves a bit like a server that runs code at checkpoints - only we, as it were, don't manage it. So on the spectrum, we do look at it this way: Lambda is the holy grail of management-less; below it we have Fargate, we have this and that... so we're not pure-Serverless - we use whatever suits us at a given point in time. (Ran) How does this affect the development experience? I mean - if I'm now fixing some bug in a service that's presumably one part among 70 other components - how do I develop it? How do I test it?
(Or) Good question - that's one of the first things I, too, thought about when we said "let's do Serverless". So we currently have two approaches - one that's a bit more legacy inside the company, and a newer one that we're starting, so to speak, to deploy internally. The first approach, which still operates in a large share of the services - what we did with it, essentially... the services themselves had a very specific internal structure; they looked like some web application, and there was a small wrapper that arranged the whole infrastructure around it, so it would call into the routing inside the web application. If it's an event from SQS, then it goes against some route with a fake payload, which is actually the payload from SQS, and all sorts of other things in that style; if there's S3, it goes against some payload from S3. And that lets us actually run this thing inside Lambda as usual, with events and listeners and everything... (Ran) And on commit you produce some container that wraps it... (Or) Not even a container - we went with literally npm start... simply, in every project there were two scripts: one suited to Lambda, and the second a server with some wrapper. When developers worked locally, their service talked directly to the cloud; we didn't work with RabbitMQ locally and SQS in the cloud - DynamoDB in the cloud, and Redis locally. Simply everything - every developer has a complete infrastructure, a "skeleton" of the infrastructure, without the compute. They simply choose whatever service they want, npm start - and it starts "playing" against the infrastructure, against the relevant SQS, against the relevant DynamoDB. The RDS, in this case MySQL, still locally. That was the first approach - it worked fairly well; we ran with it quite a long time. Now we've become a bit more of a Lambda powerhouse, and we're moving to a somewhat different approach - we work with SAM today. SAM is the Serverless competitor - it's "AWS's serverless.com"... The idea is that it generates CloudFormation templates for us; we deploy them as a complete package, as a complete stack. And then, once you have such a stack... in essence, every developer with us has, say, a private AWS account; right now... we're still in a sort of, call it, POC to verify that the feasibility of this is really fine. Every developer has an AWS account - inside there's essentially a mini-production of Melio - whatever service she wants to run there: her email service, the payments processing too, everything... And then, if she wants to develop a particular Lambda, we wrote a tool of our own that essentially takes over that Lambda and moves the compute over to her machine. And then she can set breakpoints, locally - it literally runs on her machine... (Ran) Like Telepresence in the Kubernetes world... (Or) Exactly - just with fewer games. Fewer games with ports, fewer games with networking - just take the message, send it to the machine, do the compute... because the AWS resources are available in any case - SQS is available by an API call and SNS is available by an API call - so the compute running locally on the machine "talks to the cloud as if it were in the cloud". So the Telepresence, in this sense, is just moving the messaging to the right place in - let's call it "the global worldwide network" - to the specific machine where it lives right now (a rough sketch of the idea appears below). (Ran) So a new developer who joins you - we're already nearing the end, and this may be the last question - a new developer who joins, who has never experienced Serverless and never experienced the concept - how easy or hard is it for them, in your estimation, to get into the right mindset of Serverless, Stateless, and so on? (Or) I admit it's a challenge... During the onboarding period of the developers, we try to instill it as a kind of mindset of "we live on Lambda", with the challenges - whatever comes, we'll deal with it. By and large, we've reached a state where there are already very many engineers working with it, so when someone joins, there's the... let's call it support, the company's internal ecosystem, that knows how to help. I can say the payments processing folks are amazing in this respect - they've truly adopted it completely and they take it all the way. Even with the pitfalls and the challenges it has - they go with it and run with it forward really nicely.
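[Editor's sketch: a rough rendering of the local-development idea Or describes - the cloud resources (here SQS) stay real, while the compute runs on the developer's laptop by polling the queue and invoking the same handler Lambda would run, wrapped in a Lambda-shaped event. The queue URL and handler here are placeholders; Melio's internal takeover tool is certainly more involved than this.]

// Sketch: run a Lambda SQS handler locally against a real cloud queue.
// The handler code is untouched; only the "compute" moves to the laptop.
import { SQSClient, ReceiveMessageCommand, DeleteMessageCommand } from "@aws-sdk/client-sqs";

const QUEUE_URL = process.env.QUEUE_URL!; // placeholder: the developer's own queue
const sqs = new SQSClient({});

// The same handler that runs in Lambda in production.
async function handler(event: { Records: { body: string }[] }) {
  for (const record of event.Records) {
    console.log("processing payment event:", record.body);
  }
}

async function pollLoop() {
  for (;;) {
    const res = await sqs.send(new ReceiveMessageCommand({
      QueueUrl: QUEUE_URL,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 20, // long polling
    }));
    for (const msg of res.Messages ?? []) {
      // Wrap the raw message in the Lambda SQS event shape and invoke locally;
      // breakpoints now work because this is an ordinary local process.
      await handler({ Records: [{ body: msg.Body ?? "" }] });
      await sqs.send(new DeleteMessageCommand({
        QueueUrl: QUEUE_URL,
        ReceiptHandle: msg.ReceiptHandle!,
      }));
    }
  }
}

pollLoop().catch(console.error);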
If we have time, I wanted to touch on one more point in the Serverless context - pricing. There's a kind of mantra that "Serverless is much more expensive" [it depends... see 412 Serverless at Via], because it's essentially a premium service for running one single function. We, as the saying goes, find it - borrowing from English [we find it] - relatively... if not cheaper, then comparable to other things. There are a few reasons for that - obviously, one of the main reasons is that if we didn't run, we don't pay, but in a way... (Ori) There's no such thing as "leaving an instance up"... (Or) Exactly - no instance left up... when it is up, it's more expensive, but most of the time ours isn't up. Let's not say "most of the time" - I'm exaggerating - but a large part of the month it isn't up. And there's a very nice feature in this context: it's relatively very easy for us to do what's called unit economics (see the back-of-the-envelope sketch below), because essentially every processing run of ours - we know how much it costs; we can do some rough calculation and translate - literally from the AWS invoice - what the business activity costs, and even give forecasts based on that. And that's a very big advantage for us. (Ori) Which brings me back to a question that's been waiting since the beginning... what's the business model? I mean - are you per-transaction? Are you... (Ran) Run ads! What's the problem?... (Ilan) Recommendations, yes... In our system, at the end of the day, there are two kinds of transactions - there's what we call basic transactions, the fundamental ones - moving ACH to ACH, or ACH to check - those payments are free; essentially an engagement flywheel for the business, and for us - so the business uses us. The second kind of payments is premium payments - if a business now wants to use a credit card - well, a credit card isn't available to it, because most suppliers don't accept credit in the B2B worlds. We, thanks to the decoupling, let the business pay by credit card - and the other side will receive a check. And by that, we help the business defer a payment by another 30 or 45 days, to your next billing cycle of the credit cards. This will cost the payer 2.9%... (Ran) "Israeli credit" - net-plus terms... (Ori) "The check is in the mail"... (Ilan) And other payments that are premium services: if I now, as... if I want... ACH takes three days to land between one side and the other; that's the banking system in the US [in Israel too...]. If you now want the payment to arrive the same day, or instant - then that's a cost one of the sides can absorb in the meantime - whoever wants to accelerate the payment or to receive the payment faster. (Ran) By the way, in the banking system too - I assume you know this - there's also an option to expedite the payment for a "symbolic fee"... (Ilan) Exactly - international payouts: we're now getting into international payments - and such payments cost money; domestic wire. So we give the payments, the fundamental payments, for free - but the more premium payments actually cost, for one of the sides, depending on whom you sell it to. That's where our unit economics come from. (Ori) But is there, let's call it, "alignment" between the number of transactions you'll execute - effectively, how many Lambdas you'll use, right? - and how much money you'll make? I mean - it's a direct ratio, of a sort, but... (Ilan) It's totally like that... At the end of the day, when we measure, we look at the total volume Melio sent out that day or that month - and how much of that volume is volume Melio received revenue on. And we have a target where we come and ask: "wait, what's the ratio?" If you look, say, at online checkout - let's say checkout with Stripe - at the end of the day, when Stripe looks at 100% of the transactions, they make this or that margin, 2.5% or whatever. At Melio it works a bit differently, because there's a blend - a blend of payments that are free and payments that cost, which Melio receives revenue on. When you look at all of it, we have a target of how many "bips" out of the total TPV are, in effect, profit or revenue for Melio. (Ran) Translate for a second... bips are? (Ilan) It's basically the percentage on which we... (Ori) It's profit... (Ilan) It's the profit... it's the revenue [roughly: basis points (bps) are a common unit of measure for interest rates and other percentages in finance; one basis point equals 1/100th of 1%].
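[Editor's sketch: the unit-economics point above - translating the AWS invoice into a cost per payment - comes down to simple arithmetic. A back-of-the-envelope sketch using AWS's published on-demand Lambda rates from around the time of recording (verify current pricing) and invented traffic figures:]

// Back-of-the-envelope Lambda cost per processed payment.
// Prices are roughly the published US on-demand rates circa 2021
// (~$0.20 per 1M requests, ~$0.0000166667 per GB-second); the traffic
// figures below are made up purely for illustration.
const PRICE_PER_REQUEST = 0.20 / 1_000_000;   // USD
const PRICE_PER_GB_SECOND = 0.0000166667;     // USD

function lambdaCostPerPayment(opts: {
  invocationsPerPayment: number; // Lambda calls involved in one payment
  avgDurationMs: number;
  memoryMb: number;
}) {
  const gbSeconds =
    (opts.memoryMb / 1024) * (opts.avgDurationMs / 1000) * opts.invocationsPerPayment;
  const requestCost = opts.invocationsPerPayment * PRICE_PER_REQUEST;
  return requestCost + gbSeconds * PRICE_PER_GB_SECOND;
}

// Hypothetical: 5 invocations per payment, 300 ms each, 512 MB.
const cost = lambdaCostPerPayment({ invocationsPerPayment: 5, avgDurationMs: 300, memoryMb: 512 });
console.log(cost.toFixed(7)); // ≈ $0.0000135, i.e. fractions of a cent per payment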
And that's essentially a target we look at all the time. And there is alignment, exactly as you said, Ori - in effect, when we see that a business uses us more, or makes more payments, the relative share of premium payments happening there goes up. So it's still economical for us to aim to get the engagement up - because we know that afterwards it's possible to derive more revenue. (Ran) This topic of unit economics - I totally identify with it; I'm also in a place where it's very hard to understand what things cost, and I know it's significant - but I wonder how significant it even is: the cloud cost component for you today - is it even significant? Do you even pay attention to it at this stage of the growth? (Ori) ...as a percentage of revenue, the cost of sales... (Ilan) Let's put it this way, if I can, as they say, "share without sharing"... (Ran) If the investors aren't listening... [but maybe reading?] (Ilan) We have business costs that aren't the cloud's costs, in effect - the costs with the banks, with the "natural" partners, let's call them. And when you look at the overall picture, when you include the cloud's cost in it - it's not so scary. (Ran) That's fine, and I think many companies are in a place like that, mainly at the growth stage, where there are far more significant costs - and they deliberately "pour money" on the cloud, let's call it. The problem is that they then reach a point from which it's very hard to return, of "OK, now I want to cut the cloud costs - but now it's already really, really hard". [For avid listeners - 421 The Cost of Cloud, a Trillion Dollar Paradox with Martin Casado and 418 Carboretor 31 Cost of cloud paradox] (Ilan) I'll be honest with you - that's a very... that consideration crosses our minds too. At the end of the day, when we said we want to be management-less, we prefer to concentrate on the core business, because Melio is a company that's growing - growing, and growing very fast. In the last year we grew the activity volume by 5,000 percent... Covid, the coronavirus, gave a very big boost for businesses to get rid of anything physical, or of meeting each other in order to make payments, and to move to online payments. By the way - Serverless, or the Lambdas, helped us scale out very well - we built the system from the outset so we could scale out well, and that helped us in the truly very fast growth that happened to us. But to your point - yes, we're much more focused on our ability to grow the business than on going and figuring out how we'd save on processing costs. (Ran) But you're doing the prep work for the air conditioner? I mean - at some point you'll install that AC unit... (Ilan) Totally... at payments companies it's customary to understand, in effect, "how much a payment costs". When I look for a second at... Melio has, in effect, executed millions of transactions - what's my total cost, from the processing pipeline, partner costs - per transaction. The ability to compute that is a very important ability for bringing the business to scale. (Ran) So we mentioned you're growing - you didn't say where you live... where's the office? (Ilan) Our office is in Tel Aviv, on HaArba'a Street, the HaArba'a Towers. Very accessible in the sense of "close to the train" - very accessible for whoever is outside Tel Aviv, very accessible for whoever is inside Tel Aviv. Nice, new offices, two floors - and growing... (Ran) What are you looking for today? (Ilan) Today, Engineering at Melio is about 80 people, in four groups - and we want to double the size of the group, the engineering group, within the coming year.
We're looking for a bit of "everything of everything" - looking for full-stack engineers, more for the teams that are product-facing, dealing with the experience. We haven't talked about it much today, but there is an experience - one of the things, and Or mentioned it a bit; we mostly talked about the payments processing, but at the end of the day we sell an experience - an experience that has to be very, very convenient and simple for a small-business owner, for managing his payments. So there are product-facing teams, which are mainly full-stack engineers. We're looking for data science - because Melio does all of the risk for the payments, because risk "doesn't exist" in all the B2B worlds as something off the shelf, so we had to develop all the models ourselves. So also big-data engineers and data scientists for the risk and data groups. And backend engineers for the payment processing we just talked about... (Ran) Great - so good luck, thanks a lot for the visit. The check is in the mail - goodbye! Happy listening, and many thanks to Ofer Purer for the transcription!

Screaming in the Cloud
The Value of Analysts and Observability with Nick Heudecker

Screaming in the Cloud

Play Episode Listen Later Oct 20, 2021 40:42


About Nick
Nick Heudecker leads market strategy and competitive intelligence at Cribl, the observability pipeline company. Prior to Cribl, Nick spent eight years as an industry analyst at Gartner, covering data and analytics. Before that, he led engineering and product teams at multiple startups, with a bias towards open source software and adoption, and served as a cryptologist in the US Navy. Join Corey and Nick as they discuss the differences between observability and monitoring, why organizations struggle to get value from observability data, why observability requires new data management approaches, how observability pipelines are creating opportunities for SRE and SecOps teams, the balance between budgets and insight, why goats are the world's best mammal, and more.Links: Cribl: https://cribl.io/ Cribl Community: https://cribl.io/community Twitter: https://twitter.com/nheudecker Try Cribl hosted solution: https://cribl.cloud 
Transcript
Announcer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by Thinkst. This is going to take a minute to explain, so bear with me. I linked against an early version of their tool, canarytokens.org in the very early days of my newsletter, and what it does is relatively simple and straightforward. It winds up embedding credentials, files, that sort of thing in various parts of your environment, wherever you want to; it gives you fake AWS API credentials, for example. And the only thing that these things do is alert you whenever someone attempts to use those things. It's an awesome approach. I've used something similar for years. Check them out. But wait, there's more. They also have an enterprise option that you should be very much aware of canary.tools. You can take a look at this, but what it does is it provides an enterprise approach to drive these things throughout your entire environment. You can get a physical device that hangs out on your network and impersonates whatever you want to. When it gets Nmap scanned, or someone attempts to log into it, or access files on it, you get instant alerts. It's awesome. If you don't do something like this, you're likely to find out that you've gotten breached, the hard way. Take a look at this. It's one of those few things that I look at and say, “Wow, that is an amazing idea. I love it.” That's canarytokens.org and canary.tools. The first one is free. The second one is enterprise-y. Take a look. I'm a big fan of this. More from them in the coming weeks.Corey: This episode is sponsored in part by our friends at Jellyfish. So, you're sitting in front of your office chair, bleary-eyed, parked in front of a PowerPoint and—oh my sweet feathery Jesus it's the night before the board meeting, because of course it is! As you slot that crappy screenshot of traffic light colored excel tables into your deck, or sift through endless spreadsheets looking for just the right data set, have you ever wondered, why is it that sales and marketing get all this shiny, awesome analytics and inside tools? Whereas, engineering basically gets left with the dregs. Well, the founders of Jellyfish certainly did. 
That's why they created the Jellyfish Engineering Management Platform, but don't you dare call it JEMP! Designed to make it simple to analyze your engineering organization, Jellyfish ingests signals from your tech stack. Including JIRA, Git, and collaborative tools. Yes, depressing to think of those things as your tech stack, but this is 2021. They use that to create a model that accurately reflects just how the breakdown of engineering work aligns with your wider business objectives. In other words, it translates from code into spreadsheet. When you have to explain what you're doing from an engineering perspective to people whose primary IDE is Microsoft Powerpoint, consider Jellyfish. That's Jellyfish.co and tell them Corey sent you! Watch for the wince, that's my favorite part.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. This promoted episode is a bit fun because I'm joined by someone that I have a fair bit in common with. Sure, I moonlight sometimes as an analyst because I don't really seem to know what that means, and he spent significant amounts of time as a VP analyst at Gartner. But more importantly than that, a lot of the reason that I am the way that I am is that I spent almost a decade growing up in Maine, and in Maine, there's not a lot to do other than sit inside for the nine months of winter every year and develop personality problems.You've already seen what that looks like with me. Please welcome Nick Heudecker, who presumably will disprove that, but maybe not. He is currently a senior director of market strategy and competitive intelligence at Cribl. Nick, thanks for joining me.Nick: Thanks for having me. Excited to be here.Corey: So, let's start at the very beginning. I like playing with people's titles, and you certainly have a lofty one. ‘competitive intelligence' feels an awful lot like Jeopardy. What am I missing?Nick: Well, I'm basically an internal analyst at the company. So, I spend a lot of time looking at the broader market, seeing what trends are happening out there; looking at what kind of thought leadership content that I can create to help people discover Cribl, get interested in the products and services that we offer. So, I'm mostly—you mentioned my time in Maine. I was a cryptologist in the Navy and I spent almost all of my time focused on what the bad guys do. And in this job, I focus on what our potential competitors do in the market. So, I'm very externally focused. Does that help? Does that explain it?Corey: No, it absolutely does. I mean, you folks have been sponsoring our nonsense for which we thank you, but the biggest problem that I have with telling the story of Cribl was that originally—initially it was, from my perspective, “What is this hokey nonsense?” And then I learned and got an answer and then finished the sentence with, “And where can I buy it?” Because it seems that the big competitive threat that you have is something crappy that some rando sysadmin has cobbled together. And I say that as the rando sysadmin who has cobbled a lot of things like that together. And it's awful. I wasn't aware you folks had direct competitors.Nick: Today we don't. There's a couple that might be emerging a little bit, but in general, no, it's mostly us, and that's what I analyze every day. Are there other emerging companies in the space? Are there open-source projects? But you're right, most of the things that we compete against are DIY today. 
Absolutely.Corey: In your previous role, which you were at for a very long time in tech terms—which in a lot of other cases is, “Okay, that doesn't seem that long,” but seven and a half years is a respectable stint at a company. And you were at Gartner doing a number of analyst-like activities. Let's start at the beginning because I assure you, I'm asking this purely for the audience and not because I don't know the answer myself, but what exactly is the purpose of an analyst firm, of which Gartner is the most broadly known and, follow up, why do companies care what Gartner thinks?Nick: Yeah. It's a good question, one that I answer a lot. So, what is the purpose of an analyst firm? The purpose of an analyst firm is to get impartial information about something, whether that is supply chain technology, big data tech, human resource management technologies. And it's often difficult if you're an end-user and you're interested in, say, acquiring a new piece of technology, to know what really works well and what doesn't.And so, at the analyst firm—because in the course of a given year I would talk to nearly a thousand companies, both end-users and vendors as well as investors, about what they're doing and what challenges they're having—I would distill that down into 30-minute conversations with everyone else. And so we provided impartial information in aggregate to people who just wanted help. And that's the purpose of an analyst firm. Your second question, why do people care? Well, I didn't get paid by vendors.I got paid by the company that I worked for, and so I got to be Tron; I fought for the users. And because I talked to so many different companies in different geographies and different industries, and I shared that information with my colleagues and they shared with me, we had a very robust understanding of what's actually happening in any technology market. And that's an uncommon kind of insight to really have in any kind of industry. So, that's the purpose and that's why people care.
And I want you to succeed, I want you to be successful, I want to carry on a relationship with all the clients that I would speak with, and so one of the fun things I would always ask is, “Why are you asking me this question now?” Sometimes they would come in very innocuous: “Compare these databases,” or, “Compare these cloud services.” “Well, why are you asking?” And that's when you get to, kind of like, the psychology of it.“Oh, we just hired a new CIO and he or she hates vendor X, so we have to get rid of it.” “Well, all right. Let's figure out how we solve this problem for you.” And so it wasn't always just technology comparisons. Technology is easy, you write a check and you hope for the best.But when you're dealing with large teams and maybe a globally distributed company, it really comes down to culture, and personality, and all the harder factors. And so it was always—those were always the most fun and certainly the most challenging conversations to have.Corey: One challenge that I find in this space is—in my narrow niche of the world where I focus on AWS bills, where things are extraordinarily yes or no, black or white, binary choices—that I talked to companies, like during the pandemic, and they were super happy that, “Oh, yeah. Our infrastructure has auto-scaling and it works super well.” And I look at the bill and the spend graph over time is so flat you could basically play a game of pool on top of it. And I don't believe that I'm talking to people who are lying to me. I truly don't believe that people make that decision, but what they believe versus what is evidenced in reality are not necessarily congruent. How do you disambiguate from the stories that people want to tell about themselves? And what they're actually doing?Nick: You have to unpack it. I think you have to ask a series of questions to figure out what their motivation is. Who else is on the call, as well? I would sometimes drop into a phone call and there would be a dozen people on the line. Those inquiry calls would go the worst because everyone wants to stake a claim, everyone wants to be heard, no one's going to be honest with you or with anyone else on the call.So, you typically need to have a pretty personal conversation about what does this person want to accomplish, what does the company want to accomplish, and what are the factors pushing against those things? It's like a novel, right? You have a character, the character wants to achieve something, and there are multiple obstacles in that person's way. And so by act five, ideally everything wraps up and it's perfect. And so my job is to get the character out of the tree that is on fire and onto the beach where the person can relax.So, you have to unpack a lot of different questions and answers to figure out, well, are they telling me what their boss wants to hear or are they really looking for help? Sometimes you're successful, sometimes you're not. Not everyone does want to be open and honest. In other cases, you would have a team show up to a call with maybe a junior engineer and they really just want you to tell them that the junior engineer's architecture is not a good idea. And so you do a lot of couples therapy as well. I don't know if this is really answering the question for you, but there are no easy answers. And people are defensive, they have biases, companies overall are risk-averse. 
I think you know this.Corey: Oh, yeah.Nick: And so it can be difficult to get to the bottom of what their real motivation is.Corey: My approach has always been that if you want serious data, you go talk to Gartner. If you want [anec-data 00:09:48] and some understanding, well, maybe we can have that conversation, but they're empowering different decisions at different levels, and that's fine. To be clear, I do not consider Gartner to be a competitor to what I do in any respect. It turns out that I am not very good at drawing charts in varying shades of blue and positioning things just so with repeatable methodology, and they're not particularly good at having cartoon animals as their mascot that they put into ridiculous situations. We each have our portion of the universe, and that's working out reasonably well.Nick: Well, and there's also something to unpack there as well because I would say that people look at Gartner and they think they have a lot of data. To a certain degree they do, but a lot of it is not quantifiable data. If you look at a firm like IDC, they specialize in—like, they are a data house; that is what they do. And so their view of the world and how they advise their clients is different. So, even within analyst firms, there is differentiation in what approach they take, how consultative they might be with their clients, one versus another. So, there certainly are differences that you could find the more exposure you get into the industry.Corey: For a while, I've been making a recurring joke that Route 53—Amazon's managed DNS service—is in fact a database. And then at some point, I saw a post on Reddit where someone said, “Yeah, I see the joke and it's great, but why should I actually not do this?” At which point I had to jump in and say, “Okay, look. Jokes are all well and good, but as soon as people start taking me seriously, it's very much time to come clean.” Because I think that's the only ethical and responsible thing to do in this ecosystem.Similarly, there was another great joke once upon a time. It was an April Fool's Day prank, and Google put out a paper about this thing they called MapReduce. Hilarious prank that Yahoo fell for hook, line, and sinker, and wound up building Hadoop out of it and we're still paying the price for that, years later. You have a bit of a reputation from your time at Gartner as being—and I quote—“The man who killed Hadoop.” What happened there? What's the story? And I appreciate your finally making clear to the rest of us that it was, in fact, a joke. What happened there?Nick: Well, one of the pieces of research that Gartner puts out every year is this thing called a Hype Cycle. And we've all seen it, it looks like a roller coaster in profile; big mountain goes up really high and then comes down steeply, drops into a valley, and then—Corey: ‘the trough of disillusionment,' as I recall.Nick: Yes, my favorite. And then plateaus out. And one of the profiles on that curve was Hadoop distributions. And after years of taking inquiry calls, and writing documents, and speaking with everybody about what they were doing, we realized that this really isn't taking off like everyone thinks it is. Cluster sizes weren't getting bigger, people were having a lot of challenges with the complexity, people couldn't find skills to run it themselves if they wanted to.And then the cloud providers came in and said, “Well, we'll make a lot of this really simple for you, and we'll get rid of HDFS,” which is—was a good idea, but it didn't really scale well. 
I think that the challenge of having to acquire computers with compute, storage, and memory again, and again, and again just was not sustainable for the majority of enterprises. And so we flagged it as: this will be obsolete before plateau. And at that point, we got a lot of hate mail, but it just seemed like the right decision to make, right? Once again, we're Tron; we fight for the users.

And that seemed like the right advice and direction to provide to the end-users. And so I didn't make a lot of friends, but I think I was long-term right about what happened in the Hadoop space. Certainly, some fragments of it are left over and we're still seeing—you know, Spark is going strong, there's a lot of Hive still around—but Hadoop as this amalgamation of open-source projects, I think, is effectively dead.

Corey: I sure hope you're right. I think it has a long tail like most things that are there. Legacy is the condescending engineering term for ‘it makes money.' You were at Gartner for almost eight years and then you left to go work at Cribl. What triggered that? What was it that made you decide, “This is great. I've been here a long time. I've obviously made it work for me. I'm going to go work at a startup that apparently, even though it recently raised a $200 million funding round”—congratulations on that, by the way—“still apparently can't afford to buy a vowel in its name.” That's C-R-I-B-L because, of course, it is. Maybe another consonant, while you're shopping. But okay, great. It's oddly spelled, it is hard to explain, in some cases, to folks who are not already feeling pain in that space. What was it that made you sit up and say, “All right, this is where I want to be”?

Nick: Well, I met the co-founders when I was an analyst. They were working at Splunk and, oddly enough—this is going to be an interesting transition compared to the previous thing we talked about—they were working on Hunk, which was: let's use HDFS to store Splunk data. Made a lot of sense, right? It could be much more cost-effective than high-cost infrastructure for Splunk. And so they told me about this; I was interested.

And so I met the co-founders and then I reconnected with them after they left and formed Cribl. And I thought the story was really cool because where they're sitting is between sources and destinations of observability data. And they were solving a problem that all of my customers had, but they couldn't resolve. They would try and build it themselves. They would look at—Kafka was a popular choice, but that had some challenges for observability data—it works fantastically well for application data.

And they just had a very pragmatic view of the world that they were inhabiting and the problem that they were looking to solve. And it looked kind of like a no-brainer of a problem to solve. But when you double-click on it, when you really look down and say, “All right, what are the challenges with doing this?” they're really insurmountable for a lot of organizations. So, even though they may try and take a DIY approach, they often run into trouble after just a few weeks because of all the protocols you have to support, all the different data formats, and all the destinations, and role-based access control, and everything else that goes along with it.

And so I really liked the team.
I thought the product inhabited a unique space in the market—we've already talked about the lack of competitors in the space—and I just felt like the company was on a rocket ship—or is a rocket ship—that basically had unbounded success potential. And so when the opportunity arose to join the team and do a lot of the things I like doing as an analyst—examining the market, talking to people, looking at competitive aspects—I jumped at it.

Corey: It's nice when you see those opportunities that show up in front of you, and the stars sort of align. It's like, this is not just something that I'm excited about and enthused about, but hey, they can use me. I can add something to where they're going and help them get there better, faster, sooner, et cetera, et cetera.

Nick: When you're an analyst, you look at dozens of companies a month and I'd never seen an opportunity that looked like that. Everything kind of looked the same. There's a bunch of data integration companies, there's a bunch of companies with Spark and things like that, but this company was unique; the product was unique, and no one was really recognizing the opportunity. So, it was just a great set of things that all happened at the same time.

Corey: It's always fun to see stars align like that. So—

Nick: Yeah.

Corey: —help me understand, in a way that can be articulated to folks who don't have 15 years of grumpy sysadmin experience under their belts, what does Cribl do?

Nick: So, Cribl does a couple of things. Our flagship product is called LogStream, and the easiest way to describe that is as an abstraction between sources and destinations of data. And that doesn't sound very interesting, but from your sysadmin background, you're always dealing with events, logs, now there's traces, metrics are also hanging around—

Corey: Oh, and of course, the time is never synchronized with anything either, so it's sort of a giant whodunit mystery, where half the eyewitnesses lie.

Nick: Well, there's that. There's a lot of data silos. If you've got an agent deployed on a system, it's only going to talk to one destination platform. And you repeat this, maybe a dozen times per server, and you might have 100,000 or 200,000 servers, with all of these different agents running on them, each one locked into one destination. So, you might want to be able to mix and match that data; you can't. You're locked in.

One of the things LogStream does is it lets you do that exact mixing and matching. Another thing LogStream does is it gives you the ability to manage that data. And what I mean by that is, you may want to reduce how much stuff you're sending into a given platform, because maybe that platform charges you by your daily ingest rates or some other kind of event-based charges. And so not all that data is valuable, so why pay to store it if it's not going to be valuable? Just dump it, or reduce the amount of volume that you've got in that payload, like a Windows XML log.

And so that's another aspect that it allows you to do, better management of that stuff. You can redact sensitive fields, you can enrich the data with, say, GeoIPs, so you know what kind of data privacy laws you fall under, and so on. And the story has always been: land the data in your destination platform first, then do all those things. Well, of course, because that's how they charge you; they charge you based on daily ingest.
And so now the story is: make those decisions upfront, in one place, without having to spread this logic all over, and then send the data where you want it to go.

So, that's really the core product today, LogStream. We call ourselves an observability pipeline for observability data. The other thing we've got going on is this project called AppScope, and I think this is pretty cool. AppScope is a black-box instrumentation tool that basically resides between the application runtime and the kernel and any shared libraries. And so it provides—without you having to go back and instrument code—it instruments the application for you, based on every call that it makes, and then can send that data through something like LogStream or to another destination.

So, you don't have to go back and say, “Well, I'm going to try and find the source code for this 30-year-old C++ application.” I can simply run AppScope against the process, find out exactly what that application is doing for me, and then relay that information to some other destination.

Corey: This episode is sponsored in part by Liquibase. If you're anything like me, you've screwed up the database part of a deployment so severely that you've been banned from touching anything that remotely sounds like SQL at at least three different companies. We've mostly got code deployments solved for, but when it comes to databases we basically rely on desperate hope, with a rollback plan of keeping our resumes up to date. It doesn't have to be that way. Meet Liquibase. It is both an open source project and a commercial offering. Liquibase lets you track, modify, and automate database schema changes across almost any database, with guardrails to ensure you'll still have a company left after you deploy the change. No matter where your database lives, Liquibase can help you solve your database deployment issues. Check them out today at liquibase.com. Offer does not apply to Route 53.

Corey: I have to ask because I love what you're doing, don't get me wrong. The counterargument that always comes up in this type of conversation is, “Who in their right mind looks at the state of the industry today and says, ‘You know what we need? That's right; another observability tool.'” What differentiates what you folks are building from a lot of the existing names in the space? And to be clear, a lot of the existing names in the space are treating observability simply as hipster monitoring. I'm not entirely sure they're wrong, but that's a different fight for a different time.

Nick: Yeah. I'm happy to come back and talk about that aspect of it, too. What's different about what we're doing is we don't care where the data goes. We don't have a dog in that fight. We want you to have better control over where it goes and what kind of shape it's in when it gets there.

And so I'll give an example. One of our customers wanted to deploy a new SIEM—Security Information and Event Management—tool. But they didn't want to have to deploy a couple hundred-thousand new agents to go along with it. They already had the data coming in from another agent, they just couldn't get the data to it. So, they used LogStream to send that data to their new desired platform.

Worked great. They were able to go from zero to a brand-new platform in just a couple days, versus fighting with rolling out agents and having to update them. Would they conflict with existing agents? How much performance impact would they have on the servers, and so on? So, we don't care about the destination. We like everybody. We're agnostic when it comes to where that data goes. And—

Corey: Oh, it's not about the destination. It's about the journey. Everyone's been saying it, but you've turned it into a product.

Nick: It's very spiritual. So, we [laugh] send your observability data on a spiritual [laugh] journey to its destination, and we can do quite a bit with it on the way.
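For readers who want to see the shape of that "quite a bit": the sketch below is a toy pipeline stage that redacts, enriches, reduces, and routes events sitting between sources and destinations. This is illustrative Python only, not Cribl's API; every field name and destination here is made up.

    import re

    # Toy observability pipeline: redact, enrich, reduce, then route.
    # All event fields and destination names are hypothetical.

    def redact(event):
        # Mask anything that looks like a payment card number.
        event["message"] = re.sub(r"\b\d{13,16}\b", "[REDACTED]", event["message"])
        return event

    def enrich(event):
        # Stand-in for a GeoIP lookup; a real pipeline would consult a database.
        event["geo"] = "EU" if event.get("src_ip", "").startswith("10.1.") else "US"
        return event

    def route(event):
        # Security events go to the SIEM; noisy debug logs are dropped
        # (reduce: don't pay ingest fees to store them); the rest goes
        # to an analytics store.
        if event.get("level") == "debug":
            return []
        if event.get("source") == "auth":
            return ["siem"]
        return ["analytics"]

    def process(events):
        for e in events:
            e = enrich(redact(e))
            for destination in route(e):
                yield destination, e

    if __name__ == "__main__":
        sample = [
            {"source": "auth", "level": "info", "src_ip": "10.1.2.3",
             "message": "login ok card=4111111111111111"},
            {"source": "app", "level": "debug", "message": "cache miss"},
        ]
        for dest, event in process(sample):
            print(dest, event)

The point of the pattern is that reduction, redaction, and enrichment happen once, before ingest-based billing, rather than separately inside each destination tool.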
Corey: So, you said you offered to go back and visit the, “Oh, it's monitoring, but we're going to call it observability because otherwise we get yelled at on Twitter by Charity Majors” topic. How do you view that?

Nick: Monitoring is the things you already know, right? You know what questions you want to ask, you get an alert if something goes out of bounds or something goes from green to red. Think about monitoring as a data warehouse. You shape your data, you get it all in just the right condition so you can ask the same question over and over again, over different time domains.

That's how I think about monitoring. It's prepackaged, you know exactly what you want to do with it. Observability is more like a data lake. I have no idea what I'm going to do with this stuff. I think there's going to be some signals in here that I can use, and I'm going to go explore that data.

So, if monitoring is your known knowns, observability is your unknown unknowns. An ideal observability solution gives you an opportunity to discover what those are. Once you discover them, great. Now, you can talk about how to get them into your monitoring system. So, for me, it's kind of a process of discovery.

Corey: Which makes an awful lot of sense. The problem I've always had with the monitoring approach is it falls into this terrible pattern of ‘enumerate the badness.' In other words, “Imagine all the ways that this system can fail,” and then build alerting that lets you know when any of those things happen. And what happens next is inevitable to anyone who's ever dealt with the tricksy devils known as computers: they find new ways to fail, and you generally get to add to the list of things to check for, usually at two o'clock in the morning.

Nick: On a Sunday.

Corey: Oh, absolutely. It almost doesn't matter when. The real problem is when these things happen, it's, “What day, actually, is it?” And you have to check the calendar to figure it out, because it's your third time that week being woken up in the dead of night. It's like an infant, but less than endearing.

So, that has been the old-school approach, and there's unfortunately still an awful lot of, we'll just call it nonsense, in the industry that still does exactly the same thing, except now they call it observability because—hearkening back to earlier in our conversation—there's a certain point in the Gartner Hype Cycle that we are all existing within. What's the deal with that?

Nick: Well, I think that there are a lot of entrenched interests in the monitoring space. And so I think you always see this when a new term comes around. Vendors will say, “All right, well, there's a lot of confusion about this.
Let me back-fit my product into this term so that I can continue to look like I'm on the leading edge and I'm not going to put any of my revenues in jeopardy.” I know that's a cynical view, but I've seen it over and over again.

And I think that's unfortunate, because there's a real opportunity to have a better understanding of your systems, to better understand what's happening in all the containers you're deploying and not tearing down the way that you should, to better understand what's happening in distributed systems. And it's going to be a real missed opportunity if that is what happens. If we just call this ‘Monitoring 2.0,' it's going to leave a lot of unrealized potential in the market.

Corey: The big problem that I've seen in a lot of different areas is—I'll be direct—consolidation, where you have a company that starts to do a thing—and that's great—and then they start doing other things that are tied to it. And in turn, they start, I guess, gathering everything in the ecosystem. If you break down observability into various constituent parts—I know, I know, the pillars thing is going to upset people; ignore that for now—and if you have an offering that's weak in a particular area, okay, instead of building it organically into the product, or saying, “Yeah, that's not what we do,” there's an instinct to acquire a company or build that functionality out. And it turns out that we're building what feels a lot to me like the SaaS equivalent of multifunction printers: they can print, they can scan, they can fax, and none of those three very well, so it winds up with something that dissatisfies everyone, rather than a best-of-breed solution that has a very clear and narrow starting and stopping point. How do you view that?

Nick: Well, what you've described is a compromise, right? A compromise is where everyone can work and no one's happy. And I think that's the advantage of where LogStream comes in. The reality is best-of-breed. Most enterprises today have 30 or more different monitoring tools—call them observability tools if you want to—and you will never pry those tools from the dead hands of those sysadmins, DevOps engineers, SREs, et cetera.

They all integrate those tools into how they work and their processes. So, we're living in a best-of-breed world. It's like that in data and analytics—my former beat—and it's like that in monitoring and observability. People really gravitate towards the tools they like, they gravitate towards the tools their friends are using. And so you need a way to be able to mix and match that stuff.

And just because I want to stay [laugh] on message, that's really where the LogStream story kind of blends in, because we do that; we allow you to mix and match all those different pieces.

Corey: Joke's on you. I use Nagios and I have no friends. I'm not convinced those two things are entirely unrelated, but here we are. So here's, I guess, the big burning question that a lot of folks—certainly not me, but other undefined folks, ‘lots of people are saying'—have been asking. So: you built something interesting that actually works. I want to be clear on this.

I have spoken to customers of yours. They swear by it instead of swearing at it, which happens with other companies. Awesome. You have traction, you're moving forward, things are going great. “Here's $200 million” is the next part of that story, and on some level, my immediate reaction—which does need updating, let's be clear here—is like, all right: I'm trying to build a product. I can see how I could spend a few million bucks.
“Well, what can you do with, I don't know, 100 times that?” My easy answer is, “Something monstrous.” I don't believe that is the case here. What is the growth plan? What are you doing that makes having that kind of a war chest a useful and valuable thing to have?

Nick: Well, if you speak with the co-founders—and they've been open about this—we view ourselves as a generational company. We're not just building one product. We've been thinking about, how do we deliver on observability as this idea of discovery? What does that take? And it doesn't mean that we're going to be less agnostic to other destinations; we still think there's an incredible amount of value there and that's not going away. But we think there's maybe an interim step that we build out, potentially this idea of an observability data lake where you can explore these environments.

Certainly, there are other types of options in the space today. Most of them are SQL-based, which is interesting because the audience that uses monitoring and observability tools couldn't care less about SQL, right? They want search, they want regex, and so you've got to have the right tool for that audience. And so we're thinking about what that looks like going forward. We're doubling down on people.

Surprisingly—like anything else in software—this is very people-intensive. And so certainly those are other aspects that we're exploring with the recent investment, but definitely, a multiproduct company is our future, and continued expansion.

Corey: Expansion is always a fun one. It's the idea of, great, are you looking at going deeper into the areas you're already active within, or is it more of a, “Ah, so we've solved the, effectively, log routing problem. That's great. Let's solve other problems, too.” Or is it more of, I guess, a doubling down and focusing on what's working? And again, that probably sounds judgmental in a way I don't intend it to at all. I just have a hard time contextualizing that level of scale, coming from a small company perspective the way that I do.

Nick: Yeah. Our plan is to focus more intently on the areas that we're in. We have a huge basis of experience there. We don't want to be all things to all people; that dilutes the message down to nothing. So we want to be very specific in the audiences we talk to, the problems we're trying to solve, and how we try to solve them.

Corey: The problem I've always found with a lot of the acquisition growth-thrashing of—let me call it what I think it is: companies in decline straining for relevancy—it feels almost like a, “We don't see a growth strategy, so we're going to try and acquire everything that holds still long enough,” at some level trying to add more revenue to the pile, but also thrashing in the sense of, “Okay, they're going to teach us how to do things in creative, awesome ways.” But it never works out that way. When you have a 50,000-person company acquiring a 200-person company, invariably the bigger culture is going to dominate. And I don't understand why that mistake seems to continually happen again, and again, and again.

And people think I'm effectively alluding to—or whatever the spoken-word version of subtweeting is—a particular company or a particular acquisition. I'm absolutely not; there are probably 50 different companies listening right now who think, “Oh, God. He's talking about us.” It's the common repeating trend. What is that?

Nick: It's hard to say. In some cases, these acquisitions might just be talent. “We need to know how to do X.
They know how to do X. Let's do it.” They may have very unique niche technology or software that another company thinks it can more broadly apply.

Also, at some of these big companies, these may not be board-level or CEO-level decisions. A business unit might decide, “Oh, I like what that company is doing. I'm going to go acquire it.” And so it looks like MegaCorp bought TinyCorp, but it's really this tiny business unit within MegaCorp that bought the tiny company. The reality is often different from what it looks like on the outside.

So, that's one way. Another is, you know, if they're going to teach us to be more effective with tech or something like that: you're never going to beat culture. You're never going to beat the existing culture. If it's 50,000 against 200, obviously we know who wins there. And so I don't know if that's realistic.

I don't know if the big companies are genuine when they say that, but it could just be the messaging that they use to make people happy and hopefully retain as many of those new employees for as long as they can. Does that make sense?

Corey: No, it makes perfect sense. It's the right answer. It does articulate what is happening there, and I think I keep falling prey to the same failure. And it's hard. It's pernicious, but companies are not monolithic entities.

There's no one person at all of these companies who is making these giant unilateral decisions. It's always some product manager or some particular person who has a vision and a strategy in their department. It's not as though the company board agrees on every little decision that gets made. They're distributed entities in many respects.

Nick: Absolutely. And that's only getting more pervasive as companies get larger [laugh] through acquisition. So, you're going to see more and more of that, and it's going to look like we're going to put one label on it, one brand. Often, I think internally, that's the exact opposite of what actually happened, of how that decision got made.

Corey: Nick, I want to thank you for taking so much time to speak with me about what you're up to over there, how your path has shaped how you view the world, and also what Cribl does these days. If people want to learn more about what you're up to, how you think about the world, or even possibly going to work at Cribl—which, having spoken to a number of people over there, I would endorse—how do they find you?

Nick: The best place to find us is by joining our community: cribl.io/community, and Cribl is spelled C-R-I-B-L. You can certainly reach out there; we've got about 2,300 people in our community Slack, so it's a great group. You can also reach out to me on Twitter, I'm @nheudecker, N-H-E-U-D-E-C-K-E-R. Tell me what you thought of the episode; I'd love to hear it. And then beyond that, you can also sign up for our free cloud tier at cribl.cloud. It's a pretty generous one terabyte a day of processing, so you can start to send data in and send it wherever you'd like.

Corey: To be clear, this is free as in beer, not free as in AWS free tier?

Nick: This is free as in beer.

Corey: Excellent. Excellent.

Nick: I think I'm getting that right. I think it's free as in beer.
And the other thing you can try is our hosted solution on AWS, fully managed, at cribl.cloud. We offer a free one terabyte per day of processing, so you can start to send data into that environment and send it wherever you'd like it to go, in whatever shape that data needs to be in when it gets there.

Corey: And we will, of course, put links to that in the [show notes 00:35:21]. Thank you so much for your time today. I really appreciate it.

Nick: No, thank you for having me. This was a lot of fun.

Corey: Nick Heudecker, senior director, market strategy and competitive intelligence at Cribl. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with a comment explaining that the only real reason a startup should raise a $200 million funding round is to pay that month's AWS bill.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.

Microsoft Research India Podcast
Accelerating AI Innovation by Optimizing Infrastructure. With Dr. Muthian Sivathanu

Microsoft Research India Podcast

Play Episode Listen Later Sep 29, 2021 27:31


Episode 010 | September 28, 2021

Artificial intelligence, machine learning, deep learning, and deep neural networks are today critical to the success of many industries. But they are also extremely compute-intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.

Muthian's interests lie broadly in the space of large-scale distributed systems, storage, and systems for deep learning, blockchains, and information retrieval. Prior to joining Microsoft Research, he worked at Google for about 10 years, with a large part of the work focused on building key infrastructure powering Google web search, in particular the query engine for web search. Muthian obtained his Ph.D. from the University of Wisconsin-Madison in 2005 in the area of file and storage systems, and a B.E. from CEG, Anna University, in 2000.

Transcript

Muthian Sivathanu: Continued innovation in systems and efficiency and costs are going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently, and so on. And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, 25 years ago. And we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of advancement in systems. And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in sort of determining the next generation of advancement in AI.

[Music]

Sridhar Vedantham: Welcome to the Microsoft Research India podcast, where we explore cutting-edge research that's impacting technology and society. I'm your host, Sridhar Vedantham.

[Music]

Sridhar Vedantham: Artificial intelligence, machine learning, deep learning, and deep neural networks are today critical to the success of many industries. But they are also extremely compute-intensive and expensive to run in terms of both time and cost, and resource constraints can even slow down the pace of innovation. Join us as we speak to Muthian Sivathanu, Partner Research Manager at Microsoft Research India, about the work he and his colleagues are doing to enable optimal utilization of existing infrastructure to significantly reduce the cost of AI.

[Music]

Sridhar Vedantham: So Muthian, welcome to the podcast, and thanks for making the time for this.

Muthian Sivathanu: Thanks Sridhar, pleasure to be here.

Sridhar Vedantham: And what I'm really looking forward to, given that we seem to be in some kind of final stages of the pandemic, is to actually be able to meet you face to face again after a long time.
Unfortunately, we've had to again do a remote podcast, which isn't all that much fun.

Muthian Sivathanu: Right, right. Yeah, I'm looking forward to the time when we can actually do this again in office.

Sridhar Vedantham: Yeah. Ok, so let me jump right into this. You know, we keep hearing about things like AI and deep learning and deep neural networks and so on and so forth. What's very interesting in all of this is that we tend to hear about the end product, which is, you know, what actually impacts businesses, what impacts consumers, what impacts the healthcare industry, for example, in terms of AI. It's a little bit of a mystery, I think, to a lot of people as to how all this works, because what goes on behind the scenes to actually make AI work is generally not talked about.

Muthian Sivathanu: Yeah.

Sridhar Vedantham: So, before we get into the meat of the podcast, do you just want to speak a little bit about what goes on in the background?

Muthian Sivathanu: Sure. So, machine learning, Sridhar, as you know, and deep learning in particular, is essentially about learning patterns from data. A deep learning system is fed a lot of training examples, examples of input and output, and then it automatically learns a model that fits that data, right. And this is typically called the training phase. So, the training phase is where it takes data and builds a model to fit. Now what is interesting is, once this model is built, which was really meant to fit the training data, the model is really good at answering queries on data that it had never seen before, and this is where it becomes useful. These models are built in various domains. It could be for recognizing an image, for converting speech to text, and so on, right. And what has in particular happened over the last 10 or so years is that there has been significant advancement both on the theory side of machine learning, which is new algorithms and new model structures that do a better job at fitting the input data to a generalizable model, as well as rapid innovation in systems infrastructure, which actually enables the model to do its work, which is very compute-intensive, in a way that's scalable and cost-effective.

Sridhar Vedantham: OK, Muthian, so it sounds like there's a lot of compute actually required to make things like AI and ML happen. Can you give me a sense of what kind of resources are needed, or how intensive the resource requirement is?

Muthian Sivathanu: Yeah. So the resource usage in a machine learning model is a direct function of how many parameters it has. So the more complex the data set, the larger the model gets, and correspondingly it requires more compute resources, right. To give you an idea, the early machine learning models, which performed simple tasks like recognizing digits and so on, could run on a single server machine in a few hours. But just over the last two years, for example, the size of the largest model that achieves state-of-the-art accuracy has grown by nearly three orders of magnitude, right. And what that means is that today, to train these models, you need thousands and thousands of servers, and that's infeasible. Also, accelerators, or GPUs, have really taken over in the last 6-7 years. A single V100 GPU today, a Volta GPU from NVIDIA, can run about 140 trillion operations per second. And you need several hundreds of them to actually train a model like this. And they run for months together. To train a 175-billion-parameter model, the recent GPT-3, you need on the order of thousands of such GPUs, and it still takes a month.

Sridhar Vedantham: A month? That sounds like a humongous amount of time.

Muthian Sivathanu: Exactly, right? So that's why, just as I told you how the advances in the theory of machine learning, in terms of new algorithms, new model structures, and so on, have been crucial to the recent advances in the relevance and practical utility of deep learning, equally important has been this advancement in systems, right. Because given this huge explosion in the compute demands that these workloads place, we need fundamental innovation in systems to actually keep pace, to make sure that you can train these models in reasonable time and at reasonable cost.
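To put rough numbers on those claims: a common back-of-envelope estimate for transformer training cost is about 6 FLOPs per parameter per training token. Assuming GPT-3's published scale (roughly 175 billion parameters and 300 billion training tokens), the per-GPU throughput quoted above, and an assumed cluster size and 30% effective utilization (the last two are illustrative assumptions, not figures from the episode), the arithmetic lands in the same "thousands of GPUs for about a month" territory:

    # Back-of-envelope training-time estimate (all figures approximate).
    params = 175e9          # GPT-3 parameter count
    tokens = 300e9          # approximate published training-token count
    flops_needed = 6 * params * tokens    # common 6*N*D estimate: ~3.2e23 FLOPs

    peak_per_gpu = 140e12   # V100 throughput quoted in the episode
    utilization = 0.30      # assumed effective utilization; real values vary
    gpus = 2000             # assumed cluster size

    seconds = flops_needed / (gpus * peak_per_gpu * utilization)
    print(f"~{seconds / 86400:.0f} days")   # roughly 43 days, on the order of a month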
Sridhar Vedantham: Right. Ok, so you know, for a long time I was generally under the impression that if you wanted to run bigger and bigger models and bigger jobs, essentially you had to throw more hardware at it, because at one point hardware was cheap. But I guess that kind of applies only to the CPU scenario, whereas the GPU scenario tends to become really expensive, right?

Muthian Sivathanu: Yep, yeah.

Sridhar Vedantham: Ok, so in which case, when there is basically some kind of a limit being imposed because of the cost of GPUs, how does one actually go about tackling this problem of scale?

Muthian Sivathanu: Yeah, so the high-level problem ends up being: you have limited resources. You can view this from two perspectives, right. One is from the perspective of a machine learning developer or researcher who wants to build a model to accomplish a particular task. From the perspective of the user, there are two things you need. A, you want to iterate really fast, because deep learning, incidentally, is this special category of machine learning where the exploration is largely by trial and error. So, if you want to know which model actually works, which parameters or which hyperparameter set actually gives you the best accuracy, the only way to really know for sure is to train the model to completion, measure accuracy, and then you would know which model is better, right. So, as you can see, the iteration time, the time to train a model and to run inference on it, directly impacts the rate of progress you can achieve. The second aspect that the machine learning researcher cares about is cost. You want to do it without spending a lot of dollar cost.

Sridhar Vedantham: Right.

Muthian Sivathanu: Now, from the perspective of, let's say, a cloud provider who runs this huge farm of GPUs and offers it as a service for researchers and users to run machine learning models, their objective function is cost, right. So, to support a given workload, you need to support it with as few GPUs as possible. Or in other words, if you have a certain amount of GPU capacity, you want to maximize the utilization, the throughput, you can get out of those GPUs. And that's where a lot of the work we've been doing at MSR has focused: how do you multiplex lots and lots of jobs onto a finite set of GPUs while maximizing the throughput that you can get from them?

Sridhar Vedantham: Right, so I know you and your team have been working on this problem for a while now. Do you want to share with us some of the key insights and some of the results that you've achieved so far? Because it is interesting, right?
Schedulers have been around for a while. It's not that there aren't schedulers, but essentially what you're saying is that the schedulers that exist don't really cut it, given the intensity of the compute requirements as well as the size of the jobs and models that are being run today in deep learning, or even machine learning, right?

Muthian Sivathanu: That's right.

Sridhar Vedantham: So, what are your key insights, and what are some of the results that you guys have achieved?

Muthian Sivathanu: So, you raise a good point. I mean, schedulers for distributed systems have been around for decades, right. But traditional schedulers have to view a job as a black box, because they're meant to run arbitrary jobs, and there is a limit to how efficient they can be. What makes deep learning somewhat special is that, first of all, because deep learning is such a high-impact area, with billions of dollars spent on these GPUs from an economic perspective, there is enough economic incentive to extract the last bit of performance out of these expensive GPUs, right. And that lends itself into this realm of: what if we co-design? What if we custom-design a scheduler for the specific case of deep learning? And that's what we did in the Gandiva project, which we published at OSDI in 2018. What we said was, instead of viewing a deep learning job as just another distributed job which is opaque to us, let's actually exploit some key characteristics that are unique to deep learning jobs, right. And one of those characteristics is that although, as I said, a single deep learning training job can run for days or even months, deep within, it is actually composed of millions and millions of what are called mini-batches. So, what is a mini-batch? A mini-batch is an iteration in the training where it reads one set of input training examples, runs it through the model, and then backpropagates the loss, essentially changing the parameters to fit that input. And this sequence repeats over and over again, across millions and millions of mini-batches. And what makes it particularly interesting and relevant from a systems optimization viewpoint is that, from a resource usage perspective and from a performance perspective, mini-batches are identical. They may be operating on different data in each mini-batch, but the computation they do is pretty much identical. And what that means is we can look at the job for a few mini-batches and know exactly what it is going to do for the rest of its lifetime, right. And that allows us to, for example, automatically decide which hardware generation is the best fit for this job, because you can just measure it on a whole bunch of hardware configurations. Or when you're distributing the job, you can compare it across a whole bunch of parallelism configurations, and you can automatically figure out the right configuration, the right hardware assignment, for this particular job. You couldn't do that for an arbitrary job in a distributed scheduler, because the job could be doing different things at different times. A MapReduce job, for example, would keep fluctuating in how it uses CPU, network, and storage. Whereas with deep learning there is this remarkable repeatability and predictability, right.
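That repeatability is exactly what lets a scheduler profile a job briefly and extrapolate. A minimal sketch of the idea in Python follows; this is illustrative pseudocode, not Gandiva's code, and the job object and its step method are hypothetical stand-ins for a framework's training step:

    import time

    def run_minibatch(job, config):
        # Stand-in for one training iteration: read a batch, forward pass,
        # backpropagate the loss, update parameters.
        job.step(config)

    def profile(job, config, warmup=3, measure=5):
        # Because mini-batches are near-identical in cost, a handful of
        # measured iterations predicts the job's behavior for the
        # millions of iterations that remain.
        for _ in range(warmup):
            run_minibatch(job, config)
        start = time.perf_counter()
        for _ in range(measure):
            run_minibatch(job, config)
        return (time.perf_counter() - start) / measure

    def pick_best(job, candidate_configs):
        # Try each hardware/parallelism configuration briefly; keep the
        # one with the lowest per-mini-batch time.
        timings = {c: profile(job, c) for c in candidate_configs}
        return min(timings, key=timings.get)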
Muthian Sivathanu: What it also allows us to do is look within a mini-batch at what happens. And it turns out, one of the things that happens is, if you look at the memory usage, how much GPU memory the training loop itself is consuming, somewhere in the middle of a mini-batch the memory peaks to almost fill the entire GPU memory, right. And then by the time the mini-batch ends, the memory usage drops by a factor of anywhere between 10 to 50x. So there is this sawtooth pattern in the memory usage. And so one of the things we did in Gandiva was propose this mechanism of transparently migrating a job: the scheduler should be able to checkpoint a job on demand and just move it, maybe even to a different GPU on a different machine, and so on, right. And this is very powerful for load balancing. Lots of scheduling things become easy if you do this. Now, when you are actually moving a job from one machine to another, it helps if the amount of state you need to move is small, right. And that's where this awareness of mini-batch boundaries helps us, because now you can choose exactly when to move it so that you move a 50x smaller amount of state.

Sridhar Vedantham: Right. Very interesting. And another part of this whole thing is, I think, the demands on storage itself, right?

Muthian Sivathanu: Yeah.

Sridhar Vedantham: Because if the models are that big, and you need some really high-powered GPUs to compute, how do you manage the storage requirements?

Muthian Sivathanu: Right, right. So, it turns out the biggest requirement that deep learning places on storage is the throughput that you need from storage, right. As I mentioned, because GPUs are the most expensive resource in this whole infrastructure stack, the single most important objective is to keep GPUs busy all the time. You don't want them idling at all. What that means is the input training data that the model needs in order to run its mini-batches has to be fed to it at a rate sufficient to keep the GPUs busy. And the amount of data that a GPU can process, from a compute perspective, has been growing at a very rapid pace. Between the Volta series and the Ampere series of GPUs, for example, there is a 3x improvement in compute speed, right. Now that means the storage bandwidth should keep up with that pace; otherwise a faster GPU doesn't help, because it will be stalling on I/O. So, in that context, one of the systems we built was a system called Quiver. The standard model for running this training is: the datasets are large, they can be in terabytes, so you place them on some remote cloud storage system, like Azure Blob or something like that, and you read them remotely from whichever machine does the training, right. And that bandwidth simply doesn't cut it, because it goes through network backbone switches and so on, and it becomes insanely expensive to sustain that level of bandwidth from a traditional cloud storage system. So what we need to achieve here is hyper-locality. Ideally, the data should reside on the exact machine that runs the training, so it's a local read, and it has to reside on SSD, and so on, right.
So, you need several gigabytes per second of read bandwidth.

Sridhar Vedantham: And this is to reduce network latency?

Muthian Sivathanu: Yes, this is to reduce network latency and congestion. When traffic goes through lots of back-end switches, T1 switches, T2 switches, etc., the end-to-end throughput that you get across the network is not as much as what you can get locally, right?

Sridhar Vedantham: Right.

Muthian Sivathanu: So, ideally you want to keep the data local on the same machine, but as I said, for some of these models the dataset can be in tens of terabytes. So, what we really need is a distributed cache, so to speak, but a cache that is locality-aware. So what we have is a mechanism by which, within each locality domain, like a rack, for example, we have a copy of the entire training data. A rack could comprise maybe 20 or 30 machines, so across them you can still fit the training data, and then you do peer-to-peer access to the cache across machines in the rack. And within a rack, network bandwidth is not a limitation. You can get nearly the same performance as you could from local SSD. So that's what we did in Quiver. And there are a bunch of challenges here, because if every model wants the entire training data to be local, to be within the rack, then there is just no cache space for keeping all of that.

Sridhar Vedantham: Right.

Muthian Sivathanu: Right. So we have this mechanism by which we can transparently share the cache across multiple jobs, or even multiple users, without compromising security, right. And we do that by intelligent content addressing of the cache entries, so that even though two users may be accessing different copies of the same data, internally in the cache they will refer to the same instance.

Sridhar Vedantham: Right, I was actually just going to ask you that question, about how you maintain the security of data given that you're talking about distributed caching, right? Because it's very possible that multiple users' jobs will be running simultaneously. But that's good, you answered it yourself.
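To make that concrete, the sketch below shows one way a locality-aware, content-addressed cache lookup can work. This is an illustrative toy under stated assumptions, not Quiver's actual design; the peer and remote-store objects are hypothetical stand-ins.

    import hashlib

    class RackCache:
        """Toy locality-aware, content-addressed cache (not Quiver's code)."""

        def __init__(self, rack_peers, remote_store):
            self.rack_peers = rack_peers      # machines in the same rack
            self.remote_store = remote_store  # slow fallback, e.g. a blob store
            self.local = {}                   # this machine's SSD cache

        @staticmethod
        def key(chunk_bytes):
            # Content addressing: the key is derived from the data itself,
            # so identical chunks dedupe across jobs and users without
            # revealing anything beyond "same bytes".
            return hashlib.sha256(chunk_bytes).hexdigest()

        def get(self, key):
            # (Keys would come from a dataset manifest mapping files to
            # chunk hashes.)
            if key in self.local:             # fastest: local SSD
                return self.local[key]
            for peer in self.rack_peers:      # next: peer-to-peer within rack
                chunk = peer.lookup(key)
                if chunk is not None:
                    return chunk
            chunk = self.remote_store.fetch(key)  # last resort: remote storage
            self.local[key] = chunk
            return chunk

The design choice worth noting is that content-derived keys give both deduplication (two users reading identical data hit one cache entry) and a form of isolation, since a key can only be produced by someone who already has the data or its manifest.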
Sridhar Vedantham: So, you know, I've heard you speak a lot about things like micro co-design and so on. How do you bring those principles to bear in these kinds of projects?

Muthian Sivathanu: Right, right. So, I alluded to this a little bit in one of my earlier points, which is the interface. If you look at a traditional scheduler, which views the job as a black box, that is an example of the traditional philosophy of system design, where you build each layer independent of the layer above or below it. There are good reasons to do that, because multiple use cases can use the same underlying infrastructure; if you look at an operating system, it's built to run any process, whether it is Office or a browser or whatever, right.

Sridhar Vedantham: Right.

Muthian Sivathanu: But in workloads like deep learning, which place particularly high demands on compute and are super expensive, there is benefit to relaxing this tight layering to some extent, right. So that's the philosophy we take in Gandiva, for example, where we say the scheduler no longer needs to treat the job as a black box; it can make use of internal knowledge. It can know what mini-batch boundaries are. It can know that mini-batch times are repeatable, and stuff like that, right. So, co-design is a philosophy that has been gaining traction over the last several years, and people typically refer to hardware-software co-design, for example. What we do in micro co-design is take a more pragmatic view of co-design, where we say: look, it's not always possible to rebuild entire software layers from scratch to make them more tightly coupled. The reality is that in existing large systems we have these software stacks, these infrastructure stacks, and what can we do without rocking the ship, without essentially throwing everything away and building from a clean slate? So, what we make are very surgical, carefully-thought-through interface changes that allow us to expose more information from one layer to another, and we also introduce some control points which allow one layer to control another. For example, the scheduler can have a control point to ask a job to suspend. And it turns out, by opening up those carefully-thought-through interface points, you leave the bulk of the infrastructure unchanged and yet achieve the efficiencies that result from richer information and richer control, right. So, micro co-design is something we have been adopting not only in Gandiva and Quiver, but in several other projects in MSR. MICRO stands for Minimally Invasive, Cheap, and Retrofittable Co-design. It's a more pragmatic view of co-design in the context of large cloud infrastructures.

Sridhar Vedantham: Right, where you can do the co-design with minimum disruption to the existing systems.

Muthian Sivathanu: That's right.

Sridhar Vedantham: Excellent.

[Music]

Sridhar Vedantham: We have spoken a lot about the work that you've been doing, and it's quite impressive. Do you have some numbers, in terms of, you know, how much faster jobs will run, or savings of any nature? Do you have any numbers that you can share with us?

Muthian Sivathanu: Yeah, sure. So the numbers, as always, depend on the workload and several aspects. But I can give you some examples. In the Gandiva work that we did, we introduced this ability to time-slice jobs, right. So, the idea is, today when you launch a job on a GPU machine, that job essentially holds on to that machine until it completes, and until that time it has exclusive possession of that GPU; no other job can use it, right. And this is not ideal in several scenarios. One classic example is hyperparameter tuning, where you have a model and you need to decide what exact hyperparameter values, like the learning rate, etc., are the best fit and give the best accuracy for this model. So, people typically do what is called hyperparameter search, where you run maybe 100 instances of the model, see how they're doing, maybe kill some instances, spawn new instances, and so on, right. And hyperparameter exploration really benefits from parallelism. You want to run all these instances at the same time so that you have an apples-to-apples comparison of how they are doing. And if you want to run 100 configurations and you have only 10 GPUs, that significantly slows down hyperparameter exploration: it serializes it, right. What Gandiva has is the ability to perform fine-grained time-slicing of the same GPU across multiple jobs. Just like an operating system time-slices multiple processes, multiple programs, on the same CPU, we do the same in the GPU context, right. And because we make use of mini-batch boundaries and so on, we can do this very efficiently. And with that, we showed that for typical hyperparameter tuning, we can speed up the end-to-end time to accuracy by nearly 5-6x, right. And so this is one example of how time-slicing can help.
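In spirit, that time-slicing looks like a round-robin loop that suspends a job at a mini-batch boundary, where (as noted earlier) the state to checkpoint is smallest, and resumes the next job. A toy sketch follows; it is not Gandiva's implementation, and the job methods are hypothetical:

    from collections import deque

    def time_slice(jobs, slice_minibatches=50):
        # Round-robin one GPU across many jobs, switching only at
        # mini-batch boundaries, where the state to save/restore is
        # smallest.
        queue = deque(jobs)
        while queue:
            job = queue.popleft()
            job.resume_on_gpu()              # restore checkpointed state
            for _ in range(slice_minibatches):
                job.run_minibatch()
                if job.done:
                    break
            if job.done:
                job.release_gpu()
            else:
                job.suspend()                # checkpoint at the boundary
                queue.append(job)            # back of the line

With, say, 100 hyperparameter configurations sharing 10 GPUs this way, every trial makes progress in parallel, which is what makes apples-to-apples early comparison (and early killing of weak trials) possible.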
Muthian Sivathanu: We also saw, from a cluster-wide utilization perspective, that some of the techniques Gandiva adopted can improve overall cluster utilization by 20-30%. And this directly translates to the cost incurred by the cloud provider running those GPUs, because it means that with the same GPU capacity, I can serve 30% more workload, or vice versa: for a given workload, I need 30% fewer GPUs.

Sridhar Vedantham: Yeah, I mean, those savings sound huge, and I think you're also therefore talking about reducing the cost of AI, making the process of AI itself more efficient.

Muthian Sivathanu: That's correct, that's correct. So, the more we are able to extract performance out of the same infrastructure, the cost per model or the cost per user goes down, and so the cost of AI reduces. And for large companies like Microsoft or Google, which have first-party products that require deep learning, like search and Office and so on, it reduces the capital expenditure of running such clusters to support those workloads.

Sridhar Vedantham: Right.

Muthian Sivathanu: And we've also been thinking about other areas. Today there is this limitation that large models need to run in really tightly coupled hyperclusters, which are connected via InfiniBand and so on. And that brings another dimension of cost escalation to the equation, because these are scarce, the networking itself is expensive, there is fragmentation across hyperclusters, and so on. What we showed in some recent work is how you can actually run training of large models on just commodity GPU VMs, without any requirement that they be part of the same InfiniBand cluster or hypercluster; they can be scattered anywhere in the data center. And, more interestingly, we can actually run these off of spot VMs. Azure, AWS, all cloud providers provide these bursty VMs, or low-priority VMs, which are essentially a way for them to sell spare capacity, right. So, you get them at a significant discount, maybe a 5-10x cheaper price. And the downside is that they can go away at any time. They can be preempted when real demand shows up. So, what we showed is that it's possible to train such massive models at the same performance, despite these being spot VMs spread over a commodity network, without custom InfiniBand and so on. So that's another example of how you can bring down the cost of AI: by reducing constraints on what hardware you need.

Sridhar Vedantham: Muthian, we're kind of reaching the end of the podcast. Is there anything that you want to leave the listeners with, based on your insights and learning from the work that you've been doing?

Muthian Sivathanu: Yeah, so, taking a step back, I think continued innovation in systems and efficiency and costs are going to be crucial to drive the next generation of AI advances, right. And the last 10 years have been huge for deep learning and AI, and the primary reason for that has been the significant advance in both hardware, in terms of the emergence of GPUs and so on, as well as software infrastructure to actually parallelize jobs, run large distributed jobs efficiently, and so on. And if you think about the theory of deep learning, people knew about backpropagation, about neural networks, 25 years ago. And we largely use very similar techniques today. But why have they really taken off in the last 10 years? The main catalyst has been sort of advancement in systems.
And if you look at the trajectory of current deep learning models, the rate at which they are growing larger and larger, systems innovation will continue to be the bottleneck in determining the next generation of advancement in AI.

Sridhar Vedantham: Ok Muthian, I know that we're kind of running out of time now, but thank you so much. This has been a fascinating conversation.

Muthian Sivathanu: Thanks Sridhar, it was a pleasure.

Sridhar Vedantham: Thank you.

OODAcast
Episode 78: Amr Awadallah On The Biggest Challenges And Opportunities in Enterprise IT Today

OODAcast

Play Episode Listen Later Aug 20, 2021 50:51


Amr Awadallah is widely known as a founder of Cloudera. Prior to that he was working on extreme-scale data solutions for Yahoo. Most recently he was VP for Developer Relations at Google Cloud. Amr has a BS in EE from Cairo University, an MS in Computer Engineering from Cairo University, and a PhD in EE from Stanford University. His experiences in tech and company leadership put him in the perfect position to help bring actionable insights to decision-makers today. Topics we discussed include:

- Lessons from his foundational story, which can inform how to inspire the youth of today to continue pursuing their dreams and reaching for deeper understanding of the world and how it works
- The world before scalable data systems and the problems with old approaches to data
- The breakthroughs that came with the approaches detailed in Google's papers on their file system and an approach called MapReduce (see the sketch after this list)
- What Hadoop is
- The Cloudera approach of making Hadoop and related capabilities safe for enterprise use
- The leadership approach at Cloudera
- Advice for founders today
- The biggest challenges and opportunities in enterprise IT today
- Views on the future of cybersecurity
- A discussion on the metaverse and what comes next
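For context on the programming model mentioned in the topic list: the canonical example from the MapReduce paper is word count, where a map function emits (word, 1) pairs and a reduce function sums the counts for each word. Below is a minimal single-process Python sketch of the model; a real MapReduce or Hadoop deployment distributes the same phases across many machines.

    from collections import defaultdict
    from itertools import chain

    def map_phase(document):
        # Emit a (key, value) pair per word, as in the paper's
        # word-count example.
        for word in document.split():
            yield word.lower(), 1

    def reduce_phase(key, values):
        # Combine all values emitted for the same key.
        return key, sum(values)

    def mapreduce(documents):
        # Shuffle: group intermediate pairs by key, then reduce each group.
        groups = defaultdict(list)
        for key, value in chain.from_iterable(map_phase(d) for d in documents):
            groups[key].append(value)
        return dict(reduce_phase(k, vs) for k, vs in groups.items())

    print(mapreduce(["the quick brown fox", "the lazy dog"]))
    # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}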

MLOps.community
How Pinterest Powers Image Similarity // Shaji Chennan Kunnummel // System Design Reviews #1

MLOps.community

Play Episode Listen Later Jun 29, 2021 57:36


In this Machine Learning System Design Review, Shaji Chennan Kunnummel walks us through the system design for Pinterest's near-real-time architecture for detecting similar images. We discuss their usage of Kafka, Flink, RocksDB, and much more. Starting with the high-level requirements for the system, we discussed Pinterest's focus on debuggability and an easy transition from their batch processing system to stream processing. We then touch on the different system interfaces and components involved, such as Manas—Pinterest's custom search engine—and how it all ends up in their custom graph database, downstream Kafka streams, and Pinterest's feature store, Galaxy. With Shaji's expert knowledge of the system, we were able to do a deep dive into the system's architecture and some of its components.

// Experiences

15+ years of experience in software product development. Led multiple teams in a highly agile, collaborative, and cross-functional environment. Designed and implemented highly scalable, fault-tolerant, and optimized distributed systems that scale to handle millions of requests per second. In-depth knowledge of object-oriented programming and design patterns in C++/Java/Python/Golang. Designed and built complex data pipelines and microservices to train and serve machine learning models. Built analytics pipelines for processing and mining high-volume data sets using Hadoop and MapReduce frameworks. In-depth knowledge of distributed storage, consistency models, NoSQL data modeling, and cloud computing environments (AWS and Google Cloud).
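As a rough illustration of the kind of near-real-time similar-image lookup the episode covers: one common building block is locality-sensitive hashing over image embeddings. The sketch below is a generic random-hyperplane LSH toy, not Pinterest's system; in production the embeddings would come from a learned model and the index would live in a persistent store such as RocksDB.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, BITS = 128, 16
    planes = rng.normal(size=(BITS, DIM))   # random hyperplanes for LSH

    def signature(embedding):
        # Embeddings pointing in similar directions fall on the same side
        # of most hyperplanes, so near-duplicates tend to share a bucket.
        bits = (planes @ embedding) > 0
        return bits.tobytes()

    index = {}  # bucket key -> list of image ids (stand-in for a KV store)

    def ingest(image_id, embedding):
        index.setdefault(signature(embedding), []).append(image_id)

    def near_duplicates(embedding):
        return index.get(signature(embedding), [])

    # Toy usage: a slightly perturbed embedding lands in the same bucket.
    e = rng.normal(size=DIM)
    ingest("pin_1", e)
    print(near_duplicates(e + 0.01 * rng.normal(size=DIM)))  # likely ['pin_1']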

Python en español
Python en español #25: Tertulia 2021-03-23

Python en español

Play Episode Listen Later Jun 9, 2021 91:58


Pattern Matching in Python 3.10, Sans IO, and the worldwide Python developers survey https://podcast.jcea.es/python/25
Participants:
Jesús Cea, email: jcea@jcea.es, twitter: @jcea, https://blog.jcea.es/, https://www.jcea.es/. Connecting from Madrid.
Felipem, connecting from Cantabria.
Jesús, connecting from Ferrol.
Víctor Ramírez, twitter: @virako, Python programmer and vim lover, connecting from Huelva.
Eduardo Castro, email: info@ecdesign.es. Connecting from A Guarda.
Audio edited by Pablo Gómez, twitter: @julebek.
The intro and outro music is "Lightning Bugs" by Jason Shaw, published at https://audionautix.com/ under a Creative Commons Attribution 4.0 International License.
[00:52] Warning that the audio has gaps and it may be hard to follow the thread.
[01:07] New people connect, and a schedule change. We want the tertulias to be shorter!
[04:57] Python 3.10: Should we assign homework, and how? Try the new features when they are announced, or wait until they land in production?
[08:19] Introductions.
[10:32] Jesús Cea has already sent the patch for the bug discussed over Christmas: Issue35930: Raising an exception raised in a "future" instance will create reference cycles https://bugs.python.org/issue35930.
[11:12] Self-declared beginners also have a lot to contribute.
[12:21] Unpacking in for loops: PEP 3132 -- Extended Iterable Unpacking: https://www.python.org/dev/peps/pep-3132/. Search the Internet for "Python tuple unpacking". PEP 448 -- Additional Unpacking Generalizations: https://www.python.org/dev/peps/pep-0448/.
[15:57] Python Packaging: You're doing it wrong https://www.youtube.com/watch?v=OeOtIEDFr4Y. https://github.com/astrojuanlu/charla-python-packaging. https://nbviewer.jupyter.org/format/slides/github/astrojuanlu/charla-python-packaging/blob/main/Charla%20Python%20packaging.ipynb#/
[18:07] Machete Mode: https://nedbatchelder.com/blog/202103/machete_mode_tagging_frames.html.
[18:42] Pattern matching in Python 3.10: PEP 622 -- Structural Pattern Matching https://www.python.org/dev/peps/pep-0622/. PEP 634 -- Structural Pattern Matching: Specification https://www.python.org/dev/peps/pep-0634/. PEP 635 -- Structural Pattern Matching: Motivation and Rationale https://www.python.org/dev/peps/pep-0635/. PEP 636 -- Structural Pattern Matching: Tutorial https://www.python.org/dev/peps/pep-0636/. Recurring topic: is complicating the language's syntax worth it?
[22:27] Combine all of that with the walrus operator! (a short example follows these notes): PEP 572 -- Assignment Expressions https://www.python.org/dev/peps/pep-0572/. Video presentation: Pattern Matching in Python 3.10: https://morioh.com/p/aa1e6d5352c3, minute 8:47.
[24:32] Jesús Cea's recurring topics: the accidental complexity of the language. Has Python lost its way? Guido van Rossum https://es.wikipedia.org/wiki/Guido_van_Rossum is backing many controversial changes in Python. The beginning of the end was the implementation of async/await, which split Python into two worlds: synchronous and asynchronous.
[27:02] Code that can work in both the synchronous and the asynchronous world. asyncio library: https://docs.python.org/3/library/asyncio.html. unsync library: https://pypi.org/project/unsync/. inspect.iscoroutinefunction(object): https://docs.python.org/3/library/inspect.html#inspect.iscoroutinefunction. inspect.iscoroutine(object): https://docs.python.org/3/library/inspect.html#inspect.iscoroutine. inspect.isawaitable(object): https://docs.python.org/3/library/inspect.html#inspect.isawaitable. inspect.isasyncgenfunction(object): https://docs.python.org/3/library/inspect.html#inspect.isasyncgenfunction. inspect.isasyncgen(object): https://docs.python.org/3/library/inspect.html.
[29:12] The good thing about the difficulty of mixing the synchronous and asynchronous worlds is that a movement has emerged to decouple protocols from the communication medium itself. Sans IO: https://sans-io.readthedocs.io/. State machine: https://es.wikipedia.org/wiki/M%C3%A1quina_de_estados.
[33:15] How to write obfuscated python https://archive.org/details/pyvideo_398___how-to-write-obfuscated-python.
[33:52] Security on PyPI https://pypi.org/: New packaging security funding & NYU https://discuss.python.org/t/new-packaging-security-funding-nyu/7792. PEP 458 -- Secure PyPI downloads with signed repository metadata https://www.python.org/dev/peps/pep-0458/. It makes it possible to run PyPI https://pypi.org/ mirrors without having to trust them.
[36:27] Worldwide Python developers survey: Python Developers Survey 2020 Results https://www.jetbrains.com/lp/python-developers-survey-2020/. We are not going to describe every survey answer in these notes, but we list points and links that may be of interest.
[40:32] Using Python in the web browser: Brython http://www.brython.info/.
[44:42] Wouldn't it be cool to mix Python and JavaScript code and have them call each other?
[45:42] Precedent: Python and Java: Jython https://www.jython.org/. WAR: https://en.wikipedia.org/wiki/WAR_(file_format).
[47:42] Python support in browsers. PyXPCOM: https://developer.mozilla.org/pl/docs/PyXPCOM. WebAssembly: https://es.wikipedia.org/wiki/WebAssembly. asm.js: https://en.wikipedia.org/wiki/Asm.js. Emscripten: https://emscripten.org/. JavaScript in JavaScript: Polyfill https://en.wikipedia.org/wiki/Polyfill_(programming). Pyodide https://pyodide.org/en/stable/index.html.
[59:37] There is still a lot of outdated documentation online, with Python 2 examples.
[01:00:42] Connection drop. Supposedly more people were recording the tertulia, but it later turned out they weren't.
[01:04:12] pipenv https://pypi.org/project/pipenv/.
[01:09:22] The features people want in Python clash with what is most valued about it...
[01:11:32] The FastAPI documentation https://fastapi.tiangolo.com/ is fantastic and you can learn a lot of web concepts from it. REST: https://en.wikipedia.org/wiki/Representational_state_transfer.
[01:18:07] Book "Modern Tkinter for Busy Python Developers" https://tkdocs.com/book.html.
[01:19:12] PDF https://es.wikipedia.org/wiki/PDF generation libraries don't show up in the survey. Some suggestions: Reportlab https://pypi.org/project/reportlab/. PyPDF3 https://pypi.org/project/PyPDF3/. weasyprint https://pypi.org/project/weasyprint/.
[01:21:52] Data persistence technologies are not represented in the survey results.
[01:22:22] Tortoise ORM https://tortoise-orm.readthedocs.io/en/latest/index.html is asynchronous.
[01:22:47] SQLite https://sqlite.org/ is perfect if you want SQL https://es.wikipedia.org/wiki/SQL but only one program uses the database.
[01:26:42] Map/Reduce https://es.wikipedia.org/wiki/MapReduce. Manta: Triton's object storage and converged analytics solution https://apidocs.joyent.com/manta/.
[01:27:32] We left off halfway through the survey: Technologies and Cloud.
[01:28:22] We tried to coordinate access to the second audio recording of the tertulia. Unfortunately it didn't work out.
[01:29:22] Still pending: discussing how publishing the tertulias as a podcast is going.
[01:30:17] New schedule!
[01:31:05] End.
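As a quick illustration of the 3.10 features debated in this episode, here is a small, self-contained sketch combining structural pattern matching (PEP 634) with the walrus operator (PEP 572). The command format is invented for the example, and it requires Python 3.10 or later.

    # Sketch: PEP 634 structural pattern matching plus the PEP 572 walrus
    # operator (Python 3.10+). The command format is invented.
    def handle(command):
        match command.split():
            case ["load", filename]:
                return f"loading {filename}"
            case ["move", x, y] if (dist := abs(int(x)) + abs(int(y))) <= 10:
                # The guard binds dist with := and only matches short moves.
                return f"moving {dist} steps"
            case ["quit" | "exit"]:
                return "bye"
            case _:
                return "unknown command"

    print(handle("move 3 4"))  # -> moving 7 steps
    print(handle("exit"))      # -> bye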

Data Engineer & A.I
Entender Map Reduce

Data Engineer & A.I

Play Episode Listen Later May 29, 2021 10:19


MapReduce, Big Data analysis
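Since the episode is an introduction to the model, here is a minimal sketch of the map/reduce idea in plain Python: a word count with explicit map, shuffle, and reduce phases. It is a single-process illustration of the programming model, not a distributed implementation.

    # Minimal single-process illustration of the MapReduce model: word count
    # with explicit map, shuffle (group by key), and reduce phases.
    from collections import defaultdict

    documents = ["big data analysis", "map reduce for big data"]

    # Map: emit (word, 1) pairs.
    mapped = [(word, 1) for doc in documents for word in doc.split()]

    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for word, count in mapped:
        groups[word].append(count)

    # Reduce: aggregate the values for each key.
    counts = {word: sum(values) for word, values in groups.items()}
    print(counts)  # {'big': 2, 'data': 2, 'analysis': 1, ...}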

Python en español
Python en español #17: Tertulia 2021-01-26

Python en español

Play Episode Listen Later May 18, 2021 133:45


Eduardo Castro cuts loose and invites us to discuss non-obvious tricks and idiomatic constructions https://podcast.jcea.es/python/17
Participants:
Jesús Cea, email: jcea@jcea.es, twitter: @jcea, https://blog.jcea.es/, https://www.jcea.es/. Connecting from Madrid.
Eduardo Castro, email: info@ecdesign.es. Connecting from A Guarda.
Javier, connecting from Madrid.
Víctor Ramírez, twitter: @virako, Python programmer and vim lover, connecting from Huelva.
Dani, connecting from Málaga.
Miguel Sánchez, email: msanchez@uninet.edu, connecting from the Canary Islands.
Jorge Rúa, connecting from Vigo.
Audio edited by Pablo Gómez, twitter: @julebek.
The intro and outro music is "Lightning Bugs" by Jason Shaw, published at https://audionautix.com/ under a Creative Commons Attribution 4.0 International License.
[00:52] Killing time until more people join. Raspberry Pi Pico: https://www.raspberrypi.org/products/raspberry-pi-pico/. Jesús Cea is delighted with its supply voltage range. MicroPython: https://www.micropython.org/.
[06:02] Trick: python -i: runs a script and then drops into interactive mode. You can also do it from the code itself with code.InteractiveConsole(locals=globals()).interact(). Jesús Cea complains that with the in-code invocation, line editing doesn't work. Javier gives the right hint: for it to work, just do import readline before launching the interactive mode.
[11:17] Regression with ipdb: https://pypi.org/project/ipdb/.
[12:37] New version of Pyston https://www.pyston.org/, a faster Python interpreter. 50% faster than CPython.
[16:22] Checking whether two dates are equal with datetime https://docs.python.org/3/library/datetime.html. Always work in UTC https://es.wikipedia.org/wiki/Tiempo_universal_coordinado, even if you only have one time zone.
[19:52] Jesús Cea has looked into how HTTP POSTs work with CSRF protections https://es.wikipedia.org/wiki/CSRF. Good practice: the response to a POST is a redirect to a GET. Post/Redirect/Get (PRG) pattern https://es.wikipedia.org/wiki/Post/Redirect/Get. Advantages of using a framework.
[24:32] Optimizations when you have large amounts of data? Very broad topic; the details of the problem matter. Some ideas offered: Map/Reduce: https://en.wikipedia.org/wiki/Map_reduce. Use generators or other "lazy" constructions whenever possible: https://wiki.python.org/moin/Generators.
[31:52] Memory management in Python. Design of CPython's Garbage Collector: https://devguide.python.org/garbage_collector/. "Hora de sacar la basura" ("Time to take out the garbage"), garbage collector - Pablo Galindo and Victor Terrón - PyConES 2018 https://www.youtube.com/watch?v=G9wOSExzs5g.
[35:17] Typography for programmers: Victor Mono: https://rubjo.github.io/victor-mono/. Fira Code: https://fonts.google.com/specimen/Fira+Code. Fira Code Retina: https://github.com/tonsky/FiraCode/issues/872.
[37:17] Eduardo Castro has put together a list of simple but interesting tricks. In these notes we only reference the points we spent the most time on; more was discussed. The document for following along with the recording is at https://demo.hedgedoc.org/s/hEZB92q40#. hash(float('inf')) -> 314159.
[43:02] LRU cache: "blame".
[01:33:57] Uses of lambda. operator module: https://docs.python.org/3/library/operator.html.
[01:35:52] Some additional quick tricks. collections.deque: https://docs.python.org/3/library/collections.html. dateutil: https://pypi.org/project/python-dateutil/. itertools: https://docs.python.org/3/library/itertools.html.
if a < x < b:

    >>> import dis
    >>> dis.dis(lambda x: a < x < b)
      1           0 LOAD_GLOBAL              0 (a)
                  2 LOAD_FAST                0 (x)
                  4 DUP_TOP
                  6 ROT_THREE
                  8 COMPARE_OP               0 (<)
                 10 JUMP_IF_FALSE_OR_POP    18
                 12 LOAD_GLOBAL              1 (b)
                 14 COMPARE_OP               0 (<)
                 16 RETURN_VALUE
            >>   18 ROT_TWO
                 20 POP_TOP
                 22 RETURN_VALUE

Complex unpacking:

    >>> a, b, (c, d), *e, f = 1, 2, (3, 4), 5, 6, 7, 8, 9
    >>> print(a,b,c,d,e,f)
    1 2 3 4 [5, 6, 7, 8] 9

Use the "underscore" variable to discard values. Watch out for internationalization.
[01:56:22] Python has more and more "gotchas". Some examples (see the short sketch after these notes): The walrus operator, covered at length in previous tertulias. Mutable default parameters. Defining "closures" inside a for loop but using them outside. One-element tuples. The tuple() constructor is more obvious, but beware: tuple('abc') -> ('a', 'b', 'c').
[02:01:06] Done with the tricks!
[02:01:37] Ideas for indexing and searching documents: Whoosh: https://whoosh.readthedocs.io/en/latest/intro.html. Solr: https://solr.apache.org/.
[02:04:22] Homework for the future: the dis https://docs.python.org/3/library/dis.html and enum https://docs.python.org/3/library/enum.html modules.
[02:04:47] A tip on computer vision: https://www.pyimagesearch.com/. Among the best out there.
[02:06:47] regex https://pypi.org/project/regex/, which releases the GIL https://en.wikipedia.org/wiki/Global_interpreter_lock.
[02:07:47] Accelerator and distribution of Python programs precompiled to binary and packaged in a directory or even a single file: Nuitka: https://nuitka.net/.
[02:08:57] Design of CPython's Garbage Collector: https://devguide.python.org/garbage_collector/.
[02:09:17] Wrap-up.
[02:10:52] We almost forgot the legal notice about recording and publishing the sessions.
[02:12:55] End.
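Here is a short sketch of two of the gotchas mentioned at [01:56:22], both classic sources of confusion:

    # Gotcha 1: a mutable default argument is created once, at function
    # definition time, and shared across all calls.
    def append_bad(item, acc=[]):
        acc.append(item)
        return acc

    print(append_bad(1))  # [1]
    print(append_bad(2))  # [1, 2]  <- same list as the first call

    # Gotcha 2: closures defined inside a for loop capture the variable,
    # not its value at each iteration.
    fns = [lambda: i for i in range(3)]
    print([f() for f in fns])  # [2, 2, 2], not [0, 1, 2]

    # Common fix: bind the current value as a default argument.
    fns_ok = [lambda i=i: i for i in range(3)]
    print([f() for f in fns_ok])  # [0, 1, 2]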

Serverless Chats
Episode #96: Serverless and Machine Learning with Alexandra Abbas

Serverless Chats

Play Episode Listen Later Apr 12, 2021 44:19


About Alexa Abbas
Alexandra Abbas is a Google Cloud Certified Data Engineer & Architect and an Apache Airflow contributor. She currently works as a Machine Learning Engineer at Wise. She has experience with large-scale data science and engineering projects. She spends her time building data pipelines using Apache Airflow and Apache Beam and creating production-ready machine learning pipelines with TensorFlow.
Alexandra was a speaker at Serverless Days London 2019 and presented at the TensorFlow London meetup.
Personal links:
Twitter: https://twitter.com/alexandraabbas
LinkedIn: https://www.linkedin.com/in/alexandraabbas
GitHub: https://github.com/alexandraabbas
datastack.tv's links:
Web: https://datastack.tv
Twitter: https://twitter.com/datastacktv
YouTube: https://www.youtube.com/c/datastacktv
LinkedIn: https://www.linkedin.com/company/datastacktv
GitHub: https://github.com/datastacktv
Link to the Data Engineer Roadmap: https://github.com/datastacktv/data-engineer-roadmap
This episode is sponsored by CBT Nuggets: cbtnuggets.com/serverless and Stackery: https://www.stackery.io/
Watch this video on YouTube: https://youtu.be/SLJZPwfRLb8
Transcript
Jeremy: Hi, everyone. I'm Jeremy Daly, and this is Serverless Chats. Today I'm joined by Alexa Abbas. Hey, Alexa, thanks for joining me.
Alexa: Hey, everyone. Thanks for having me.
Jeremy: So you are a machine learning engineer at Wise and also the founder of datastack.tv. So I'd love it if you could tell the listeners a little bit about your background and what you do at Wise and what datastack.tv is all about.
Alexa: Yeah. So as you said, I'm a machine learning engineer at Wise. Wise is an international money transfer service. We are aiming for very transparent fees and very low fees compared to banks. At Wise, I'm basically designing, maintaining, and developing the machine learning platform, which serves data scientists and analysts, so they can train their models and deploy their models easily.
Datastack.tv is, basically, a video service, or a video platform, for data engineers. We create bite-sized educational videos for data engineers. We mostly cover open source topics, because we noticed that some of the open source tools in the data engineering world are quite underserved in terms of educational content. So we create videos about those.
Jeremy: Awesome. And then, what about your background?
Alexa: So I actually worked as a data engineer and machine learning engineer, so I've always been a data engineer or machine learning engineer in terms of roles. I also worked, for a small amount of time, as a data scientist as well. In terms of education, I did a big data engineering Master's, but actually my Bachelor's is in economics, so quite a mix.
Jeremy: Well, it's always good to have a ton of experience and that diverse perspective. Well, listen, I'm super excited to have you here, because machine learning is one of those things where it probably is more of a buzzword, I think, to a lot of people, where every startup puts it in their pitch deck, like, "Oh, we're doing machine learning and artificial intelligence ..." stuff like that.
But I think it's important to understand, one, what exactly it is, because I think there's a huge confusion there in terms of what we think of as machine learning, and maybe we think it's more advanced than it is sometimes, as I think there are lower versions of machine learning that can be very helpful. And obviously, this being a serverless podcast, I've heard you speak a number of times about the work that you've done with machine learning and some experiments you've done with serverless there. So I'd love to just pick your brain about that and just see if we can educate the users here on what exactly machine learning is, how people are using it, and where it fits in with serverless and some of the use cases and things like that. So first of all, I think one of the important things to start with anyways is this idea of MLOps. So can you explain what MLOps is?
Alexa: Yeah, sure. So really short, MLOps is DevOps for machine learning. In traditional software engineering projects, you have a streamlined process and you can release really often, really quickly, because you already have all these best practices that all these traditional software engineering projects implement. Machine learning is still in a quite early stage, and MLOps is in a quite early stage. But what we try to do in MLOps is streamline machine learning projects, just as traditional software engineering projects are streamlined. So data scientists can train models really easily, and they can release models really frequently and really easily into production. So MLOps is all about streamlining the whole data science workflow, basically.
And I guess it's good to understand what the data science workflow is, so I'll talk a bit about that as well. Before actually starting any machine learning project, the first phase is an experimentation phase. It's a really iterative process where data scientists are looking at the data, trying to find features, and also training many different models; they are doing architecture search, trying different architectures, trying different hyperparameter settings with those models. So it's a really iterative process of trying many models, many features.
And then by the end, they probably find a model that they like and that hits the benchmark that they were looking for, and then they are ready to release that model into production. And this usually looks like ... so sometimes they use shadow models, in the beginning, to check if the results are as expected in production as well, and then they actually release into production. So basically MLOps tries to create the infrastructure and the processes that streamline this whole process, the whole life cycle.
Jeremy: Right. So the question I have is, so if you're an ML engineer or you're working on these models and you're going through these iterations and stuff, so now you have this, you're ready to release it to production, so why do you need something like an MLOps pipeline? Why can't you just move that into production? Where's the barrier?
Alexa: Well, I guess ... I mean, to be honest, the thing is there shouldn't be a barrier. Right now, that's the whole goal of MLOps. They shouldn't feel that they need to do any manual model artifact copying or anything like that. They just, I don't know, press a button and they can release to production. So that's what MLOps is about really, and we can version models, we can version the data, things like that. And we can create reproducible experiments.
So I guess right now, many bits in this whole lifecycle are really manual, and that could be automated. For example, releasing to production is sometimes a manual thing. You just copy a model artifact to a production bucket or whatever. So sometimes we would like to automate all these things.
Jeremy: Which makes a lot of sense. So then, in terms of actually implementing this stuff, because we hear all the time about CI/CD. If we're talking about DevOps, we know that there's all these tools that are being built and services that are being launched that allow us to quickly move code through some process and get into production. So are there similar tools for deploying models and things like that?
Alexa: Well, I think this space is quite crowded. It's getting more and more crowded. There are the cloud providers, who are trying to create tools that help these processes, and there are also many third-party platforms that are trying to create the ML platform that everybody uses. So I think there is no go-to thing that everybody uses; there are many tools that we can use.
Some examples: TensorFlow is a really popular machine learning library, and they created a package on top of TensorFlow called TFX, TensorFlow Extended, which is exactly for streamlining this process and serving models easily. So I would say TFX is a really good example. There is Kubeflow, which is a machine learning toolkit for Kubernetes. I think there are also many custom in-house implementations at many companies; they create their own machine learning platforms, their own model serving APIs, things like that. And the cloud providers: on AWS, we have SageMaker. They are trying to cover many parts of the data science lifecycle. And on Google Cloud, we have AI Platform, which is really similar to SageMaker.
Jeremy: Right. And what are you doing at Wise? Are you using one of those tools? Are you building something custom?
Alexa: Yeah, it's a mix actually. We have some custom bits. We have a custom API, a serving API, for serving models. But for model training, we are using many things. We are using SageMaker Notebooks. And we are also experimenting with SageMaker endpoints, which are actually serverless model serving endpoints. And we are also using EMR for model training and data preparation, so some Spark-based things, a bit more traditional type of model training. So it's quite a mix.
Jeremy: Right. Right. So I am not well-versed in machine learning. I know just enough to be dangerous. And so I think that what would be really interesting, at least for me, and hopefully be interesting to listeners as well, is just talk about some of these standard tools. So you mentioned things like TensorFlow and then Kubeflow, which I guess is that end-to-end piece of it, but if you're ... Just how do you start? How do you go from, I guess, building and training a model to then productizing it and getting that out? What's that whole workflow look like?
Alexa: So, actually, in the data science workflow I mentioned, the first bit is that experimentation, which is really iterative, really free, so you just try to find a good model. And then, when you've found a good model architecture and you know that you are going to receive new data, let's say, I don't know, every day or whatever, every week, then you need to build out a retraining pipeline.
And that is, I think, what the productionization of a model really means: that you can build a retraining pipeline, which can automatically pick up new data and then prepare that new data, retrain the model on that data, and release that model into production automatically. So I think that's what productionization really means.
Jeremy: Right. Yeah. And so by being able to build and train a model and then having that process where you're getting that feedback back in, is that something where you're just taking that data and assuming that that is right and fits in the model, or is there an ongoing testing process? Is there supervised learning? I know that's a buzzword. I'm not even sure what it means. But those ... I mean, what types of things go into that retraining of the models? Is it something that is just automatic, or is it something where you need constant, babysitting's probably the wrong word, but somebody to be monitoring that on a regular basis?
Alexa: So monitoring is definitely necessary. Especially, I think, when you've trained your model, you shouldn't release automatically into production just because you've trained on new data. I mentioned this shadow model thing a bit. Usually, after you retrain the model in this retraining pipeline, you release that model into shadow mode; you serve that model in parallel to your actual production model, and then you check the results from your new model against your production model. And that's a manual thing, or maybe you can automate it as well, actually. So if it is comparable with your production model, or if it's even better, then you replace it.
And also, in terms of the data quality in the beginning, you should definitely monitor that. And I think that's quite custom; it really depends on what kind of data you work with. So it's really important to test your data. I mean, there are many ... This space is also quite crowded. There are many tools that you can use to monitor the distribution of your data and check that the new data actually corresponds to your already existing data set. So there are many bits that you can monitor in this whole retraining pipeline, and you should monitor.
Jeremy: Right. Yeah. And so, I think of some machine learning use cases like sentiment analysis, for example... looking at tweets or looking at customer service conversations and trying to rate those things. So when you say monitoring or running them against a shadow model, is that something where ... I mean, how do you gauge what's better, right? If you've got a shadow... I mean, what's the success metric there, as to say X number were classified as positive versus negative sentiment? Is that something that requires human review or some sampling for you to kind of figure out the quality of the success of those models?
Alexa: Yeah. So actually, I think that really depends on the use case. For example, when you are trying to catch fraudsters, your false positive rate and true positive rate are really important. If your true positive rate is higher, that means you are catching more fraudsters. But let's say with your new model the false positive rate is also higher, which means that you are catching more people who are actually not fraudsters, and you have more work, because I guess that's a manual process to actually check those people. So I think it really depends on the use case.
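As an illustration of the retraining pipeline Alexa describes above, here is a minimal Apache Airflow sketch with the usual stages. The task bodies are placeholders, and the names and schedule are assumptions for the example, not Wise's (or anyone's) actual pipeline.

    # Minimal sketch of a daily retraining pipeline in Apache Airflow (2.x).
    # Task bodies are placeholders; names and schedule are assumptions.
    from datetime import datetime
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    def prepare_data():
        pass  # pull and validate the newest data (placeholder)

    def train_model():
        pass  # fit the model on the prepared data (placeholder)

    def evaluate_model():
        pass  # compare metrics against the current production model (placeholder)

    def deploy_shadow():
        pass  # release as a shadow model rather than straight to production

    with DAG(
        dag_id="model_retraining",    # hypothetical name
        start_date=datetime(2021, 1, 1),
        schedule_interval="@daily",   # assumed cadence
        catchup=False,
    ) as dag:
        prepare = PythonOperator(task_id="prepare_data", python_callable=prepare_data)
        train = PythonOperator(task_id="train_model", python_callable=train_model)
        evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)
        shadow = PythonOperator(task_id="deploy_shadow", python_callable=deploy_shadow)
        prepare >> train >> evaluate >> shadow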
Jeremy: Right. Right. And you also said that the market's a little bit flooded and, I mean, I know of SageMaker and then, of course, there's all these tools like, what's it called, Rekognition, a bunch of things at AWS, and then Google has a whole bunch of the Vision API and some of these things, and Watson's Natural Language Processing over at IBM and some of these things. So there's all these different tools that are just available via an API, which is super simple and great for people like me that don't want to get into building TensorFlow models and things like that. So is there an advantage to building your own models beyond those things, or are we getting to a point where with things like ... I mean, again, I know SageMaker has a whole library of models that are already built for you and things like that. So are we getting to a point where some of these models are just good enough off the shelf, or do we really still need ... And I know there are probably some custom things. But do we still really need to be building our own models around that stuff?
Alexa: So to be honest, I think most data scientists are using off-the-shelf models, maybe not the serverless API type of models that Google has, but just off-the-shelf TensorFlow models or SageMaker's; they have these built-in containers for some really popular model architectures like XGBoost, and I think most people don't tweak these, I mean, as far as I know. I think they just use them out of the box, and they really try to tweak the data instead, the data that they have, and try to feed these off-the-shelf models with higher and higher quality data.
Jeremy: So shape the data to fit the model, as opposed to the model to fit the data.
Alexa: Yeah, exactly. Yeah. So you don't actually have to know ... You don't have to know how those models work exactly. As long as you know what the input should be and what output you expect, then I think you're good to go.
Jeremy: Yeah, yeah. Well, I still think that there's probably a lot of value in tuning the models though against your particular data sets.
Alexa: Yeah, right. But also there are services for hyperparameter tuning. There are services even for neural architecture search, where they try a lot of different architectures for your data specifically, and then they will tell you what is the best model architecture that you should use, and the same for the hyperparameter search. So these can be automated as well.
Jeremy: Yeah. Very cool. So if you are hosting your own version of this ... I mean, maybe we'll go back to the MLOps piece of this. So I would assume that a data scientist doesn't want to be responsible for maintaining the servers or the virtual machines or whatever it is that it's running on. So you want to have this workflow where you can get your models trained, you can get them into production, and then you can run them through this loop you talked about and be able to tweak them and continue to retrain them as things go through. So on the other side of that wall, if we want to put it that way, you have your ops people that are running this stuff. Is there something specific that ops people need to know? How much do they need to know about ML, as opposed to ... I mean, the data scientists, hopefully, they know more. But in terms of running it, what do they need to know about it, or is it just a matter of keeping a server up and running?
Alexa: Well, I think ... So I think machine learning pipelines are not yet as standardized as traditional software engineering pipelines.
So I would say that you have to have some knowledge of machine learning, or at least some understanding of how this lifecycle works. You don't actually need to know about research and things like that, but you need to know how this whole lifecycle works in order to work as an ops person who can automate this. But I think software engineering skills and DevOps skills are the base, and then you can just build this knowledge on top of that. So I think it's actually quite easy to pick this up.
Jeremy: Yeah. Okay. And what about, I mean, you mentioned this idea that a lot of data scientists aren't actually writing the models, they're just using the preconfigured models. So I guess that begs the question: How much can just a regular person ... So let's say I'm just a regular developer, and I say, "I want to start building machine learning tools." Is it as easy as just pulling a model off the shelf and then just learning a little bit more about it? How much can the average person do with some of these tools out of the box?
Alexa: So I think most of the time, it's that easy, because usually the use cases that someone tries to tackle are not super edge cases. So for those use cases, there are already models which perform really well. Especially if you are talking about, I don't know, supervised learning on tabular data, I think you can definitely find models that will perform really well off the shelf on those types of datasets.
Jeremy: Right. And if you were advising somebody who wanted to get started... I mean, because I think where it might come down to is going to be things like pricing. If you're using the Vision API and you're maybe limited on your quota, and then you can ... if you're paying however many cents per, I guess, lookup or inference, then that can get really expensive, as opposed to potentially running your own model on something else. But how would you suggest that somebody get started? Would you point them at the APIs, or would you want to get them up and running on TensorFlow or something like that?
Alexa: So I think, actually, for a developer, just using an API would be super easy. Getting started with those APIs just to understand the concepts is very useful, but I think getting started with TensorFlow itself or just Keras, I would definitely recommend that, or just use scikit-learn, which is a more basic package for more basic machine learning. So those are really good starting points. And there are so many tutorials to get started with, and if you have an idea of what you would like to build, then I think you will definitely find tutorials which are similar to your own use case, and you can just use those to build your custom pipeline or model. So I would say, for developers, I would definitely recommend jumping into TensorFlow or scikit-learn or XGBoost or things like that.
Jeremy: Right, right. And how many of these models exist? I mean, are we talking there's 20 different models, or are we talking there's 20,000 models?
Alexa: Well, I think ... Wow. Good question. I think we are more towards, maybe not 20,000, but definitely many thousands. But there are popular models that most people use, and I think there are maybe 50 or 100 models that are the most popular, and most companies use them, and you are probably fine just using those for any use case or most of the use cases.
Jeremy: Right.
Now, speaking of use cases, so, again, I try to think of use cases for machine learning, whether it's classifying movies into genres or sentiment analysis, like I said, or maybe trying to classify news stories, things like that. Fraud detection, you mentioned. Those are all great use cases, but what are ... I know you've worked on a bunch of projects. So what are some of the projects that you've done, and what were the use cases that were being solved there? Because I find these to be really interesting.
Alexa: Yeah. So I think a nice project that I worked on was a project with Lush, which is a cosmetics company. They manufacture soaps and bath bombs. And they have this nice mission that they would like to eliminate packaging from their shops. So when I worked at Datatonic, we worked on a small project with them. They asked us to train an image recognition model and then create a retraining pipeline that they can use afterwards. So they provided us with many hundred thousand images of their products, and they made photos from different angles with different lightings and all of that, so a really high-quality image data set of all their products.
And then, we used a MobileNet model, because they wanted this model to be built into their mobile application. So when users actually use this model, they download it with their mobile application. And then, they created a service called Lush [inaudible], which you can use from within their app. And then, people can just scan the products and they can see the ingredients and how-to-use guides and things like that. So this is how they are trying to eliminate all kinds of packaging from their shops, so they don't actually need to put papers there or packaging with ingredients and things like that.
And in terms of what we did on the technical side, as I mentioned, we used a MobileNet model, because we needed to quantize the model in order to put it on a mobile device. And we used TF Lite to do this. TF Lite is specifically for models that you want to run on an edge device, like a mobile phone. So that was already a constraint. So this is how we picked a model. I think, back then, there were only a few model architectures supported by TF Lite, maybe only two. So we picked MobileNet, because it had a smaller size.
And then, in terms of the retraining, we automated the whole workflow with Cloud Composer on Google Cloud, which is a managed version of Apache Airflow, the open source scheduling package. The training happened on AI Platform, which is Google Cloud's SageMaker.
Jeremy: Yeah.
Alexa: Yeah. And what else? We also had an image pre-processing step just before the training, which happened on Dataflow, which is an auto-scaling processing service on Google Cloud. And after we trained the model, we just saved the model artifact in a bucket, and then ... I think we also monitored the performance of the model, and if it was good enough, then we just shipped the model to the developers, who manually updated the model file that went into the application that people can download. So we didn't really see if they used any shadow model thing or anything like that.
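For reference, the TF Lite conversion step described above looks roughly like the sketch below. The model here is a stand-in, not the actual Lush MobileNet, and the optimization settings are an assumption for illustration.

    # Rough sketch: converting a Keras model to TensorFlow Lite for a
    # mobile app. The model is a stand-in, not Lush's actual MobileNet.
    import tensorflow as tf

    model = tf.keras.applications.MobileNet(weights=None)  # placeholder model

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables quantization
    tflite_model = converter.convert()

    with open("model.tflite", "wb") as f:
        f.write(tflite_model)  # this file ships inside the mobile app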
Jeremy: Right. Right. And I think that is such a cool use case, because, if I'm hearing you right, there's just a bar of soap or something like that with no packaging, no nothing, and you just hold your mobile phone camera up to it, it looks at it, determines which particular product it is, gives you all that ... so no QR codes, no bar codes, none of that stuff. How did they ring them up though? Do you know how that process worked? Did the employees just have to know what they were, or did the employees use the app as well to figure out what they were billing people for?
Alexa: Good question. So I think they wanted the employees as well to use the app.
Jeremy: Nice.
Alexa: Yeah. But when the app was wrong, then I don't know what happened.
Jeremy: Just give them a discount on it or something like that. That's awesome. And that's the thing you mentioned there about ... Was it Tensor Lite, was it called?
Alexa: TF Lite. Yeah.
Jeremy: TF Lite. Yes. TensorFlow Lite or TF Lite. But, basically, that idea of being able to really package a model and get it to be super small like you said. You said edge devices, and I'm thinking serverless compute at the edge, I'm thinking Lambda functions. I'm thinking other ways that if you could get your models small enough in package, that you could run it. But that'd be a pretty cool way to do inference, right? Because, again, even if you're using edge devices, if you're on an edge network or something like that, if you could do that at the edge, that'd be a pretty fast response time.
Alexa: Yeah, definitely. Yeah.
Jeremy: Awesome. All right. So what about some other stuff that you've done? You've mentioned some things about fraud detection and things like that.
Alexa: Yeah. So fraud detection is a use case for Wise. As I mentioned, international money transfer is one of Wise's services. So, obviously, if you are doing anything with money, then fraud is a use case you will have for sure. So, I mean, in terms of ... I don't actually develop models at Wise, so I don't know exactly what models they use. I know that they use H2O, which is a Spark-based library that you can use for model training. I think it's quite an advanced library, but I haven't used it myself too much, so I cannot talk about that too much.
But in terms of the workflow, it's quite similar. We also have Airflow to schedule the retraining of the models. And they use EMR for data preparation, quite similar to Dataflow in a sense: a Spark-based auto-scaling cluster that processes the data. And they train the models on EMR as well, using this H2O library. And then in the end, when they are happy with the model, we have this tool that they can use for releasing shadow models into production. And then, if they are satisfied with the performance of the model, they can actually release it into production. And at Wise, we have a custom microservice, a custom API, for serving models.
Jeremy: Right. Right. And that sounds like you need a really good MLOps flow to make all that stuff work, because you just have a lot of moving parts there, right?
Alexa: Yeah, definitely. Also, I think we have many bits that could be improved. I think there are many bits that are still a bit manual and not streamlined enough. But I think most companies struggle with the same thing. We just don't yet have those best practices that we can implement, so many people try many different things, and then ... Yeah, so I think it's still a work in progress.
Jeremy: Right. Right.
And I'm curious if your economics background helps at all with the fraud and the money laundering stuff?
Alexa: No.
Jeremy: No. All right. So what about ... you worked on another data engineering project for Vodafone, right?
Alexa: Yeah. Yeah, so that was a purely data engineering project, so we didn't do any machine learning. Well, Vodafone has their own Google Analytics library that they use in all their websites and mobile apps and things like that, and that sends clickstream data to a server in a Google Cloud Platform project, and we consumed that data in a streaming manner from Dataflow. So, basically, the project was really about processing this data by writing an Apache Beam pipeline, which was always on and always expected messages to come in. And then, we dumped all the data into BigQuery tables, which is the data warehouse on Google Cloud. And then, these BigQuery tables powered some of the dashboards that they use to monitor the uptime and, I don't know, different metrics for their websites and mobile apps.
Jeremy: Right. But collecting all of that data is a good source for doing machine learning on top of that, right?
Alexa: Yeah, exactly. Yeah. I think they already had some use cases in mind. I'm not sure if they actually did those or not, but it's a really good base for machine learning that we collected the data there in BigQuery, because that is an analytical data warehouse, so analysts can already start to explore the data as a first step of the machine learning process.
Jeremy: Right. I would think anomaly detection and things like that, right?
Alexa: Yeah, exactly.
Jeremy: Right. All right. Well, so let's go on and talk about serverless a little bit more, because I know I saw you do a talk where you ran some experiments with serverless. And so, I'm just kind of curious, where are the limitations that you see? And I know that there continues ... I mean, we now have EFS integration, and we've got 10 gigs of memory for Lambda functions, you've even got Cloud Run, which I don't know how much you could do with that, but where are still some of the limitations for running machine learning in a serverless way, I guess?
Alexa: So I think, actually, for many bits of this data science lifecycle, cloud providers offer a lot of serverless options. For data preparation, there is Dataflow, which is a kind of serverless data processing service, so you can use that for data processing. For model training, there are SageMaker and AI Platform, which are kind of serverless, because you don't actually need to provision the clusters that you train your models on. And for model serving, in SageMaker, there are the serverless model endpoints that you can deploy. So there are many options, I think, for serverless in the machine learning lifecycle.
In my experience, many times, it's a cost thing. For example, at Wise, we have this custom model serving API, where we serve all our models. If we used SageMaker endpoints, I think a single SageMaker endpoint is about $50 per month; that's the minimum price, and that's for a single model and a single endpoint. And if you have hundreds of models, or maybe thousands, then your price can go up pretty quickly. So I think, in my experience, the limitation could be just price.
But in terms of ... So I think, for example, if I compare Dataflow with a Spark cluster that you program yourself, then I would definitely go with Dataflow.
I think it's just much easier, and maybe cost-wise as well you might be better off, I'm not sure. But in terms of comfort and developer experience, it's a much better experience.
Jeremy: Right. Right. And so, we talked a little bit about TF Lite there. Is that something possible where maybe the training piece of it, running that on Functions as a Service or something like that, maybe isn't the most efficient or cost-effective way to do that, but what about running models or running inference on something like a Lambda function or a Google Cloud Function or an Azure Function or something like that? Is it possible to package those models in a way that's small enough that you could do that type of workload?
Alexa: I think so. Yeah. I think you can definitely make inference using a Lambda function. But in terms of model training, I think that's not a ... Maybe there were already experiments for that, I'm sure there were. But I think it's not the kind of workload that would fit Lambda functions. That's a typical parallelizable, really large-scale workload, you know, the MapReduce type of data processing workloads? I think those are not necessarily a fit for Lambda functions. So I think for model training and data preparation, maybe those are not the best options, but for model inference, definitely. And I think there are many examples using Lambda functions for inference.
Jeremy: Right. Now, do you think that ... because this is always something where I find with serverless, and I know you're more of a data scientist, ML expert, but I look at serverless and I question whether or not it needs to handle some of these things. Especially with some of the endpoints that are out there now, we talked about the Vision API and some of the other NLP things, are we putting in too much effort maybe to try to make serverless be able to handle these things, or is it just something where there's a really good way to handle these by hosting your ... I mean, even if you're doing SageMaker, maybe not SageMaker endpoints, but just running SageMaker machines to do it or whatever, are we trying too hard to squeeze some of these things into a serverless environment?
Alexa: Well, I don't know. I think, as a developer, I definitely prefer the more managed versions of these products. So the less I need to bother with "Oh, my cluster died and now we need to rebuild the cluster" kinds of things, and I think serverless can definitely solve that. I would definitely prefer the more managed version. Maybe not serverless, because, for some of the use cases or some of the bits of the lifecycle, serverless is not the best fit, but a managed product is definitely something that I prefer over a non-managed product.
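As a reference point for the inference-on-Lambda idea discussed above, a minimal sketch of an AWS Lambda handler serving predictions might look like this. The model file name and input format are assumptions for the example, not a particular production setup.

    # Minimal sketch of model inference in an AWS Lambda handler.
    # The model file and input format are assumptions for illustration.
    import json
    import pickle

    # Load once per container, outside the handler, so warm invocations
    # reuse the already-loaded model.
    with open("model.pkl", "rb") as f:
        MODEL = pickle.load(f)

    def handler(event, context):
        features = json.loads(event["body"])["features"]
        prediction = MODEL.predict([features])[0]
        return {
            "statusCode": 200,
            "body": json.dumps({"prediction": float(prediction)}),
        }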
Jeremy: Right. And so, I guess one last question for you here, because this is something that always interests me. There are relevant things that we need machine learning for. I mean, I think fraud detection is a hugely important one. Sentiment analysis, again. Some of those other things are maybe, I don't know, I shouldn't call them toy things, but personalization and some of those things, they're all really great things to have, and it seems like you can't build an application now without somebody wanting some piece of that machine learning in there. So do you see that as where we are going, where in the future we're just going to have more of these APIs?
I mean, out of AWS, because I'm more familiar with the AWS ecosystem, but they have Personalize and they have Connect and they have all these other services, they have the recommendation engine thing, all these different services ... Lex, or whatever, that will read text, natural language processing and all that kind of stuff. Is that where we're moving to, just all these pre-trained, canned products that I can just access via an API? Or do you think that if you're somebody getting started and you really want to get into the ML world, you should start diving into the TensorFlows and some of those other things?
Alexa: So I think if you are building an app and your goal is not to become an ML engineer or a data scientist, then these canned models are really useful, because you can have a really good recommendation engine in your product, you could have a really good personalization engine in your product, things like that. And so, those are, I think, really useful, and you don't need to know any machine learning in order to use them. So I think we are definitely going in that direction, because most companies won't hire data scientists just to train a recommender model. I think it's just easier to use an API endpoint that is already really good.
So I think, yeah, we are definitely heading in that direction. But if you are someone who wants to become a data scientist or wants to be more involved with MLOps or machine learning engineering, then I think jumping into TensorFlow and understanding, maybe not, as we discussed, getting into the model architectures and things like that, but just understanding the workflow and being able to program a machine learning pipeline from end to end, I think that's definitely recommended.
Jeremy: All right. So one last question: If you've ever used the Watson NLP API or the Google Vision API, can you put on your resume that you're a machine learning expert?
Alexa: Well, if you really want to do that, I would give it a go. Why not?
Jeremy: All right. Good. Good to know. Well, Alexa, thank you so much for sharing all this information. Again, I find the use cases here to be much more complex than maybe some of the surface ones that you sometimes hear about. So, obviously, machine learning is here to stay. It sounds like there's a lot of really good opportunities for people to start dabbling in it and using it without having to become a machine learning expert. But, again, I appreciate your expertise. So if people want to find out more about you or more about the things you're working on and datastack.tv, things like that, how do they do that?
Alexa: So we have a Twitter page for datastack.tv, so feel free to follow that. I also have a Twitter account, so feel free to follow me. There is a datastack.tv website; it's just datastack.tv. You can go there, and you can check out the courses. And also, we have created a roadmap specifically for data engineers, because there was no good roadmap for data engineers. I definitely recommend checking that out, because we listed most of the tools that a data engineer, and also a machine learning engineer, should know about. So if you're interested in this career path, then I would definitely recommend checking that out. Under datastack.tv's GitHub, there is a roadmap that you can find.
Jeremy: Awesome. All right.
And that's just, like you said, datastack.tv.
Alexa: Yes.
Jeremy: I will make sure that we get your Twitter and LinkedIn and GitHub and all that stuff in there. Alexa, thank you so much.
Alexa: Thanks. Thank you.

Datacast
Episode 58: Deep Learning Meets Distributed Systems with Jim Dowling

Datacast

Play Episode Listen Later Mar 19, 2021 79:15


Show Notes
(1:56) Jim went over his education at Trinity College Dublin in the late 90s/early 2000s, where he got early exposure to academic research in distributed systems.
(4:26) Jim discussed his research focused on dynamic software architecture, particularly the K-Component model that enables individual components to adapt to a changing environment.
(5:37) Jim explained his research on collaborative reinforcement learning that enables groups of reinforcement learning agents to solve online optimization problems in dynamic systems.
(9:03) Jim recalled his time as a Senior Consultant for MySQL.
(9:52) Jim shared the initiatives at the RISE Research Institute of Sweden, where he has been a researcher since 2007.
(13:16) Jim dissected his peer-to-peer systems research at RISE, including theoretical results for search algorithms and walk topology.
(15:30) Jim went over the challenges of building peer-to-peer live streaming systems at RISE, such as gradienTv and GLive.
(18:18) Jim provided an overview of research activities at the Division of Software and Computer Systems at the School of Electrical Engineering and Computer Science at KTH Royal Institute of Technology.
(19:04) Jim has taught courses on Distributed Systems and Deep Learning on Big Data at KTH Royal Institute of Technology.
(22:20) Jim unpacked his O'Reilly article in 2017 called "Distributed TensorFlow," which includes the deep learning hierarchy of scale.
(29:47) Jim discussed the development of HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces its single-node in-memory metadata service with a distributed metadata service built on a NewSQL database.
(34:17) Jim rationalized the intention to commercialize HopsFS and build Hopsworks, a user-friendly data science platform for Hops.
(36:56) Jim explored the relative benefits of public research money and VC-funded money.
(41:48) Jim unpacked the key ideas in his post "Feature Store: The Missing Data Layer in ML Pipelines."
(47:31) Jim dissected the critical design that enables the Hopsworks feature store to refactor a monolithic end-to-end ML pipeline into separate feature engineering and model training pipelines.
(52:49) Jim explained why data warehouses are insufficient for machine learning pipelines and why a feature store is needed instead.
(57:59) Jim discussed prioritizing the product roadmap for the Hopsworks platform.
(01:00:25) Jim hinted at what's on the 2021 roadmap for Hopsworks.
(01:03:22) Jim recalled the challenges of getting early customers for Hopsworks.
(01:04:30) Jim intuited the differences and similarities between being a professor and being a founder.
(01:07:00) Jim discussed worrying trends in the European tech ecosystem and the role that Logical Clocks will play in the long run.
(01:13:37) Closing segment.
Jim's Contact Info
Logical Clocks | Twitter | LinkedIn | Google Scholar | Medium | ACM Profile | GitHub
Mentioned Content
Research Papers
"The K-Component Architecture Meta-Model for Self-Adaptive Software" (2001)
"Dynamic Software Evolution and The K-Component Model" (2001)
"Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing" (2005)
"Building Autonomic Systems Using Collaborative Reinforcement Learning" (2006)
"Improving ICE Service Selection in a P2P System using the Gradient Topology" (2007)
"gradienTv: Market-Based P2P Live Media Streaming on the Gradient Overlay" (2010)
"GLive: The Gradient Overlay as a Market Maker for Mesh-Based P2P Live Streaming" (2011)
"HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases" (2016)
"Scaling HDFS to More Than 1 Million Operations Per Second with HopsFS" (2017)
"Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata" (2017)
"Implicit Provenance for Machine Learning Artifacts" (2020)
"Time Travel and Provenance for Machine Learning Pipelines" (2020)
"Maggy: Scalable Asynchronous Parallel Hyperparameter Search" (2020)
Articles
"Distributed TensorFlow" (2017)
"Reflections on AWS's S3 Architectural Flaws" (2017)
"Meet Michelangelo: Uber's Machine Learning Platform" (2017)
"Feature Store: The Missing Data Layer in ML Pipelines" (2018)
"What Is Wrong With European Tech Companies?" (2019)
"ROI of Feature Stores" (2020)
"MLOps With A Feature Store" (2020)
"ML Engineer Guide: Feature Store vs. Data Warehouse" (2020)
"Unifying Single-Host and Distributed Machine Learning with Maggy" (2020)
"How We Secure Your Data With Hopsworks" (2020)
"One Function Is All You Need For ML Experiments" (2020)
"Hopsworks: World's Only Cloud-Native Feature Store, now available on AWS and Azure" (2020)
"Hopsworks 2.0: The Next Generation Platform for Data-Intensive AI with a Feature Store" (2020)
"Hopsworks Feature Store API 2.0, a new paradigm" (2020)
"Swedish startup Logical Clocks takes a crack at scaling MySQL backend for live recommendations" (2021)
Projects
Apache Hudi (by Uber)
Delta Lake (by Databricks)
Apache Iceberg (by Netflix)
MLflow (by Databricks)
Apache Flink (by The Apache Foundation)
People
Leslie Lamport (The Father of Distributed Computing)
Jeff Dean (Creator of MapReduce and TensorFlow, Lead of Google AI)
Richard Sutton (The Father of Reinforcement Learning — who wrote "The Bitter Lesson")
Programming Books
C++ programming books (by Scott Meyers)
"Effective Java" (by Joshua Bloch)
"Programming Erlang" (by Joe Armstrong)
"Concepts, Techniques, and Models of Computer Programming" (by Peter Van Roy and Seif Haridi)

DataCast
Episode 58: Deep Learning Meets Distributed Systems with Jim Dowling

DataCast

Play Episode Listen Later Mar 19, 2021 79:15


Show Notes

(1:56) Jim went over his education at Trinity College Dublin in the late 90s/early 2000s, where he got early exposure to academic research in distributed systems.
(4:26) Jim discussed his research on dynamic software architecture, particularly the K-Component model, which enables individual components to adapt to a changing environment.
(5:37) Jim explained his research on collaborative reinforcement learning, which enables groups of reinforcement learning agents to solve online optimization problems in dynamic systems.
(9:03) Jim recalled his time as a Senior Consultant for MySQL.
(9:52) Jim shared the initiatives at the RISE Research Institute of Sweden, where he has been a researcher since 2007.
(13:16) Jim dissected his peer-to-peer systems research at RISE, including theoretical results on search algorithms and walk topologies.
(15:30) Jim went over the challenges of building peer-to-peer live streaming systems at RISE, such as GradientTV and GLive.
(18:18) Jim provided an overview of research activities at the Division of Software and Computer Systems at the School of Electrical Engineering and Computer Science at KTH Royal Institute of Technology.
(19:04) Jim has taught courses on Distributed Systems and Deep Learning on Big Data at KTH Royal Institute of Technology.
(22:20) Jim unpacked his 2017 O’Reilly article “Distributed TensorFlow,” which includes the deep learning hierarchy of scale.
(29:47) Jim discussed the development of HopsFS, a next-generation distribution of the Hadoop Distributed File System (HDFS) that replaces its single-node in-memory metadata service with a distributed metadata service built on a NewSQL database.
(34:17) Jim explained the decision to commercialize HopsFS and build Hopsworks, a user-friendly data science platform for Hops.
(36:56) Jim explored the relative benefits of public research money and VC money.
(41:48) Jim unpacked the key ideas in his post “Feature Store: The Missing Data Layer in ML Pipelines.”
(47:31) Jim dissected the critical design that enables the Hopsworks feature store to refactor a monolithic end-to-end ML pipeline into separate feature engineering and model training pipelines.
(52:49) Jim explained why data warehouses are insufficient for machine learning pipelines and why a feature store is needed instead.
(57:59) Jim discussed prioritizing the product roadmap for the Hopsworks platform.
(01:00:25) Jim hinted at what’s on the 2021 roadmap for Hopsworks.
(01:03:22) Jim recalled the challenges of getting early customers for Hopsworks.
(01:04:30) Jim reflected on the differences and similarities between being a professor and being a founder.
(01:07:00) Jim discussed worrying trends in the European tech ecosystem and the role that Logical Clocks will play in the long run.
(01:13:37) Closing segment.

Jim’s Contact Info: Logical Clocks | Twitter | LinkedIn | Google Scholar | Medium | ACM Profile | GitHub

Mentioned Content

Research Papers:
“The K-Component Architecture Meta-Model for Self-Adaptive Software” (2001)
“Dynamic Software Evolution and The K-Component Model” (2001)
“Using feedback in collaborative reinforcement learning to adaptively optimize MANET routing” (2005)
“Building Autonomic Systems Using Collaborative Reinforcement Learning” (2006)
“Improving ICE Service Selection in a P2P System using the Gradient Topology” (2007)
“gradienTv: Market-Based P2P Live Media Streaming on the Gradient Overlay” (2010)
“GLive: The Gradient Overlay as a Market Maker for Mesh-Based P2P Live Streaming” (2011)
“HopsFS: Scaling Hierarchical File System Metadata Using NewSQL Databases” (2016)
“Scaling HDFS to More Than 1 Million Operations Per Second with HopsFS” (2017)
“Hopsworks: Improving User Experience and Development on Hadoop with Scalable, Strongly Consistent Metadata” (2017)
“Implicit Provenance for Machine Learning Artifacts” (2020)
“Time Travel and Provenance for Machine Learning Pipelines” (2020)
“Maggy: Scalable Asynchronous Parallel Hyperparameter Search” (2020)

Articles:
“Distributed TensorFlow” (2017)
“Reflections on AWS’s S3 Architectural Flaws” (2017)
“Meet Michelangelo: Uber’s Machine Learning Platform” (2017)
“Feature Store: The Missing Data Layer in ML Pipelines” (2018)
“What Is Wrong With European Tech Companies?” (2019)
“ROI of Feature Stores” (2020)
“MLOps With A Feature Store” (2020)
“ML Engineer Guide: Feature Store vs. Data Warehouse” (2020)
“Unifying Single-Host and Distributed Machine Learning with Maggy” (2020)
“How We Secure Your Data With Hopsworks” (2020)
“One Function Is All You Need For ML Experiments” (2020)
“Hopsworks: World’s Only Cloud-Native Feature Store, now available on AWS and Azure” (2020)
“Hopsworks 2.0: The Next Generation Platform for Data-Intensive AI with a Feature Store” (2020)
“Hopsworks Feature Store API 2.0, a new paradigm” (2020)
“Swedish startup Logical Clocks takes a crack at scaling MySQL backend for live recommendations” (2021)

Projects: Apache Hudi (by Uber) | Delta Lake (by Databricks) | Apache Iceberg (by Netflix) | MLflow (by Databricks) | Apache Flink (by The Apache Software Foundation)

People: Leslie Lamport (the father of distributed computing) | Jeff Dean (creator of MapReduce and TensorFlow, lead of Google AI) | Richard Sutton (the father of reinforcement learning, who wrote “The Bitter Lesson”)

Programming Books: Scott Meyers’s C++ books | “Effective Java” (by Joshua Bloch) | “Programming Erlang” (by Joe Armstrong) | “Concepts, Techniques, and Models of Computer Programming” (by Peter Van Roy and Seif Haridi)
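The feature-store design Jim describes above (41:48 through 52:49) splits one monolithic pipeline into a write path (feature engineering) and a read path (model training) that share precomputed features. Below is a minimal sketch of that decoupling, using pandas and Parquet as stand-ins; it is not the Hopsworks API, and the table and column names are invented for illustration.

```python
import pandas as pd

FEATURE_GROUP = "user_features.parquet"  # stand-in for a feature-store table

def feature_engineering_pipeline(raw_events: pd.DataFrame) -> None:
    """Write path: compute reusable features once, on a schedule."""
    features = raw_events.groupby("user_id").agg(
        purchase_count=("order_id", "count"),
        avg_order_value=("amount", "mean"),
    )
    features.to_parquet(FEATURE_GROUP)

def training_pipeline(labels: pd.DataFrame):
    """Read path: any model can join precomputed features to its labels."""
    features = pd.read_parquet(FEATURE_GROUP)
    train_df = labels.join(features, on="user_id", how="left")
    X, y = train_df.drop(columns=["label"]), train_df["label"]
    return X, y  # hand off to any training framework
```

The point of the split is that the expensive, batch-oriented write path and the experiment-oriented read path can now be owned, scheduled, and scaled independently.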

The Art Of Programming
242: The Complicated Relationship Between Deep Learning and DevOps — The Art Of Programming [ Deep Learning ]

The Art Of Programming

Play Episode Listen Later Nov 29, 2020 43:35


00:00:00 — In a new place
00:08:50 — Rovers and robotaxis will take over
00:14:40 — The big problem with training AI
00:23:58 — Getting into AI, DL, and ML
00:28:26 — DevOps in the ML world
00:32:10 — FaaS, Spot Instances, Cortex
00:37:04 — What useful things can you do to get started?

Links: Practical webinar: new DataSphere features for data analysis and building machine learning models | Cortex | Open Data Science | Big Data Essentials: HDFS, MapReduce and Spark RDD | Big Data Applications: Machine Learning at Scale | Fast.Ai
Hosts: @golodnyj, Nikita Lozhnikov
Thanks to our patrons: Aleksandr Kiriushin, Alex, Alex Malikov, Fedor Rusak, Ihor Kopyl, Leo Kapanen, Mikhail Gaidamaka, nikaburu, Vasiliy Galkin, Pavel Drabushevich, Pavel Sitnikov, Sergey Kiselev, Sergey Vinyarsky, Sergii Zhuk
2020 gift | Telegram channel | YouTube channel | iTunes podcast | Support the podcast | Past episodes
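For readers coming to the linked Big Data Essentials course cold, the MapReduce model it teaches fits in a few lines of PySpark. This is a generic word-count illustration, not material from the episode; the input path is a placeholder.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")

counts = (
    sc.textFile("logs.txt")                  # placeholder input file
      .flatMap(lambda line: line.split())    # map: emit one token per word
      .map(lambda word: (word, 1))           # map: (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)       # reduce: sum counts per word
)

print(counts.take(10))
```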

The Business of Open Source
Discussing Forter with CTO Iftah Gideoni

The Business of Open Source

Play Episode Listen Later Nov 4, 2020 39:25


This conversation covers:
- The value that Forter provides, and the types of companies that they work with. Iftah also explains what makes Forter so unique.
- The underlying technology that Forter is using, and how they quickly process hundreds of complex backend workflows. Iftah also talks about some of the tools that they are using, including AWS and Apache Storm.
- How Forter approaches the cloud, and how it's helping them concentrate on the business of detecting fraud. In addition, Iftah talks about the types of cloud services that Forter is using.
- Forter's ability to scale, including how they responded to increased customer demand during COVID-19.
- Forter's biggest technical challenge that they are currently working through.
- Iftah's thoughts on the trade-off between security and speed.

Links:
Forter
Forter on Twitter
Connect with Iftah on LinkedIn
Iftah's email: iftah@forter.com

Transcript:
Emily: Hi everyone. I'm Emily Omier, your host, and my day job is helping companies position themselves in the cloud-native ecosystem so that their product's value is obvious to end-users. I started this podcast because organizations embark on the cloud-native journey for business reasons, but in general, the industry doesn't talk about them. Instead, we talk a lot about technical reasons. I'm hoping that with this podcast, we focus more on the business goals and business motivations that lead organizations to adopt cloud-native and Kubernetes. I hope you'll join me.
Emily: Welcome to The Business of Cloud Native. I'm Emily Omier, your host, and today I'm chatting with Iftah Gideoni. Iftah is the CTO at Forter. Iftah, first of all, thank you so much for joining me.
Iftah: Very glad to be here.
Emily: I wanted to have you start by introducing yourself and what you do, and then also what Forter does.
Iftah: Hi, I'm Iftah. I'm a physicist by education, and for the last 20 years a CTO of several companies, mostly [00:01:11 unintelligible] governmental companies and companies that I founded. For the last six and a half years, I have been with Forter. What Forter started to do in 2014 was to provide what was, at the time, a very bold vision: fully automated, fully cloud-based decisions about whether to allow or decline e-commerce transactions. From that time, we actually implemented and executed that: today we decide many more than 3 million transactions every day, all in real time, without a human in the loop. And we expanded into being a fully fledged trust engine that gives decisions not only about transactions, but about many other points of interaction with the consumer, for example at login time, and at other points where a trust decision is needed.
Emily: Just because I think it might be interesting to listeners, give me some examples of when somebody might interact with Forter, or have some sort of action approved or declined by Forter.
Iftah: Right. The prime customers of Forter are the big e-commerce enterprises. Think about the [00:02:42 Sephoras], the Nordstroms, the Home Depots, these kinds of companies. Whenever you press the button committing to the purchase and you see those small things spinning on the screen, the transaction is sent to Forter, and Forter, usually within half a second, returns a decision. Now, Forter does not act as an additional data point, or input, or score into some system of the merchant. It actually answers whether to approve or decline the transaction.
In very many cases—and most of the revenue of Forter comes from covered transactions—if a covered transaction turns out to be fraud, it's on Forter. Forter will guarantee it. And we pioneered this model of putting our money where our mouth is.
Emily: Tell me just a little bit about why this is so difficult. What makes what Forter does unique?
Iftah: What Forter does is unique because it tells the human story, and takes it all the way to the decision itself. For example, it's very easy to approve the fourth transaction of a person that is sitting at home, browsing from home, making the purchase on the same desktop they used the previous times, and sending the shipment to the same home. That's very easy. But we want to be able to approve the traveler, the person that is sending a gift to a third party, or a person that is sending a gift to another state while not browsing from home and not from his common device. We want to be able to approve those transactions that are checking out as guests from a new device, where this is the first time this person ever appeared on our radar. The ability to do that, to take calculated risks, to look at the behavior and the cyber clues, and still be able to tell that this is indeed a new person and not someone that visited before and is now trying to hide: that's what makes what we do very difficult and complex.
Emily: So, tell me a bit about the technology story. What technology do you use to accomplish this, and how does it work? What does your stack look like?
Iftah: When I came, in 2014, I looked at the system and at what is actually needed in order to cater to such a complex story. And I thought to myself (we'll talk a bit later about how all this is excellently suited for the Cloud) that throughput and big data is not the problem. First, it's more or less solved; and besides, this is the e-commerce business, not Facebook-scale throughput. On the other hand, it's not hardcore real-time, right? We're talking about tens of milliseconds, not the microseconds domain. What is extreme about what we do is the complexity of the flow. We have hundreds of processes that need to run within that half a second in order to test, check, infer, and decide on many aspects of this transaction and of this person. So, first, we started from Amazon Web Services, and we started with Apache Storm. Why did we decide that? Because we wanted something that enables, first, a lot of parallelism (doing many things in parallel) with smart joins, that is, with processes that take information from other processes that executed in parallel and can decide whether what they have so far from these processes is enough. Because we are very high availability (we didn't lose more than 10 seconds straight in the last four years), but a lot of our sub-processes are not. So, you need a machine that is able to infer whether the information at hand is good enough, move forward, and still give, after half a second, the answer. We also wanted, within this high-availability system, to give the domain experts, the analysts, and the fraud researchers very direct access to the code, so that each insight they get is manifested close to real time, maybe in 10 or 15 minutes from the time they understood that there is a new wave of attacks or a new fraudster in action in a particular store or across stores.
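As a rough illustration of the smart joins Iftah describes (fan out sub-processes, then decide with whatever has arrived by the deadline), here is a minimal Python sketch. The check functions, scores, and threshold are invented for the example; this is not Forter's code.

```python
import concurrent.futures as cf

# Stand-ins for real fraud checks; each returns a risk score in [0, 1].
CHECKS = {
    "device_fingerprint": lambda tx: 0.1,
    "behavioral_signals": lambda tx: 0.3,
    "address_reputation": lambda tx: 0.6,
}

def decide(tx, budget_s=0.5):
    """Fan out independent checks; at the deadline, decide with what we have."""
    pool = cf.ThreadPoolExecutor(max_workers=len(CHECKS))
    futures = {pool.submit(check, tx): name for name, check in CHECKS.items()}
    done, _not_done = cf.wait(futures, timeout=budget_s)
    scores = {}
    for future in done:
        try:
            scores[futures[future]] = future.result()
        except Exception:
            pass  # a sub-process may fail; the decision itself must not
    # Checks that missed the deadline are dropped, not awaited.
    pool.shutdown(wait=False, cancel_futures=True)
    # Average whatever signals arrived in time (a real system would weight them).
    risk = sum(scores.values()) / max(len(scores), 1)
    return "decline" if risk > 0.5 else "approve"

print(decide({"amount": 400}))
```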
We wanted all these insights to be manifested in the system within 10 or 15 minutes, without these people needing any engineering in order to do that. So, we created incubators within these Apache Storm processes that enable them to write their wisdom into the system in Python, without being technologists or engineers. So, this was the basis. Then we went on to build the best similarity in the world, that is, the understanding of whether we already saw a person, even if this person exhibits a new persona and is trying to hide. That is, they didn't give us the same phone number or email, it's not the same cookie, or the same IP, or the same credit card, and they don't use the same account, and we still want to know that this is the same person. This is a big part of what makes us efficient in exterminating fraud rings, and it enables us to increase the cost of doing business for the sophisticated fraudsters. These are the prime building blocks, and the last, very important building block is the way we represent the world. Traditionally, the world was represented by the transaction; the transaction was the building block. But we represent the world as people. We know more than 700 million people, their interactions, and their browsing in many stores. These 700 million people include most of the people that interact online in the US, and the same goes for the IPs, the addresses, and the devices in the US. The US is where our coverage is best. And all that we do revolves around the person, because we believe that the person is what actually persists in the world, with a persistent reputation. That is, if a person is legitimate, they will stay legitimate; usually, they won't flip on us. And if they are fraudsters, they will stay fraudsters. Not so for IPs, addresses, and all the other entities you can think of. I hope this answers, to a degree, what you were asking about.
Emily: Yeah. And I'm going to go into some more questions, but it's really interesting that what you're combining is both this sophisticated technology and an understanding, almost like a law-enforcement understanding, of how fraud works. Or an anthropological investigation of how fraud rings work.
Iftah: Yes. And I think our main asset is the ability to combine what analysts understand (the spoofing of the device, how you detect that it's not really a mobile phone but an emulator on a desktop, how you can tell that someone is trying to mess with an application that you protect, and the ways in which you can approve a transaction that looks very fishy to begin with but has some hints of legitimacy) with a very robust, high-availability, and very secure machine. Because it needs to be secure: we touch a lot of personally identifiable information in our regular course of business, and we need the system to be ultra-secure while it is on the Cloud. And our security 101 booklet, on how to secure your startup's [00:12:45 unintelligible] usage on the Cloud, was actually trending number one on GitHub for months in 2017 when we issued it. [laughs].
Emily: That's excellent. Well, let me ask some more technology-specific questions. One is (you sort of alluded to this), how is the Cloud important, and in fact, I believe you said critical, to your business?
Would Forter even be possible without the Cloud?
Iftah: Forter would be possible with an on-prem cloud, right? Because when we say Cloud, it could be Amazon, or Azure, or GCP, but it could also be a cloud that we built somewhere. That would be possible. We didn't go there, and most e-commerce companies would not go there; we'll dive into why it's not wise in a minute. But Forter relies heavily on knowledge of people, regardless of which merchant they visited. So, if we see a person, it could be the first time Nordstrom sees them, but we already know them, and we can project the reputation of the person from previous interactions with other customers of ours. We don't share any customer data, of course, with any of our customers, but we can share parts of reputation, especially for people visiting a particular merchant for the first time. Now, this is a prime reason why it cannot be on the customer's premises. Several customers (I will not mention names, but huge conglomerates of carmakers) actually asked us to be on their cloud. We refused, and we let go of the business, because that's not how we do it, and the best value for them would be to share the data. So far, all the customers that we have actually agreed to share the usage of reputation with the rest of our network of customers. This is something that they cannot do in-house, and this is something, per your question, that cannot be done if we are not in the Cloud but on their own premises.
Emily: And so are you operating in all public clouds, or do you have your main technology running in one?
Iftah: We have our technology running in a few regions of AWS. And we are now deploying a few regions in Azure, too.
Emily: And so it doesn't matter which public cloud your customer uses. So, if you have a customer that uses GCP, it doesn't matter, right?
Iftah: It doesn't matter. And most of them are [00:16:01 naturally] aware where we are. Bear in mind that we are serving companies (I mentioned the names) which are inherently not technology companies. It doesn't matter where they sit; we are a full SaaS company for them. They send us the request, the transaction, and we give them a decision within this half a second, and that's the core of the business. It doesn't matter to them. As [00:16:36 unintelligible] to say earlier, the concept of the public cloud, using other people's cloud infrastructure, be it GCP, or Azure, or AWS, or others, is very well suited for e-commerce because of two prime characteristics of e-commerce. First, it doesn't need to be very hard real-time; you're talking about tens of milliseconds, and giving answers in hundreds of milliseconds, ultimately. And second, unlike the Twitters, the WhatsApps, the Facebooks, and the Googles, e-commerce is not big data, in the sense that every e-commerce transaction has, on average, very high monetization. So, the extra cost of using a public cloud is definitely worth it for an e-commerce entity, compared to creating your own farm. It is good for their flexibility, and it's good for the focus and attention on their core business, whereas if you run your own farm, you are into a lot of domains of expertise which are far away from selling whatever you sell.
Emily: And tell me also a bit about your own technology. Things like how you manage scalability.
How important is it for Forter's bottom line, the ability to have a scalable system?
Iftah: We have been running scaled out from day one; from 2013, actually. We can't scale up. We don't have anything that is done by a single computer. All the transactions run on what are called brains, which are scaled out for both redundancy and scalability. All our data stores are scaled out. The data stores that hold the transactions, the logging, and the entities that we talked about are scaled out and replicated, and the processes dealing with the hundreds of thousands of browsing events that we receive and analyze every second are, of course, scaled out too. So, from day one, from 2014, everything that we do is scaled out. In the first two months it was for redundancy across different availability zones of different regions, but from then on, it's all scaled out. And I will be very happy to dive into the particulars of the technologies, but what is important in the context of this podcast, I believe, is that doing it on the Cloud, using the cloud infrastructure, is actually enabling us to concentrate on the business of detecting fraud and on these massive topologies of hundreds of processes, in Java and Kotlin and Python, with very complex acyclic graphs connecting them. We just run them in parallel on very many servers that we can scale up and out as we wish. And this is something that helped us focus on our core business: understanding fraud.
Emily: Going back, actually, to this idea of scalability: I know over the past six, eight months, because of COVID, e-commerce has gone through the roof. And I'm assuming, in fact, I read that Forter's business has also been going through the roof. How have you managed scaling?
Iftah: Yes. Forter's business went through the roof in several verticals: food deliveries, of course, and the big department stores, whose digital transformation COVID accelerated. It did dive with the travel business, of course, right? A few things happened to our customers, and for Forter, scaling was natural. If we have 20 minutes of warning, scaling is natural to us, and here we had about 10 days of warning. Easy, right? For our customers, it was a bit different. First, a lot of them came to us because they actually had to eliminate all their manual processes. And Forter was there for them. Now, what Forter did for them, beyond eliminating any manual fraud-related tasks and loads, was to substantially reduce the hike in the customer-success load. Because Forter is able to be more accurate and to decline fewer legitimate customers, you don't have that many calls to the customer-success centers. And these are two bottlenecks for our big merchants: customer success, and fraud and fulfillment. And fulfillment, that is, being able to capture the money and send the goods, is also streamlined by the fact that it's all done in real time. These are the direct effects, but there were additional phenomena. One of them is that suddenly, with COVID, a lot of customers that didn't usually do things online started to buy online. And we saw that the percentage of new buyers at many of our customers suddenly jumped. And when you have new buyers, you need a very sophisticated system to be able to allow them in, to approve their transactions, and to allow them to build their reputation; so this happened.
The spikes in throughput happened; every day in the last four months has been like Black Friday and Cyber Monday combined for us. And that's good. We didn't lose any availability, and with the current technology, it wasn't a problem from the scalability aspect. And indeed, had we been on private or on-prem infrastructure, this would have been much harder.
Emily: And now tell me a little bit more about the technology required. We talked a little bit about it not being exactly a throughput problem, but you do have hundreds of processes that you have to run in, you know, several seconds. What technology do you need to leverage in order to make that happen?
Iftah: Everything that we run is on flash disks; we don't have rotating disks anymore. We run at low memory usage and a low percentage of CPU on every [00:24:30 unintelligible] that we run, to accommodate and absorb spikes. We use Apache Storm as the base of our topologies of these processes that we talked about, hundreds of them in each topology. And we have several topologies, for both transaction time and what we call visit time, the browsing time. We are a big customer of Elasticsearch; we have run Elastic from the very early days, and we use it for sophisticated queries in its own annoying language. [laughs]. And we have one of the largest clusters. We have about 15 clusters of Elasticsearch that serve the entities we talked about, our mapping of people to these entities, our logging, and our real-time matching between the current transaction and all the hundreds of millions of people we already know that acted online previously. These are the core technologies in our stack, and on top of that we use several other technologies: Spark, and our wrappers over Spark, for the MapReduce work of our machine learning processes, and Kafka for persistent distribution of our data among regions and among availability zones.
Emily: What would you say is your biggest technical challenge? And by this, I mean something that you're perhaps still working on, that you don't feel like you've totally figured out yet.
Iftah: I think we are very advanced in our matching, the similarity problem. That is something that we think is a pillar of our superiority in this field, but it's a never-ending story. The ability to detect relevant anomalies in the behavior of the crowd is something that we work very hard on, and we expect a lot from these technologies, because they have the potential to help us mitigate threats which are new to us: zero-hour threats, modus operandi (MOs) that we did not encounter earlier. These are the main issues. One issue that is mundane and prosaic is the cost per transaction. We do a lot of processing, and we start, at our scale, to feel the heat of the cost of serving all these transactions. Nothing that will take us out of the Cloud, but it's something that we need to work hard on. Last, but definitely not least, is security. We think we turned our emphasis on security into our unfair advantage in this field, but still, hardening your systems and thinking about the possible attack vectors on your systems and on your merchants' systems is something that I lose sleep at night over, and it is something that we can never say we are done with.
Emily: What do you think about the security-speed trade-off? Do you think it's real? Or do you think you can move just as fast and be secure?
Iftah: We can move with negligible sacrifices on security.
Again, if you are talking about real-time systems where the microseconds count, then it's a different story. But for us, having everything encrypted, both at rest and in motion, is something that does not need to come at the expense of speed. What is very interesting in this trade-off of security and speed is the trade-off not with the real-time processing speed, but with the engineering development speed. And here, the magic is in automating every security aspect, in your ability to mask all these security aspects from your engineers, and in giving them the right APIs so they can develop the application itself, which is our domain, at the same speed, while still being totally secure, without them needing to take care of the plumbing. That's something we invested a lot in, and it's a never-ending game. I think we're good at it, but never good enough.
Emily: And do you rely primarily on the cloud service providers, so on AWS's native services, or do you tend to find additional out-of-the-box services, or build your own? Do you have a philosophy on that?
Iftah: You know, philosophy is one thing, and what you do in practice sometimes needs to be traded off with reality. But we are currently running on AWS and starting to run on Azure, so we are making our processes agnostic to the particular cloud that we run on. It is interesting to do when you come to security configuration, because you need to create abstraction layers over the particular security mechanisms in AWS and Azure, which are quite different. And that's where we are now. So, we are moving to be totally agnostic. So far, we did use AWS services occasionally; we weren't heavy users of them, but we did use analytics databases, and we did use Kinesis, though we have moved now to Kafka, and so on. And we did use very cloud-specific queues, but we're moving out of those now.
Emily: Why do you think it's important to be cloud-agnostic?
Iftah: Because we run on two different clouds. We run on two clouds because of the very high availability requirement that we have. First, we need to be totally available to our merchants. Second, we need not only to be totally available to our merchants, we also need to be very, very accurate, always. So, it's not that I can degrade gracefully and say, "Okay, I always answer approve in certain occasions," because the fraudsters will very quickly understand that. So, we need to be at full brain capacity, always on. And if we are not, we start, within tens of minutes or an hour or two, to be very susceptible to great losses. So, that's the reason we need to be in multiple regions, and we need to be on both clouds. It does take a heavy penalty, and we do think about how to reduce the penalty of working with two clouds, but that's what we currently do.
Emily: Can you tell me how much your technology stack has changed since 2014?
Iftah: We did change a lot in the representation of the world, and this was big. We moved into Elastic from more traditional NoSQL and SQL [00:33:20 unintelligible] BMS. And we are moving now, again, to new high-throughput databases for our browsing events, the ones that get hundreds of thousands of events per second. And we are slowly moving [00:33:40 unintelligible] many more small stack items, like queues and data-distribution channels, that no longer serve us well as we scale out and as we move to being cloud-agnostic.
For example, we are moving our analytics database from AWS's Redshift to a cloud-agnostic database.
Emily: Excellent. I'm going to wrap up pretty soon; this has been really interesting. But a couple of last questions I wanted to ask. One is, can you describe what a day looks like for you? What does a day for the CTO of Forter, a cloud-based SaaS fraud-prevention company, look like? What do you actually do?
Iftah: First, I am looking at what may endanger our business in the next year and in the next three years. The reason why we are [00:34:42 unintelligible] (we call this process internally the 'what can kill us?' process) is mainly because we are in good shape, and when you're in good shape, you need to look at the threats, at how to protect the business from them, and at what new business you need to do. Then I'm looking at the health of our precision teams. Our precision teams are the data science team, the cyber R&D teams, the fraud researcher teams, and the engineering teams that support them. Across all of these, we need to see that we maintain our superiority. We have so far never lost a QC or a bakeoff on any performance issues, and it's a tall order to keep it that way. So, this is the second task that keeps me up. And the last is to see that we indeed have all that we need to enable the speed of development and the new products. Companies in their seventh year, as we are, are at an inflection point between the startup and the enterprise, and that's where you need to make sure you stay agile. Whether we stay agile depends on the agility of the organization (how can you scale? do you rely on several heroes?) and on the agility of the development itself, which relies, to a great degree, on keeping the tech debt low enough.
Emily: Fabulous. And what is a tool or platform that you think is essential to functioning?
Iftah: I think that we built a very robust, extensive monitoring and alerting infrastructure, and it enables us to understand quickly whether something has happened in the world. And I must say that most of the time that something happens in the world, it's not something we need to act on manually; sometimes it's something the merchants need to act on. We discovered, for example, that one of our online-travel-agency customers started to issue flight tickets for one percent of their price, for three and four bucks instead of four hundred bucks. And we detected it not by looking at the prices, but by seeing spikes of purchases from this OTA in Malaysia and Vietnam, and we were able to tell this to our merchant, to the customer, and the whole thing was rectified about 14 minutes from the time it started. So, our alerting and monitoring system, which operates at the application and business level and, on the other end, at the machine level, is very, very important, and pays for itself handsomely.
Emily: I think I've read accounts in the newspaper of travel agents or airlines having that type of mistake.
Iftah: Yes.
Emily: It tends to get some publicity. Last question is, how can listeners connect with you or follow you?
Iftah: Well, iftah@forter.com. Look us up at forter.com, and we will be very happy to talk to you.
Emily: Excellent. Thank you so much, Iftah, this was really fascinating.
Iftah: Thank you very much for having me.
Emily: Thanks for listening. I hope you've learned just a little bit more about The Business of Cloud Native.
If you'd like to connect with me or learn more about my positioning services, look me up on LinkedIn: I'm Emily Omier—that's O-M-I-E-R—or visit my website, which is emilyomier.com. Thank you, and until next time.
Announcer: This has been a HumblePod production. Stay humble.

Developer Weekly
Getting Started with AWS with David Tucker

Developer Weekly

Play Episode Listen Later Aug 5, 2020 34:09


David is a Webby Award-winning cloud development consultant who focuses on cloud-native custom development strategy. For over fifteen years as a consultant, David has led custom software development on emerging platforms for companies such as FedEx, AT&T, Sony Music, Intel, Comcast, Herman Miller, Principal Financial, and Adobe (as well as many others). David regularly writes and speaks on the digital landscape, with published works for Pluralsight, O'Reilly, and Lynda.com (now LinkedIn Learning). He has written for Mashable, Smashing Magazine, and VentureBeat, and he has spoken at events like AdTech, Interop, and Adobe Max.

Show resources:
David's blog
Follow David on Twitter
Pluralsight AWS Certified Cloud Practitioner Path

Full transcript:
Barry Luijbregts 0:20: Welcome to another episode of Developer Weekly. This week, I'm talking with David Tucker about getting started with Amazon Web Services, or AWS. David is a cloud development consultant and author at Pluralsight, O'Reilly, LinkedIn Learning, and much more. Thanks for being on the show, David. How are you doing?
David Tucker: I'm doing excellent. Thank you for having me on.
Barry Luijbregts 0:46: Yeah, I know. It's a very interesting topic. I usually get into Azure topics, as I love Azure and have been playing with it since its conception. So I don't know much about AWS, and I would love to learn from you, because AWS is actually a lot older than Azure, right?
David Tucker 1:05: Yeah, that's correct. And so AWS really began this entire space. And one of the interesting things is, you know, when we look at it, they have kind of evolved the entire concept of what it means to even be a cloud provider. And so AWS, in a lot of ways, has led the way in this area. But obviously, we've seen providers like Azure come up and provide very similar services in a lot of areas. But yet, it's still confusing when you're dealing with any platform that has so many different options and services included in it.
Barry Luijbregts 1:36: Yeah, absolutely. That's also what I usually try to do in Azure, as in tell people which services they can use for which scenarios, because that is very confusing. There are hundreds of services. For your scenario, which one do you pick? And there's lots of overlap as well. So how did you even get into the topic of AWS?
David Tucker 1:56: Well, I could go back to almost the beginning of my career. I'll just give a super, super quick highlight. I remember when I was working at a university here in the States, and I was helping to consult on research projects with the university. And I remember the first time I could actually fire up virtual servers, like multiple virtual servers, on my own machine. And I just remember the excitement of being like, I can make anything I want to make with this. And so when the cloud came out and I started to understand more about the public cloud, it really was helping with a lot of the challenges that I was seeing with my development projects: just figuring out how to handle storage, for example, and how to spin up web servers, because really my initial development was just being a web developer. I wanted to figure out how I could go beyond what I could do with a colocated server, which was how I was doing a lot of my work. And so with that, the cloud became really a big interest for me, because it enabled me to do so much more than I could do with what I had.
Barry Luijbregts 2:55: Right? Yeah, the cloud is an amazing place.
So let's just start right there, as in cloud in general: why is that even interesting over, let's say, a server that's under your desk?
David Tucker 3:09: Yeah, I think especially when we think about today's climate in terms of development and technology in general, the exciting thing here is we've made it accessible to pretty much everyone. I remember when I first started as a developer, you had to have so much money to be able to set up something that could scale to even meet thousands of users. And the exciting thing here is now, if you're a developer and you have an idea, you can bring it to millions of people, and really only pay for what you're actually using. When we think about traditional data centers, with the ability to scale, you had to predict the amount of load you were going to have, you had to get more servers than what you needed, and you had to have access to a data center. We've almost democratized getting technology in the hands of people, and that, to me, is what's most exciting about it.
Barry Luijbregts 4:02: Yeah, that is very exciting to me as well. Because, you know, basically now if you have an idea, you can just bring it to market. It doesn't really matter if you have no budget or anything; you can just put it all in the cloud on serverless services, and it just works. It's amazing. Absolutely. Yeah, it still excites me to this day as well, because cloud services evolve quickly. Back in the day, I used to work with web applications a lot. And they also needed to be scalable, even if they would run on virtual machines on premises or wherever. So then we would build web farms, and those web farms would be connected to each other and scale, which was a very, very difficult thing to do, with sharing session state and things like that. And nowadays, it's just a slider. You just slide to scale up and down, and it's just crazy how much time I have invested into learning that and actually getting things to run on that. And now it's just a slider. It kind of makes me sad, but also very excited.
David Tucker 5:08: I totally agree. And I think only people like us, who have lived in both of these worlds, really understand the brilliance of what we have currently. And one of the interesting things is that it means that in some ways we're doing less. And I think for some people, that reaction is almost a little troubling, because they feel like, well, I know how to do all these complex things. Like, for example, like you're talking about, setting up some type of store for session state and keeping that across an entire cluster of servers. But what we've learned is we now get to focus not on all of the things required to do something, but really on the application we're building, and not any of these other things.
Barry Luijbregts 5:50: Exactly. The cloud takes care of the plumbing for us, and we just focus on creating value for the customers. So, AWS: what can you do for us? Let's say I'm a dotnet developer, which I am, and I create, let's say, an ASP.NET Core web application, which is just a web application that can run anywhere, really. Where would I run that in AWS? How would that work?
David Tucker 6:14: Well, that's a great question. And one of the things that I've seen, because several of my clients are primarily dotnet shops as well.
However, for some of them, whether it's for financial reasons or existing relationships they have, they've chosen to go the AWS route. And again, for most developers, that decision is going to be made, you know, by their company at a high level. So you could be a dotnet developer, and maybe you really love Azure, you use it for all of your side projects, but all of a sudden you find yourself trying to figure out, how do I work in this AWS space? And when we look at the problem, like you mentioned, trying to figure out where to run something like this, a dotnet core web application: one of the great things is, just like on Azure, you have a lot of different choices depending on what you need to do. So when we started off with AWS, there really were only a couple of ways to do this, but we've seen new services expand. And so, if you're looking for the serverless type approach, where you're really trying to minimize the amount of maintenance you're going to have, you're looking at a service like AWS Lambda, which, when it launched, really kicked off this serverless concept across almost all of the cloud platforms, and they now all have some equivalent. It gives you the ability to do something closer to what we would call Functions as a Service (FaaS) within the cloud. But you still have the ability, if you need to, to either spin up a container with the container service that's available on AWS, which we call ECS, or just spin up your virtual servers, if that's what you're more comfortable with, using EC2, which is a service that's been around really since about the beginning of AWS.
Barry Luijbregts 7:45: Right. So you could use Lambda, which is the serverless service, to run a complete website in it.
David Tucker 7:54: Yeah, that's correct. And in most cases, we'll see this actually paired, if you're doing a serverless approach. So if you're looking to do, let's say, a single-page application type approach, and you're going to build in React or Angular or Vue, then you're going to host that in S3, which is the object storage service that we have within AWS, and you're going to do all of your API calls through Lambda. So if you're looking to do more of that type of web application, then you'll see all of that logic handled within Lambda, but the hosting in S3. But if you're doing more of a traditional web application, then you can look at using ECS. It's still possible to do it in Lambda, but it's a little bit more complicated in that approach. So that's when you generally see people moving over to more of a containerized approach.
Barry Luijbregts 8:38: And why would you use containers, really, in this case?
David Tucker 8:43: Sure. So in this case, when we're thinking about building out a traditional application, where you're not adopting a front-end web framework that's going to handle all the rendering for you and you're doing more page-based work, you're looking at running something that's going to run over an extended period of time. One of the limitations you have in working with a solution like Lambda is that even though you get the benefits of it being more of a serverless type approach, you have specific limits for how long it can run and for how much memory it can have. And so in some cases, you could build an entire traditional web application to run within those constructs.
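For reference, the Lambda "construct" David mentions is just a function that receives an event and returns a response. A minimal Python handler for an API Gateway proxy route might look like the sketch below; the route and message are illustrative, not from the episode.

```python
import json

def handler(event, context):
    """Minimal AWS Lambda handler behind an API Gateway proxy integration."""
    # Query-string parameters arrive in the event; default if none were sent.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```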
However, it probably would end up feeling a little bit limiting. When you're running something on a container, you lose those limits: you have the ability to give it as much time as it needs to run, because it's always going to be up and running, or you could even set it to run based on traffic. And you lose that memory limit as well: you have the ability to configure it with as much memory as you need it to have. So again, it would depend on what your limits are. But by using a specific service within ECS called Fargate, you lose the burden of having to manage the underlying cluster that your containers are running on. So you can do it in a much more efficient way than what we used to have to do when we were managing those clusters ourselves.
Barry Luijbregts 10:02: And that is Fargate. Is that then a container orchestrator?
David Tucker 10:07: Yes. So it pairs with the AWS service called ECS. There are really two different approaches you can take on AWS if you're interested in running a container. You have ECS, which is Amazon's native service for running containers in the cloud. They also have EKS, for people that are interested in doing the full Kubernetes workflow. But with ECS, you have the option to use this sub-service called Fargate, and it totally manages the underlying layer for you. And this was one of the challenges for those of us that were starting off and trying to use ECS over Kubernetes: the challenge was an effective way to manage that underlying layer, because initially Kubernetes just did that better. But with Fargate, AWS has totally built out a native service for managing that underlying layer, so you don't even have to think about it. As a developer, you can simply say, I want to have this container running, I want to have this many instances up and running, and I want it to be able to meet this demand, and the rest of it will be handled for you.
Barry Luijbregts: Right. So if you would compare Fargate to a Kubernetes service, then Fargate is even more platform-as-a-service, as in you don't have to do as much as with Kubernetes?
David Tucker: Absolutely. And so you have a little bit less control, but you have it fully managed, as opposed to Kubernetes, where, as you mentioned, you'd have a lot more things to control and a lot more things that could go wrong. In some situations, that's exactly what you need. But for most cases, especially with the clients that I work with, they actually need less control, because the platform is going to manage it efficiently for them.
Barry Luijbregts 11:39: Yeah. Okay. Oh, that's a great option, actually. Because I like containers, and I like the concept of containers, and that you can just take one and run it locally, and it's the exact same thing that you run in the cloud. But I'm never sure, you know, because it's so Infrastructure-as-a-Service, especially when you use Kubernetes. Because then you have to manage that whole infrastructure. And that's just not what I want to do. I want to just focus on creating stuff.
David Tucker 12:06: Exactly. And this brings up what I think is the number one mistake that new developers make when moving into the cloud.
And that's because, especially if they're more senior developers, they immediately shift to the more complex option, instead of the option that's going to allow them to minimize the time they spend maintaining whatever they build. And I think you see that even with organizations: they'll say, well, of course we need Kubernetes, we need all of those controls, and yet they don't ever factor the maintenance time into the solutions that they build. I've worked with clients that really do need those controls. But again, I would say a vast majority of them do not. And so with the cloud, one of the things I encourage new developers with is: choose the minimum approach that will allow you to meet the objectives that you have. You can always add new things in later; you can always adjust your approach. But in the beginning, build something with the minimum amount of maintenance that you need long-term that still meets the needs of the users that are going to be using it.
Barry Luijbregts 13:05: Right. Because, is it easy to migrate from service to service?
David Tucker 13:09: Yeah, one of the great things about a lot of the services is you do have that ability to migrate aspects of it. Let's take a look at the container services: ECS, Fargate, and EKS. Within that approach, you're still using a Docker container no matter which direction you choose. So if you wanted to start off by using Fargate, and then decide, you know what, we really need the controls that Kubernetes provides for us, absolutely, you can make that switch. There will be some work in switching. But it's a little easier to go from a simpler solution to a more complex one than it is to work backwards and go from the more complex one to the simple one.
Barry Luijbregts 13:47: All right. So that's great. That's a couple of options. And those are actually a lot fewer options to run your application than Azure has, which is a great thing, I think, because there's so much overlap always, and it's difficult to choose. So what about storing data? What would you use for that?
David Tucker 14:04: Yeah, and there are a couple of options here as well. And I think this is one of the things that's important to remember for those of us that have been in the cloud for a while: chances are when we started in the cloud, there were a lot fewer options, and now that there are so many options, it's a little bit more overwhelming for new developers getting into the platform. But for most things, in terms of storage on AWS, you're going to be looking at S3, which is just one of the most important services on the entire platform. Now, if you're talking about things like actually attaching volumes to virtual servers, there are other services that you're going to be leveraging. But when you're simply talking about storage, whether that's storing things like user-generated content from your web application or your mobile application, or storing log data, or really storing any type of general data, in those cases S3 is going to be the solution for you. And one of the things that I think developers can sometimes be fooled by is that it's very simple to get into S3 and go in and upload files. And you might think, well, that's all this is, right? This just stores files.
But you can begin to learn some of the capabilities that are provided with S3 that really do differentiate it. One is lifecycle configuration: you've got the ability to move your data between warm storage, cold storage, and true cold archive storage. You've got the ability to use it for a data lake, so you can even go in and run queries against unstructured data that's stored within your S3 buckets. There's, I mean, really, there's so much that S3 does. And it all ties in very nicely with AWS's authorization tool, which is IAM, so you can control who has access to it, and even set up some very specific policies for things like controlling who can access it from a user perspective, or from an IP perspective. There are a lot of different options. So S3 is really the powerhouse storage service that we have on AWS.
Barry Luijbregts 16:09: And then you use that to store unstructured data, so not normal relational data, right?
David Tucker 16:10: Correct. So, I know a lot of organizations that will dump, for example, large amounts of log data into S3 directly, and as mentioned, you can use a service called Athena to go in and actually run queries against that data. Again, you can also use it just as easily to store, you know, photos that people upload as part of your web application, and potentially use the lifecycle rules to move that back and forth between warm storage and cold storage, for example. And one of the great things about S3 as well is that, built in by default, depending on how you configure it, you have the ability to have URLs to every object that you store within S3. So if you want to use it as storage for your web assets, you have the ability to do that. If you want to be able to just make something available to the public and throw it out there so you can have a download link, you can do that. And then you can also pair this with another service, which is called Amazon CloudFront, Amazon's global content delivery network. So you can pair S3 with CloudFront, and now you've distributed your content out to all of their edge locations. And you see a lot of people using this with their web applications for storing their static assets. And doing it this way, you're really optimizing the download speed for anyone that's using your web application; we can see great increases over just using S3 by pairing it with CloudFront.
Barry Luijbregts 17:30: Right. So just for the listeners, if you didn't catch that: CloudFront is a content delivery network, which makes sure that stuff that you put in there, like static files, like JavaScript files or images, gets propagated to edge locations, little data centers that are always close to the user, so that the data is always close to you, and therefore you have less latency and things are more performant.
David Tucker 17:57: Absolutely. And AWS has many, many edge locations. I forget the exact number now, but I'm pretty sure we're north of 200 edge locations around the world. So you can really see your content spread out.
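Two of the S3 capabilities David just listed, lifecycle transitions to colder storage and shareable object URLs, are each a few lines with boto3. The bucket and key names below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Lifecycle rule: move objects under logs/ to Glacier after 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-logs",
            "Filter": {"Prefix": "logs/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }]
    },
)

# A time-limited download link for a single object.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-example-bucket", "Key": "report.pdf"},
    ExpiresIn=3600,  # link expires after one hour
)
print(url)
```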
And this is another one of the things that just gets me excited when we think about how things used to be versus how they are now: the fact that virtually anyone can take their content and send it out to servers, you know, from Europe, to Asia, to North America, to South America. You can send it out really with one click of the mouse, and within five to 10 minutes you're going to have that content all around the world. That's something that still really excites me.
Barry Luijbregts 18:32: Yeah. It's just a massive scale, isn't it? It's the extreme, massive scale that is so easy to use with the cloud. It's just still amazing to me. Absolutely. So what about relational data, like a SQL database? For instance, can I put that somewhere in AWS?
David Tucker 18:48: Absolutely. And so there are several different approaches that you can take, but the core service for relational databases on AWS is called RDS, or Relational Database Service. And the great thing here is we're not just talking about, you know, using an AWS-specific database: you have access here to SQL Server, you have access to MySQL, you have access to Postgres and MariaDB; there are several choices. But in addition to that, you also do have access to something that's AWS-specific, and that's called Aurora. And that's a database engine that really was built for the cloud. So they built that themselves, but they really targeted it at being both MySQL- and Postgres-compatible. So you actually can pick, when you create an overall database: hey, do I want it to be MySQL-compatible or Postgres-compatible? And you can use all of the same libraries. So one of the great benefits is, if you're used to using either of those databases, then you can simply create a database in RDS that's Aurora, and you don't have to change any of your code to get it to work with Aurora. It just works out of the box. And one of the really exciting things that they have also developed with this is a concept called Aurora Serverless. So if you have a database, maybe you have a side project, and you just want to have access to a database, but you don't want to pay for one to be up all the time: with serverless, you gain the ability to basically have this database spin up and spin down as needed, and even scale as needed, without you having to worry about managing those underlying database instances. So we're certainly seeing a lot more in this area. There are still a few negative aspects of using the serverless approach; they're still kind of maturing that product over time. But it's really exciting to see those kinds of concepts factor in now to databases as well as, you know, the compute resources that we have with Lambda.
Barry Luijbregts 20:29: Yeah, that's very exciting. What a cool name, by the way: Aurora. There are cool names in AWS.
David Tucker 20:36: I will give you one comment on the names. One thing you do have to be careful with, when you're learning about AWS as a developer, is that a lot of the services have similar names. And so one of the things that I always hear back from learners when they're getting ready for certification tests is that there are so many services to memorize. And we have things like CloudSearch versus CloudFormation versus, you know, CloudTrail; all of these sound the same. So that's just one other thing to let developers know: if you're struggling with that, you're not the only one. There are, you know, 212 services right now on AWS.
And sometimes it can be hard to remember all of the different names and what they mean.
Barry Luijbregts 21:10: Yeah, absolutely. And they might change as well. Like Microsoft Azure: names sometimes change because the marketing team just decides that another name sounds better, or is better for the market. So what about big data and data analytics? Because you talked about that a little bit already, that you can use S3, I think, to store your big, non-relational data and then do a bit of data analytics over that. Are there other services as well?
David Tucker 21:39: Yeah, there are, and there's actually a growing number of services in this area. This is an area where I think AWS has really placed a lot of emphasis in the last few years. We've even seen them develop what we call specialty certifications for both big data, which is now called analytics, and also machine learning. And these areas really do intersect. So if you're looking for more of a traditional data warehousing approach, this is where we have a service called Redshift. And this is what's going to give you, you know, column-based storage for structured data, where you can store it at a petabyte scale, so large, large amounts of data. That's where we see a lot of organizations shift if they're looking for more of that data warehouse approach. Now, if you're looking for more of that data lake approach, this is where we see organizations looking to use S3 for that type of data storage. And AWS has even tried to make this easier with a service called Lake Formation. Any of their services that end in "Formation" are really there to help you build out an initial capability in an area, to launch infrastructure. So Lake Formation tries to go in and set up data lake constructs, actually set up some aspects of governance, and they even have services you can integrate with it that will help go through and, using machine learning, identify sensitive data and make sure that it's being handled properly as well. So this is an exciting area; there are so many services. You know, if you're used to using Apache Spark, for example, we have the service EMR, which is Elastic MapReduce, which will give you access to all of those same tools within AWS, but in a way where they're managing that for you. It's really more of a platform-as-a-service approach when you're doing that. But there also are, you know, cloud-native tools that you can interact with as well. And then we have the entire suite with SageMaker, for example, that will enable us to go in and take all the data that we have stored and begin to create machine learning solutions on top of what's there.
Barry Luijbregts 23:43: Ah, very cool. And what about visualizing that data?
David Tucker 23:47: So we have some different tools. And here's where I'm going to be really honest with you, because I know that, you know, some people that work in a platform like AWS just always believe AWS is the best solution. But here, you know, we have people that are used to working within Power BI and Tableau, for example. AWS has a service called QuickSight. And it's a really good service; it doesn't have the capabilities that you would see in a Power BI or a Tableau solution, but for some organizations, the solutions there are adequate for what they need.
I've moved several of my clients onto QuickSight, because they have some pretty basic needs in terms of data visualization. And with QuickSight, you can go in, just as you can with those other services, and create customized dashboards that are tied into your data. And you can marry together your structured and unstructured data into a single view. For a lot of organizations, that type of data insight is something that they use on a daily basis. But I will say again, if you're looking for some really advanced visualization use cases, solutions like Power BI and Tableau are going to be a step ahead of what we have within QuickSight.

Barry Luijbregts  24:50
Okay. Well, you should choose a tool that's best for you, and appreciate a tool that's in your preferred platform.

Barry Luijbregts  24:58
All right. So we can build quite a lot already: we can run our websites, we can store our data, we can use containers if we want to, we can do data analytics if we want to. What about if I want to do something with IoT – like I have a little device, or I have many devices, and that sends many, many millions of messages to the cloud? Is there something for that?

David Tucker  25:11
Absolutely. And what we see here, within a service called AWS IoT, is that one of the great benefits of it is that it integrates seamlessly into a lot of the other services that we've already mentioned. And this to me is – while I totally agree with what you mentioned previously, that we need to use the service that's best for whatever solution we need – one of the things I will say is that when we do pick services that are in the platform that we're in, we do usually get some advantages with that. And I think one of the advantages in using AWS IoT is we can see this integrated in a great way with services like Lambda, for example, and with some of the messaging services that we have within AWS. So it becomes very easy for us to go in and configure things, even if we have millions of messages coming in from our IoT devices: we can see them come in, we can analyze them, we can get analytics on them using some tools with what we call Amazon Kinesis, which is the stream processing solution we have on AWS. We can then, based on certain conditions, fire off a compute instance with Lambda to actually perform some action on the data that's coming in. And we can store that data, even if it's unstructured, in S3 and get that data lake capability that we talked about previously. So I really think the IoT example is a strong use case for pairing some of these services together, because of all the tight integration that can happen when you're working within a platform like AWS.

Barry Luijbregts  26:44
Yeah. And then from there, you have lots of data that you can then do machine learning on, and use artificial intelligence to discover what's in the data, or to use it for different purposes. I bet you probably have a lot of artificial intelligence services as well, like Azure Cognitive Services – that is artificial intelligence as a service, which is really a software-as-a-service offering. What is there in AWS for that?

David Tucker  27:11
Absolutely. So the equivalent to the Cognitive Services in Azure is that on AWS, we have what they call their AI services. And they're very similar in nature.
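Before the conversation turns to those AI services in detail, here is a small sketch of the IoT-style pipeline David outlines above: a Lambda function that consumes records from a Kinesis stream. Kinesis delivers record data base64-encoded; the JSON payload shape here (a temperature reading with a device_id) is a made-up example:

```python
import base64
import json

def handler(event, context):
    """Minimal sketch of a Lambda handler subscribed to a Kinesis stream.

    Kinesis delivers each record base64-encoded under
    event["Records"][i]["kinesis"]["data"]; the payload fields
    ("temperature", "device_id") are invented for this example.
    """
    for record in event["Records"]:
        raw = base64.b64decode(record["kinesis"]["data"])
        reading = json.loads(raw)
        # React to a condition, e.g. archive hot readings to S3 (omitted here).
        if reading.get("temperature", 0) > 90:
            print(f"High reading from device {reading.get('device_id')}")
```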
And this is one of the things I love about both Azure and AWS. For some organizations, especially if we look three, four years in the past, it was really difficult for them to get up to speed with using any aspect of machine learning or AI, because it required them to have a very specific skill set. They had to have people that were really, at the time, on the cutting edge, and they had to have a lot of expensive hardware to do GPU-based processing. And what we see here is we've really lowered the barrier for what it takes for organizations to get in and use these kinds of services. So on AWS we have a whole suite of them. There are ones like, for example, AWS Rekognition – this is the computer vision service. And so with this, you can go in and get keywords back from an image. For example, if we want to just understand what is detected within that image, we can get those back. We can also store faces within Rekognition and then detect those faces in other images. We can even go through and try to determine the emotion of someone within a particular image. And that's just really the tip of the iceberg of what's possible. We also have the ability to take audio and convert it into actual text that we can work with, and we can take text that we submit and have it be converted into a voice actually speaking it. So we have so many different things that cover use cases from computer vision, to natural language processing, to regression. We have a service called AWS Forecast that is able, just based on the data that you input, to create a regression model and be able to predict future values. So we really see a wide range of services that people can simply use in a SaaS-based approach to fully take advantage of machine learning, but without having to build their own models and go through all the complexities that come with that.

Barry Luijbregts  29:09
Yeah, I think that's a very good approach to get people into AI as well, because it's otherwise very complex. When you use these, you can just get started. And if you want to customize, you can always do that later.

Barry Luijbregts  29:23
So, I like to use Visual Studio and Visual Studio Code to create my applications. Are there any extensions for AWS in Visual Studio or Visual Studio Code, so that I can easily deploy stuff, or maybe talk to APIs within AWS?

David Tucker  29:42
Sure, that's a great question. And first, let me just throw out the irony here that for a long time I was a developer not in the Microsoft world. I was on a Mac, and I was doing iOS development for a long period of time. And it's funny – if you had ever told me that so much of what I'm doing would shift over to the Microsoft stack, I probably wouldn't have believed you. But even me, on a daily basis, I'm using Visual Studio Code as my primary editor in working with AWS, and in working with Azure with some of my clients. And so one of the great things we have here is there are multiple extensions that are available for AWS within Visual Studio Code.
This actually is the primary editor I see them creating extensions for. So, depending on what you're doing within AWS, there are going to be several different extensions that you can take advantage of, including some basic extensions that cover wide use cases, and then some very specific extensions for working with specific things – like, for example, the CDK, which is one of AWS's tools for doing infrastructure as code. So there are several different options available to you, and if you're using Visual Studio Code especially, I think you'll probably feel right at home working within AWS.

Barry Luijbregts  30:54
I expected as much. There probably are lots of extensions, just like they offer for Azure, of course, in Visual Studio Code and in Visual Studio as well. So, AWS seems like a very complete platform – of course, because it's very mature – and it has all these offerings for basically everything that you can think of. How do you best get started with it? As in, are there guides or websites that you can go to? What's the best way to get started?

David Tucker  31:25
Yeah, absolutely. I think for most developers, there are some great resources that AWS does provide to help you take those first steps. One of the things that I would probably selfishly say – because I've actually spent a lot of time thinking about how to get developers started on AWS – is that a lot of this went into a path that I have on Pluralsight. I worked very closely with Pluralsight; we spent about a month rethinking how we put out a path that really helps people get started in this area, and what we ended up with is a path that covers something called the Cloud Practitioner certification. So AWS has this entry-level certification, and this is pretty unique here: it's designed not just for developers, but really anybody who's going to be working in or around the cloud. And this is the initial certification that just shows that somebody has a good understanding of the platform and of the different capabilities. It doesn't cover everything – it's a very wide but kind of very shallow certification. It's designed to demonstrate that you have this wide knowledge. And one of the things I've seen is that so many people are taking this on, especially in this current time when people aren't sure about their job status. They're trying to get new skills, they're trying to make themselves marketable within this pandemic to potentially new opportunities. And this certification has proved to be a great way for new developers to get into AWS. So that would be one of the things I would reference there. There are three different courses; there's even a project where you can begin to put some of those concepts in place. And while AWS has some free resources that are also very, very good, I think this would really help you get from your starting point of not knowing much about the platform at all, to truly understanding the benefits of the cloud and what AWS provides. And also, one of the great things about it is if you go down this path and you stick with it, you actually will end up with a certification that you can put on your resume – something that helps open up doors for you within your career.

Barry Luijbregts
All right, well, that is absolutely great.
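For readers curious what the CDK David mentions looks like in practice, here is a minimal sketch in Python using CDK v2. The stack and bucket names are placeholders invented for the example:

```python
# Minimal AWS CDK (v2) sketch: infrastructure as code in Python.
# The stack and bucket names are placeholders for illustration.
from aws_cdk import App, Stack
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DemoStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # One versioned S3 bucket; `cdk deploy` turns this into CloudFormation.
        s3.Bucket(self, "DemoBucket", versioned=True)

app = App()
DemoStack(app, "DemoStack")
app.synth()
```

Running `cdk deploy` against an app like this synthesizes a CloudFormation template and provisions the resources, which is the "infrastructure as code" workflow referred to above.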
Barry Luijbregts  33:25
I will put a link to this Pluralsight path in the show notes, and also to other links of yours, including https://www.davidtucker.net/. Well, thank you very much for being on the show, and we will see you next week.

Thank you for listening to another episode of Developer Weekly. Please help me spread the word by reviewing the show on iTunes or your favorite podcast player. Also visit https://developerweeklypodcast.com/ for show notes and the full transcript. And if you'd like to support me in making the show, please visit my Pluralsight courses to learn something new.

כל תכני עושים היסטוריה
[עושים תוכנה] Processing Big Data at a Massive Scale!

כל תכני עושים היסטוריה

Play Episode Listen Later Aug 5, 2020 60:57


These days there are plenty of convenient tools for processing data and extracting interesting insights from it, but what do you do when the data is too big for your computer, or when processing time climbs from minutes to hours, days, and even weeks? In a new episode in our Big Data series, we zoom in on the processing side and talk about different solutions for processing data in a distributed fashion, using MapReduce and Apache Spark. We define various terms, ideas, problems, and solutions in the field of data processing. Happy listening, Chen and Amit.
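As a small taste of the distributed processing the episode covers, here is the canonical word count written against Apache Spark's Python API; the input and output paths are placeholders:

```python
# Canonical word count with Apache Spark's Python API (PySpark).
# The HDFS paths are placeholders for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///data/input.txt")

counts = (
    lines.flatMap(lambda line: line.split())  # map: line -> words
         .map(lambda word: (word, 1))         # map: word -> (word, 1)
         .reduceByKey(lambda a, b: a + b)     # reduce: sum counts per word
)
counts.saveAsTextFile("hdfs:///data/output")
```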

Around IT in 256 seconds
#11: MapReduce

Around IT in 256 seconds

Play Episode Listen Later Aug 4, 2020 4:15


MapReduce is a programming model for processing large amounts of data. It works best when you have a relatively simple program, but data is spread across thousands of servers. MapReduce was invented and popularized by Google. I'll talk about MapReduce in general and Hadoop in particular. Read more: https://256.nurkiewicz.com/11 Get new episodes straight to your mailbox: https://256.nurkiewicz.com/newsletter
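To illustrate the programming model the episode describes, here is a toy, single-process sketch of MapReduce-style word counting. The real systems apply the same user-supplied map and reduce functions, but shuffle the intermediate pairs across thousands of machines and handle failures along the way:

```python
from collections import defaultdict

# Toy, single-process sketch of the MapReduce model: the user supplies
# map and reduce functions; the framework handles shuffle and scale.

def map_fn(document):
    """Map: emit (word, 1) for every word in a document."""
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one key."""
    return (word, sum(counts))

def map_reduce(documents):
    # "Shuffle": group intermediate values by key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in groups.items()]

print(map_reduce(["to be or not to be", "to code or not to code"]))
```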

Podcast – Software Engineering Daily
Flink and BEAM Stream Processing with Maximilian Michels

Podcast – Software Engineering Daily

Play Episode Listen Later Feb 12, 2020 51:14


Distributed stream processing systems are used to read large volumes of data and perform operations across those data streams. These stream processing systems often build off of the MapReduce algorithm for collecting and aggregating large volumes of data, but instead of processing a calculation over a single large batch of data, they process data on…
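Since the episode contrasts batch MapReduce with stream processing, here is a minimal Apache Beam pipeline in Python; with the Flink runner, the same pipeline can execute on a Flink cluster. The input is inlined here purely for illustration:

```python
# Minimal Apache Beam pipeline; with the Flink runner, the same code
# runs on a Flink cluster. Input data is inlined for illustration.
import apache_beam as beam

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.Create(["to be or not to be"])
        | "Split" >> beam.FlatMap(str.split)
        | "Count" >> beam.combiners.Count.PerElement()
        | "Print" >> beam.Map(print)
    )
```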

Podcast – Software Engineering Daily
The Data Exchange with Ben Lorica

Podcast – Software Engineering Daily

Play Episode Listen Later Feb 10, 2020 68:36


Data infrastructure has been transformed over the last fifteen years. The open source Hadoop project led to the creation of multiple companies based around commercializing the MapReduce algorithm and the Hadoop distributed file system. Cheap cloud storage popularized the usage of data lakes. Cheap cloud servers led to wide experimentation for data tools. Apache Spark emerged…

The Podlets - A Cloud Native Podcast
[BONUS] A conversation with Joe Beda (Ep 6)

The Podlets - A Cloud Native Podcast

Play Episode Listen Later Nov 22, 2019 47:21


For this special episode, we are joined by Joe Beda, who is currently Principal Engineer at VMware. He is also one of the founders of Kubernetes from his days at Google! We use this open-table discussion to look at a bunch of exciting topics from Joe's past, present, and future. He shares some of the invaluable lessons he has learned and offers some great tips and concepts from his vast experience building platforms over the years. We also talk about personal things like stress management, avoiding burnout, and what is keeping him up at night with excitement and confusion! Large portions of the show are obviously spent discussing different aspects and questions about Kubernetes, including its relationship with etcd and Docker, its reputation as a very complex platform, and Joe's thoughts for investing in the space. Joe opens up on some interesting new developments in the tech world, and his wide-ranging knowledge is so insightful and measured, you are not going to want to miss this! Join us today for this great episode!

Follow us: https://twitter.com/thepodlets
Website: https://thepodlets.io
Feedback: info@thepodlets.io
https://github.com/vmware-tanzu/thepodlets/issues
Special guest: Joe Beda
Hosts: Carlisia Campos, Bryan Liles, Michael Gasch

Key Points From This Episode:
• A quick history of Joe and his work at Google on Kubernetes.
• The one thing that Joe thinks sometimes gets lost in translation on these topics.
• Lessons that Joe has learned in the different companies where he has worked.
• How Joe manages mental stress and maintains enough energy for all his commitments.
• Reflections on Kubernetes' relationship with and usage of etcd.
• Is Kubernetes supposed to be complex? Why are people so divided about it?
• Joe's experience as a platform builder and the most important lessons he has learned.
• Thoughts for venture capitalists looking to invest in the Kubernetes space.
• Joe's thoughts on a few different recent developments in the tech world.
• The relationship between Kubernetes and Docker and possible ramifications of this.
• The tech that is most exciting and alien to Joe at the moment!

Quotes:
“These things are all interrelated. At a certain point, the technology and the business and career and work-life – all those things really impact each other.” — @jbeda [0:03:41]
“I think one of the things that I enjoy is actually to be able to look at things from all those various different angles and try and find a good path forward.” — @jbeda [0:04:19]
“It turns out that as you bounced around the industry a little bit, there's actually probably more alike than there is different.” — @jbeda [0:06:16]
“What are the things that people can do now that they couldn't do pre-Kubernetes?
Those are the things where we're going to see the explosion of growth.” — @jbeda [0:32:40]
“You can have the most beautiful technology, if you can't tell the human story about it, about what it does for folks, then nobody will care.” — @jbeda [0:33:27]

Links Mentioned in Today’s Episode:
The Podlets on Twitter — https://twitter.com/thepodlets
Kubernetes — https://kubernetes.io/
Joe Beda — https://www.linkedin.com/in/jbeda
Eighty Percent — https://www.eightypercent.net/
Heptio — https://heptio.cloud.vmware.com/
Craig McLuckie — https://techcrunch.com/2019/09/11/kubernetes-co-founder-craig-mcluckie-is-as-tired-of-talking-about-kubernetes-as-you-are/
Brendan Burns — https://thenewstack.io/kubernetes-co-creator-brendan-burns-on-what-comes-next/
Microsoft — https://www.microsoft.com
KubeCon — https://events19.linuxfoundation.org/events/kubecon-cloudnativecon-europe-2019/
re:Invent — https://reinvent.awsevents.com/
etcd — https://etcd.io/
Cosmos DB — https://docs.microsoft.com/en-us/azure/cosmos-db/introduction
Rancher — https://rancher.com/
PostgreSQL — https://www.postgresql.org/
Linux — https://www.linux.org/
Babel — https://babeljs.io/
React — https://reactjs.org/
Hacker News — https://news.ycombinator.com/
BigTable — https://cloud.google.com/bigtable/
Cassandra — http://cassandra.apache.org/
MapReduce — https://www.ibm.com/analytics/hadoop/mapreduce
Hadoop — https://hadoop.apache.org/
Borg — https://kubernetes.io/blog/2015/04/borg-predecessor-to-kubernetes/
Tesla — https://www.tesla.com/
Thomas Edison — https://www.biography.com/inventor/thomas-edison
Netscape — https://isp.netscape.com/
Internet Explorer — https://internet-explorer-9-vista-32.en.softonic.com/
Microsoft Office — https://www.office.com
VB — https://docs.microsoft.com/en-us/visualstudio/get-started/visual-basic/tutorial-console?view=vs-2019
Docker — https://www.docker.com/
Uber — https://www.uber.com
Lyft — https://www.lyft.com/
Airbnb — https://www.airbnb.com/
Chromebook — https://www.google.com/chromebook/
Harbour — https://harbour.github.io/
Demoscene — https://www.vice.com/en_us/article/j5wgp7/who-killed-the-american-demoscene-synchrony-demoparty

Transcript:

BONUS EPISODE 001

[INTRODUCTION]

[0:00:08.7] ANNOUNCER: Welcome to The Podlets Podcast, a weekly show that explores Cloud Native one buzzword at a time. Each week, experts in the field will discuss and contrast distributed systems concepts, practices, tradeoffs and lessons learned to help you on your cloud native journey. This space moves fast and we shouldn’t reinvent the wheel. If you’re an engineer, operator or technically minded decision maker, this podcast is for you.

[EPISODE]

[0:00:41.9] CC: Hi, everybody. Welcome back to The Podlets. We have a new name. This is our first episode with a new name. Don’t want to go much into it, other than we had to change from The Kubelets to The Podlets, because The Kubelets conflicts with an existing project and we thought it was just better to change. The show, the concept, the hosts, everything stays the same. I am super excited today, because we have a special guest, Joe Beda, and Bryan Liles, Michael Gasch. Joe, just give us a brief introduction. The other hosts have been on the show before. People should know about them. Everybody should know about you too, but there's always newcomers in the space, so give us a little bit of a background.

[0:01:29.4] JB: Yeah, sure. I'm Joe Beda. I was one of the founders of Kubernetes back when I was at Google, along with Craig McLuckie and Brendan Burns, with a bunch of other folks joining on soon after.
I'm currently Principal Engineer at VMware, helping to cover all things Kubernetes and Tanzu related across the company. I came into VMware via the acquisition of Heptio, whose shirt Bryan's wearing today. I left Google, did that with Craig for about two years, and now it's almost a full year here at VMware. We're at 11 months officially as of two days ago. Yeah, really excited to be here.

[0:02:12.0] CC: Yeah, I am so excited. Your name is Joe Beda. I always say Joe Beda.

[0:02:16.8] JB: You know what? It's four letters and it's easy – it's amazing how many different ways there are to pronounce it. I don't get picky about it.

[0:02:23.4] CC: Okay, cool. Well, today I learned. I am very excited about this show, because basically, I get to ask you anything I want.

[0:02:35.9] JB: I’ll do my best to answer.

[0:02:37.9] CC: Yeah. You can always not answer. There are so many interviews of you out there on YouTube, podcasts. We are going to try to do something different. Let me fire the first question I have for you. When people interview you, they ask you the usual questions, the questions that are very useful for the community. What I want to ask you is this: what are people asking you that you think are the wrong questions?

[0:03:08.5] JB: I don't think there are any bad questions like this. I think that there's a ton of interest when we're talking about technical stuff at different parts of the Kubernetes stack. I think that there's a lot of business context around the container ecosystem and the companies, and around forming Heptio, all that. A lot of times, I'll have discussions around career and what led me to where I'm at now. I think those are all really interesting things to talk about. The one thing that I think doesn't always come across is that these things are all interrelated. At a certain point, the technology and the business and career and work-life – all those things really impact each other. I think it's a mistake to try and take these things in isolation. There's a ton of bleed-over. I think one of the things that we tried to do at Heptio – and I think we did a good job – is recognize that for anybody senior enough inside of any organization, they really have to be able to play all roles, right? At a certain point, everybody is a business person, fundamentally, in terms of actually moving the ball forward for the company, for the business as a whole. Yeah. I think one of the things that I enjoy is actually to be able to look at things from all those various different angles and try and find a good path forward.

[0:04:28.7] BL: All right. Taking that: so you've gone from big co to big co, to VC, to small co, to big co. What has that unique experience taught you, and what can you share with us?

[0:04:45.5] JB: Bryan, you know my resume better than I do, apparently. I started my career at Microsoft and cut my teeth working on Internet Explorer and doing client-side stuff there. I then went to Google, in the office up here in Seattle. It was actually in Kirkland, this little hole-in-the-wall, temporary office – pre-WeWork type of thing. I'm thinking, “Hey, I want to do some server-side stuff.” Worked on Google Talk, worked on ads, worked on cloud, started Kubernetes, was a little burned out. Took some time off, goofed off. Did this entrepreneur-in-residence thing for a VC, and then started Heptio, and then sold it to VMware.
[0:05:23.7] JB: When you're in a big company, especially when you're more junior, it's easy to get caught up in playing the game inside of that company. When I say the game, what I mean is that there are measures of success within big companies, and there are ways to advance, see approval, see rewards, that are all very specific to that company. I think the culture of a company is really defined by what the parameters are and what the success factors are for getting ahead inside of each of those different companies. I think a lot of times – I went to Microsoft straight out of college, I did a couple internships at Microsoft and then joined – leaving Microsoft that first time was actually really, really difficult, because there is this fear of, “Oh, my God. Everything's going to be super different.” It turns out that as you bounce around the industry a little bit, there's actually probably more alike than there is different.

The biggest difference, I think, between a large company and a small company is really – and I'll throw out some science analogies here. I think oftentimes organizations are a little bit like the ideal gas law. Okay, maybe going past y'all, but this is PV = nRT: pressure times volume equals number of molecules times temperature, and the R is a constant. The idea here is that this is an equation where, as you add more molecules to a constrained space, that will actually change the temperature and the pressure, and these things all rise. What happens inside of a large company is that you end up with so many people within a constrained space, in terms of the product space. When you add more people to the organization, or when you're looking to get ahead, it feels very zero-sum. It very much feels like, “Hey, for me to advance, somebody else has to lose.” That's not how the real world works, but oftentimes that's how it feels inside of a big company – it feels zero-sum like that. The liberating thing about being at a startup, and I think why so many people get addicted to working at startups, is that startups are fundamentally not zero-sum. Everybody succeeds and fails together. When a new person shows up, your thought process is naturally, “Awesome, we got more cylinders in the engine. We're going to go faster,” which is not always the case inside of a big company.

Now, I think as you get senior enough, all of a sudden these things change, because you're not just operating within the confines of that company. You're actually, again, playing a role in the business: you're looking at the ecosystem, you're looking at the community, you're looking at the competitive landscape, and that's where you have your eye on the ball, and that's what defines success for you. Not the internal company metrics, but really the business metrics is what defines success for you. The thing that I'm trying to do here at VMware now, as we do Tanzu, is make sure that we recognize the unbounded possibilities in front of us inside of this world, make sure that we actually focus our energy on serving customers, and in doing so, out-compete others in the market. It's not a zero-sum game; it's not something where, as we bring more folks on, we feel we're competing with them. That's a little rambling of an answer. I don't know if that links together for you, Bryan.

[0:08:41.8] BL: No, no. That was pretty good.

[0:08:44.1] JB: Thanks.

[0:08:46.6] MG: Joe, this is probably going to be a context switch now. You touched on the time when you went through the burnout phase.
Then last week, I think you put out a tweet – there's so much stuff going on, but you know which tweet I'm talking about. Yeah. In the Kubernetes community, you're a rock star. At VMware, you're already a rock star, being on stage at VMware, shaking hands with Pat. I mean, there are so many people, so many e-mails, so many Slacks, whatever, that you get every day, but still I feel you are able to keep the balance, stay grounded and always have a chat – even though sometimes I don't want to approach you, but sometimes I do, when I have some crazy questions, maybe. Still, you're not pushing people away. How do you deal with mental stress and prevent another burnout? What is the secret sauce here? Because I feel I need to work on that.

[0:09:37.4] JB: Well, I mean, it's hard. The tweet that I put out was – last week I was coming back from Barcelona and tired of travel. Right now, we're recording this just before KubeCon. Then after KubeCon, I'm planning to go to re:Invent in Vegas, which is just a social denial-of-service. It's just overwhelming being with that. I was tired of traveling. I posted something and it came across a little stronger than I wanted it to – that I just hate people, right? I was at that point where you're traveling and you just don't want to deal with anybody, and every little thing is really bugging you and annoying you.

I think burnout is an interesting thing. For me – and I think there are different causes for different folks – number one is that it's always fascinating: when you start a new job, your calendar is empty, your responsibilities are low. Then, as you are successful and you integrate yourself into the organization, all of a sudden you find that you have more work than you have time to do. Then you hit this point where you try and, like, “I'm just going to keep doing it. I'm going to power through.” Then you finally hit this point where you're like, “This is just not humanly possible.” Then you go into triage mode, and then you have to decide what's important. I know that there's more to be done than I can do. I have to be very thoughtful about prioritizing what I'm doing. There are a lot of techniques that you can bring to bear there: being explicit about what your goals and your priorities are, writing those things down, whether it's an OKR process or whether it's just, here are the top three things that I'm focusing on. Making sure that those things are purposefully meaningful to you, right?

Understanding the difference between urgent and important – these are business-booky type of things, but it's this idea that there are things that feel like they have to get done right now, and then there are things that are long-term important. If you're not thoughtful about how you do things, you spend all your time doing the urgent things, but you never get to the stuff that's actually long-term important. That's a really easy trap to get yourself into. Finding ways to delegate to folks is really, really helpful here, in terms of empowering others, trusting them. It's hard to let go sometimes, but I think being able to set the stage for other people to be successful is really empowering. Then, just recognizing it's not all going to get done, and that's okay. You can't hold yourself to expect that.
Now, with respect to burnout: for me, the biggest driver for burnout in my career has been when I felt personal responsibility over something, but I haven't had the tools, or the authority, or the ability to impact it. When you feel in your bones ownership over something, but yet you can't actually really own it, that is what causes burnout for me. I think there are studies talking about how the worst job is middle management. It's not being the CEO. It's not being new to the organization, being junior. It's actually being stuck in the middle, because you're given a certain amount of responsibility, but you aren't always given the tools necessary to be able to drive that. Whereas the folks at the top oftentimes don't have those constraints, so they actually own stuff and have agency to be able to take care of it. I think when you're starting out more junior in the organization, the scope of ownership that you feel is relatively minor. That being stuck in the middle is the biggest driver for me for burnout. A big part of that is just recognizing that sometimes you have to take a step back and personally divest that feeling of ownership when really it's not yours to own.

I'll give you an example. I started Google Compute Engine at Google, which is arguably the foundational cloud service for GCP. As it grew, as it became more important to Google, as it got reorged, more and more of the leadership and responsibilities and decision-making – I'm up here in Seattle, it moved down to Mountain View – went to folks who had been in the cloud market, or had been at Google for 10 or 15 years, coming in like, “Okay, that's cute. We got it from here,” right? That was a case where it was my thing. I felt a lot of ownership over it. It was clear after a certain amount of time: hey, you know what? I just work here. I'm just doing my job and I do what I do, but really it's these other folks that are driving the bus. That's a painful transition, to actually go from that feeling of ownership to “I just work here.” That, I think, is one of the reasons why oftentimes people leave companies. I think that was one of the big drivers for why I ended up leaving Google: that lack of agency to be able to impact things that I cared about quite a bit.

[0:13:59.8] CC: I think that's one reason why – well, I think that working in companies where things are moving fast, because they have a very clear, very worthwhile goal, provides you the opportunity to have so much work that you have to say no to a lot of things, like you were saying, and also to take ownership of pieces of that work, because there's more work to go around than people to do it. For example, since Heptio and VM– okay, I'm plugging. This is a big plug for VMware, I guess, but it definitely is a place that's moving fast. It's not crazy; it's reasonable, because pretty much every one of us is a grown-up. There is so much to do, and people are glad when you take ownership of things. That really, for me, is a big source of work satisfaction.

[0:14:51.2] JB: Yeah. I think it's that zero-sum versus positive-sum game. I think that there's a lot more room for you to actually feel that ownership, have that agency, have that responsibility when you're in a positive-sum environment, versus a zero-sum environment.

[0:15:04.9] BL: All right, so now I want to ask you a technical question.

[0:15:08.1] JB: All right.

[0:15:09.5] BL: Not a really hard one. Just more of how you think about this.
Kubernetes is five, almost five and a half years old. One of the key components of Kubernetes is etcd. Now, knowing what we know in 2019, would you still have used etcd as its key store? Or would you have gone another direction?

[0:15:32.1] JB: I think etcd is a good fit. The truth of the matter is that we didn't give that decision as much thought as we probably should have early on. We saw that it was relatively easy to stand up and get going with. At least on paper, it had the qualities that we were looking for, so we started building with it and then just ran with it. Something like ZooKeeper was also something we could have taken, but the operational overhead at the time of ZooKeeper was very different from etcd. I think we could have gone in the direction of, like [inaudible 0:15:58.5] for a lot of their tools, where they actually build the data store into the tool in a native way. I think that can lead in some ways to a simpler getting-started experience, because there's just one thing to boot up, but it's also more monolithic from a backup, maintenance, recovery type of thing. The one thing that I think we probably should have done there, in retrospect, is to try and create a little bit more of an arm's-length relationship between Kubernetes and etcd, in terms of having some cleaner interfaces, more of a contract there, so that we could have actually swapped something else out. There are folks that are doing it, so it's not impossible, but it's definitely not something that's easy to do, or well supported. I think that's probably the thing that I would change in that space. Another thing we might want to change: I think it might have been good to be more explicit about being able to actually shard things out, so that you could have multiple data stores for multiple resources and actually find a way to horizontally scale. Now, we do that with events, because we were writing events into etcd and that's just a totally different stream of data, but everything else right now – I think there's room to do this in the future. I think we've been able to push etcd vertically up until now. There will come a time where we need to find ways to shard that thing up horizontally.

[0:17:12.0] CC: Is it possible, though, to use a different data store than etcd for Kubernetes?

[0:17:18.4] JB: The things that I'm aware of here – and there may be more, and I may not be 100% up to date – is that I do know the Azure folks created a proxy layer that speaks the etcd protocol, but that is actually implemented on the backend using Cosmos DB. The approach there was to essentially create a translation layer. Then Rancher created this project – which is, if you will, a bit of a fork of Kubernetes – where they're, I believe, using PostgreSQL as the database for Kubernetes. I haven't looked to see exactly how they ended up swapping that in. My guess is that there's some chewing gum and baling wire, and it's quite a bit of effort for each version upgrade to be able to actually adapt that moving forward. Don't know for sure. I haven't looked deeply.

[0:18:06.0] CC: Okay. Now I would love to philosophize a little bit, or maybe a lot, about Kubernetes. In the spirit of thinking of different questions to ask, I had a bunch of questions, and then I was thinking, “How could I ask this question in a different way?” Maybe this is not the right “question.” Here is the way I came up with this question: we're so divided out there.
One camp loves Kubernetes. Another camp: “So hard, so complicated, it's so complex. Why even bother with it? I don't understand why people are using this.” Basically, there is that sentiment that Kubernetes is complicated. I don't think anybody would refute that. Now, is that even the right way to talk about Kubernetes? Is it even supposed to not be complicated? I mean, what kind of a tool is it that we are thinking it should just work, it should be super simple? Is it true that it should be a super simple tool to use?

[0:19:09.4] JB: I mean, that's a loaded question [inaudible]. Let me just first say that, number one, if people are complaining – I mean, I'm stealing this from Tim [inaudible], who I think takes some of these things in stride – if people are complaining, then you're relevant, right? If nobody is complaining, then nobody cares about what you're doing. I think that it's a good thing that folks are taking a critical look at Kubernetes. That means that they're taking a look at it, right? Five years in, Kubernetes is on an upswing. That's not necessarily going to last forever. I think we have work to do to continually earn Kubernetes's place in the technology stack over time.

Now, that being said, Kubernetes is a super, super flexible tool. It can do so many things in so many different situations. It's used for everything from retail stores – across tens of thousands of stores – to all sorts of solutions. People are looking at it for telco, 5G. People are looking at it to even run it inside cars, which scares me, right? Then all the way up to folks like at CERN using it to do data analytics for high-energy physics, right? The technology that I look at that's probably most comparable is something like Linux. Linux is actually scalable from everything from a phone all the way up to an IBM mainframe, but it's not easy, right? I mean, to be able to adapt it across all those things, you have to essentially download the kernel, type make config, and then answer 5,000 questions – for those who haven't done that. It's not an easy thing to do.

I think that a lot of times, people might be looking at Kubernetes at the wrong level to be able to say this should be simple. Nobody looks at the Linux kernel that you get from git cloning Linus's fork, compiles it, and says, “Yeah, this is too hard.” Of course it's hard. It's the Linux kernel. You expect that you're going to have a curated experience if you want something easy, right? Whether that be an Android phone, or Ubuntu, or what have you. I think to some degree, we're still in the early days, where people are dealing with it at a raw level, versus actually dealing with it in a more opinionated way. Now, I think the fascinating thing for Kubernetes is that it provides a lot of the extension points and patterns, so that we don't know exactly what those higher-level, easier-to-use abstractions are going to look like, but we know, or at least we're pretty confident, that we have the right tools and the right environment to be able to experiment our way there. I think we're not there yet, but we're set up for success. That's the first thing.

The second thing is that Kubernetes introduces a whole bunch of different concepts and ideas, and these things are different and uncomfortable for folks. It's hard to learn new things. It's hard for me to learn new things, and it's hard for everybody to learn new things.
When you compare Kubernetes to, say, getting started with the modern front-end web development stack – with things like Babel and React, and how do you deploy this, and what are all these different options, and it changes on a weekly basis – there's a hell of a lot in common, actually, between these two ecosystems. They're both really hard, they both introduce all these new concepts, and you have to be embedded in it to really get it. Now, that being said, if you just want to take raw JavaScript or jQuery and have at it, you can do it, and you'll see Hacker News articles every once in a while where people are like, “Hey, I've programmed my site with jQuery and it's just fine. I don't need all this new stuff,” right? Just like you'll see folks saying, “I just SSH'd in and actually ran some stuff and it works fine. I don't need all this Kubernetes stuff.” If that works for you, that's great. Kubernetes doesn't have to solve every problem for every person.

Then the next thing is that I think there are a lot of people who've been solving these problems again and again and again, but they've been solving them in their own way. It's not uncommon, when you look at back-end systems, to join a company, look at what they've built, and find that it's a complicated, bespoke system of chewing gum and baling wire, with maybe a little bit of Ansible, maybe a little bit of Puppet and Bash. Everybody has built their own complex, overwrought system to do a lot of the stuff that Kubernetes does. I think one of the values that we see here is that these things are complex – uniquely complex – but shared complexity is more valuable than personal complexity. If we can agree on some of these concepts, then that's something that can be leveraged widely, and it will fade to the background over time, versus having everybody invent their own complex system every time they need to solve these problems. With all that said, we've got a ton of work to do. It's not like we're done here, and I'm not going to sit here and say Kubernetes is easy, or that every complex thing is absolutely necessary and that we can't find ways to simplify it. We clearly can. I just think that when folks say, “Hey, I just want this to be easy,” they're being a little bit too naïve, because it's a very difficult problem domain.

[0:23:51.9] BL: I'd like to add on to that. I think about this a lot as well. Something that Joe said to me a few years back – that Kubernetes is the platform for creating platforms – is very applicable here. As an industry, we need to stop looking at Kubernetes as some destination. Your destination is really running the applications that give you pleasure, or make your business money. Kubernetes is a tool to enable us to think about our applications more, rather than the underlying ecosystem. We don't think about servers. We don't want to think about storage and networking, even things like finding things in your cluster – you don't think about that; Kubernetes gives it to you. If we start thinking about Kubernetes as a way to enable us to do better things, we can go back to what Joe said about Linux. Back whenever I started using Linux in the mid-90s, guess what? We compiled it – make bzImage. That stuff was hard and it was slow. Now think about this: in my office I have three different Linux distributions running. You know what? I don't even think about it anymore. I don't think about configuring X. I don't think about anything.
One thing that's going to come from Kubernetes as it grows is that we're going to figure out these problems, and it's going to allow us to think of these other crazy things, which is going to push the industry further. Think, maybe 20 years from now, if we're still running Kubernetes: who cares? It's just going to be there. We're going to think about some other problem, and it could be amazing. These are good times.

[0:25:18.2] JB: At one point – sorry, the dog's going to bark here. I mean, at one point people cared about the BIOS that they were running on their computers, right? That was something that you stressed out about. I mean, back in the bad old days, when I was doing DOS gaming, you're like, “Oh, well, this BIOS is incompatible with the –” IRQs and all that. It's just background now.

[0:25:36.7] CC: Yeah, I think about this too as a developer. I might have mentioned this before on this podcast. I have never gone from one job to another job and had to use the same deployment system. Every single job I've ever had, the deployment system is completely different – a completely different set of tooling and a completely different process. Just being able to walk from one job to another job and be able to use the same platform for deployment, it must be amazing. On the flip side, being able to hire people who join your organization already knowing how your deployment works – that has value in itself. It's a huge value that I don't think people talk about enough.

[0:26:25.5] JB: Well, honestly, this was one of the motivations for creating Kubernetes. I looked around Google early on, and Google was really good at importing open source, circa 2000, right? This is like, “Hey, you want to use libpng, or you want to use this library, or whatever.” That was the type of open source that Google was really, really good at using. Then Google did things like, say, release the Bigtable paper. Then somebody went through and created Cassandra out of it. Maybe there are some ideas in Cassandra that actually build on top of Bigtable, or you're looking at MapReduce versus Hadoop. All of a sudden, you found that these things diverged, and Google had zero ability to actually import open source, circa 2010, right? It could not import those systems back, because the operational characteristics of these things were wholly alien when compared to something like Borg. You see this also – we would acquire companies, and it would take those companies way too long to essentially re-platform themselves on top of Borg, because it was just so different. This is one of the reasons, honestly, why we ended up doing something like GCE: to actually have a platform that was more familiar for acquisitions. It's one of the reasons we did it. Then also, introducing Kubernetes – it's not Borg. It's a cousin of Borg inside of Google. For those who don't know, Borg is the container system that's been in production at Google for probably 15 years now, and the spiritual grandfather to Kubernetes in a lot of ways. A lot of the ideas that you learn from Kubernetes are applicable to Borg. It's not nearly as big a leap for people to actually change between them as it was before Kubernetes was out there.

[0:27:58.6] MG: Joe, I've got a similar question, because it seems to me like you're a platform builder. You've worked on GCE, Kubernetes obviously. If you were talking to another platform architect or builder, what would be something that you would recommend to them, based on your experiences?
What is a key ingredient, technically speaking, of a platform that you should be building today – or the main thing, or the lesson learned that you had from building those platforms? Technical advice, if you will.

[0:28:26.8] JB: I mean, that's a really good question. I think, in my mind, the mark of a good platform is when people can use it to do things that you hadn't imagined when you were building it, right? The goal here is that you want a platform to be a force multiplier. You want to enable people to do amazing things. You compare, again, the Linux kernel, or even something as simple as our electrical grid, right? The folks who established those standards God knows how long ago – 150 years ago or whenever, the whole Tesla versus Thomas Edison, [inaudible] – nobody had any idea the long-term impact that would have on society over time. I think that's the definition of a successful platform, in my mind. You've got to keep that in mind, right?

I think that, for me, a lot of times people design for the first five minutes at the expense of the next five years. I've seen a lot of times where you design for, “Hey, I'm giving a presentation. I want to be able to fit something amazing on one slide.” You do it, but then all of a sudden somebody wants to do something different. They want to go off course, they want to go off the rails, they want to actually experiment, and the thing is just brittle. It's like, “Hey, it does this. It doesn't do anything else. You want to do something else? Sorry, this isn't the tool for you.” For me, I think that's a trap, right? Because it's easy to get early users based on that very curated experience. It's hard to keep those users as they actually start using the thing in anger, as they start interfacing with the real world, as they deal with things that you didn't think of as a platform. I'm always thinking about: how can everything that you put in the platform be used in multiple ways? How can you actually make these things composable building blocks? Because then that gives you the opportunity for folks to actually compose them in ways that you didn't imagine starting out. I think that's some of it.

I started my career at Microsoft, working on Internet Explorer. The fascinating thing about Microsoft is that, through and through and through and through, Microsoft is a platform company. It started with DOS and Windows and Office – and even Office is viewed as a platform inside of Microsoft. They fundamentally understand, in their bones, the benefit of actually starting that platform flywheel. It was really interesting to actually be there for the first browser wars of IE versus Netscape, when I started my own career, and to see the fact that Microsoft always saw Internet Explorer as a platform, whereas I think Netscape didn't really get it in the same way, right? They didn't understand the potential, I think, in the way that Microsoft did. For me – I mean, where you start your career oftentimes sets your patterns in terms of how you look at things over time. I think a lot of this platform thinking comes from just imprinting when I was a baby developer, I think. I don't know. It takes a lot of time to really internalize that stuff.

[0:31:14.1] BL: The lesson here – and this is a good one – is that when we're building things that are way bigger than us, don't think of your product as the end goal. Think of it as an enabler. When it's an enabler, that's where you get that X multiplier.
Then that's where you get all the residuals. Microsoft actually is a great example of it. My gosh – just think of what Microsoft has been able to do with the power of Office.

[0:31:39.1] JB: Yeah. I look at something like VB in the Microsoft world. We still don't have VB for the cloud era. We still haven't created that. I think there's still opportunity there to actually strike. VB back in the day, for those who weren't there, struck this amazing balance of being easy to get started with, but also something that could actually grow with you over time, because it had all these extension mechanisms – there were marketplace controls that you could buy, you could partner with other developers that were writing C or C++. It was an incredible platform. Then they leveraged Office to extend the capabilities of VB. It's an amazing ecosystem. Sorry, I didn't mean to interrupt you, Bryan.

[0:32:16.0] BL: Oh, no. That's all good. I get as excited about it as you do whenever I think about it. It's a pretty exciting place to be.

[0:32:21.8] JB: Yeah. I'll talk to VCs, because I did a startup and the EIR thing, and I'll have them ask me things like, “Hey, where should we invest in the Kubernetes space?” My answer, using the hockey analogy, is like, “You've got to go where the puck is going.” Invest in the things that Kubernetes enables. What are the things that people can do now that they couldn't do pre-Kubernetes? Those are the things where we're going to see the explosion of growth. It's not about the Kubernetes. It's really about a larger ecosystem that Kubernetes is the seed crystal for.

[0:32:56.2] BL: For those of you listening, if you want to get anything out of here, rewind back about 20 seconds and play that over and over again – what Joe just said.

[0:33:04.2] MG: Yeah. This was brilliant.

[0:33:05.9] BL: It's where the puck is going. It's not where we are now. We're building for the future. We're not building for now.

[0:33:11.1] MG: I'm looking at these tweetable quotes here – the last 20 seconds, so many tweetable quotes. We'll have to decide which ones to tweet, then.

[0:33:18.5] CC: Well, we'll tweet them all.

[0:33:20.0] MG: Oh, yes.

[0:33:21.3] JB: Here's another thing – here's another piece of career advice. Successful people are good storytellers. You can have the most beautiful technology; if you can't tell the human story about it, about what it does for folks, then nobody will care. I spend a lot of time on Twitter – probably too much time, if you ask my family. That medium of being able to actually distill your thoughts down into something that is tweetable, quotable, really potent – that is a skill that's worth developing, and it's a skill that's worth valuing. Because there are things that are rolling around in my head, and I still haven't found a way to get them into a tweet. At some point, I'll figure it out and it'll be a thing. It takes a lot of time to build that skill, to be able to refine like that.

[0:34:08.5] CC: I want to share an anecdote of my own. I interviewed at a small – a tiny startup, maybe less than 10 people at the time, in Cambridge, back when I lived up there. The guy was borderline on wanting to hire me or not. I sent him an e-mail to try to influence his decision, and it was a long-ass e-mail. They said, “No, thank you.” Then – I think we had a good rapport – I said, well, is there anything you can tell me about your decision, then? He said something along the lines of, I was too verbose. That was pre-Twitter.
Twitter, I think, existed, but it was at the very beginning; I wasn't using it. Yeah, people: be concise. Decision-makers don't have time to read long things. You need to be able to convey your message in a few short sentences. It's crucial.

[0:35:07.5] BL: All right, so we're nearing the end. I want to ask another question, because these are random questions for Joe. Joe, it is the week before KubeCon North America 2019, and today is actually an interesting day. A couple of neat things happened today. We had Docker – it was neat. Docker split somewhat and sold part of itself, and now they're going to be a tools company. That's neat. We're all still trying to decode what that actually is. Here's the neat piece: Apple released a laptop that can have 64 gigabytes of memory.

[0:35:44.4] MG: Has an escape key.

[0:35:45.7] BL: It has an escape key.

[0:35:47.6] MG: This is brilliant.

[0:35:48.6] BL: Yeah. I think the question was, what do you think about that?

[0:35:52.8] JB: Okay. Well, so first of all, I mean, Docker is fascinating, and I think there are a lot of lessons there, and I'm not sure I'm the one to tell them. I think it's easy to armchair-quarterback these things. It's hard to live that story. I think that it's fun to play that what-if game. I think it does show that this stuff is hard. You can have everything in your grasp and then just have it all slip away. I think that's not anybody's fault. It's just that there are different strategies, different approaches, in how this stuff plays out over time.

On the laptop thing: I think my current laptop has 16 gigs of RAM. One of the things that we're seeing is that as we move towards a microservices world – I gave a talk about this probably three or four years ago. As we move to a microservices world, I think there's one stage where you create a bunch of microservices, but you still view those things as an app. You say, “This microservice belongs to this app.” Within a mature organization, those things start to grow, and eventually what you find is that you have services that are actually useful for multiple apps. Your entire production infrastructure becomes this web of services that are calling each other. Apps are just entry points into these things at different points of that web of infrastructure.
With 64 gigs of RAM, I can run more on my laptop, right? There's a little bit of kicking that can down the road: there's this race between going more microservicey and how much I can fit on my laptop. The interesting thing is, where is this going to end? Are we going to have the ability to bring more and more onto your laptop? Are you going to run in some split-brain fashion, where people create network connections between the local services and the remote ones? Or are we going to move to a world where you're doing more development on the cluster, in the cloud, and your laptop gets thinner and thinner? Either you absolutely need 64 gigs because you're pushing up against the boundaries of what you can do on your laptop, or you've given up and it's all running in the cloud, and you might as well just use a Chromebook. It's fascinating that we're seeing this divergence between scaling up and actually moving stuff to the cloud. I can tell you that at Google, a lot of folks, even developers, can be super, super productive with something relatively thin like a Chromebook, because there are so many tools there targeted at doing all that stuff remotely, in Google's production data centers and such. That's, I think, the interesting implication of 64 gigabytes of RAM from a developer's point of view. What are you going to do, Bryan? Are you going to get the 64-gig Mac? Are you going to do it? [0:39:11.2] BL: It's already coming. It'll be here the week after next. [0:39:13.2] JB: You already ordered it? You are such an Apple fanboy. Oh, man. [0:39:18.6] BL: Oh, I am. Not to go too much into it, but I am a fan of lots of memory. You know what? We work in this cloud native world. In any given week, I'll work on four to five projects. I'm lazy. I don't want to shut any of them down. Now, with 64 gigs, I don't have to shut anything down. [0:39:37.2] JB: It was so funny. When I was at Microsoft, everybody focused on Windows boot time. They were like, "We've got to make it boot faster." I'm like, I don't boot that often. I just want the thing to resume from sleep, right? Make that reliable and build on that theme. [0:39:48.7] CC: Yeah. I frequently have to restart my computer because of memory issues. I don't want to hunt down which app is taking up memory. I have a tool I could look it up with, but I just shut the machine down and flush the memory. I do have a question related to Docker. I don't know if it's right to say that Kubernetes is reliant on Docker, because I know it works with other container technologies as well. In the worst-case scenario (and obviously, I have no reason to predict this), where Docker, let's say, is discontinued, how would that affect Kubernetes? [0:40:25.3] JB: Early on, when we were doing Kubernetes and were in this relationship with a company like Docker, I looked at what Docker was doing and asked, "Okay, where is the real value here over time?" In my mind, the interface with developers, that distributed kernel, that API surface area of Kubernetes, was really the thing, and a lot of the Docker stuff was going to fade into the background over time. I think we've seen that happen, because when we talk about production systems, we have definitely moved past Docker: we have the CRI, and we have containerd, which was essentially built by Docker, donated to the CNCF, and made its way toward graduation. I think it's graduated now. The governance ties to Docker have been severed at this point. In production systems for Kubernetes, we've moved past Docker. I still think the developer experience oftentimes relies on Docker and things like Dockerfiles, but I think we're moving past that also.
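A small aside for the curious: "moved past Docker" is quite literal in code. A hedged sketch using the containerd Go client (this assumes the github.com/containerd/containerd module; the socket path and the "default" namespace are the usual defaults, but still assumptions about your setup) talks to the same runtime Kubernetes drives, with no Docker daemon in the picture:

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Connect straight to containerd's socket; no Docker daemon involved.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// containerd scopes resources by namespace; "default" is conventional.
	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Smoke test: report the daemon version and list locally stored images.
	v, err := client.Version(ctx)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("containerd version:", v.Version)

	images, err := client.ListImages(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, img := range images {
		fmt.Println(img.Name())
	}
}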
I think that if Docker were to disappear off the face of the earth, there would be some adjustment, but I think we have the right toolkits and the right systems to handle that. Some of them are open sourced by Docker as part of the Moby project. The whole Dockerfile evaluation flow is actually in this thing called BuildKit, which you can use in contexts outside of the Docker engine. That's where a lot of the build action is. The thing I think is most influential, the thing that will actually stand the test of time, is the Docker container image format: that artifact that you upload and download, and the registry APIs around it. Those things have now been codified and are moving forward, slowly, under the OCI, the Open Container Initiative, which is a bit of a sister foundation to the CNCF. I think that's the influence over time. Then, related to that, I think the world should be a little bit worried about Docker Hub and what all this means for it over time, because that is not a cheap service to run. It's run as a public good, similar to GitHub. If the commercial aspects of that are not healthy, then it could be disruptive if something bad happens to Docker Hub itself. I don't know what the overnight replacement for it would be. That'd be incredibly disruptive. [0:42:35.8] CC: It should be Harbor. [0:42:37.7] JB: I mean, Harbor is a thing, but somebody's got to run it and somebody's got to pay the bandwidth bills, right? Thank you to Docker for paying those bandwidth bills, because that's been good not just for Docker, but for our entire ecosystem. I don't know what that looks like moving forward. Maybe GitHub, with GitHub artifacts, is going to pick up the slack. We're going to have to see.
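To make the image-format point concrete: an OCI image manifest is just a small JSON document that names a config blob and a list of layer blobs by digest. A minimal sketch of decoding one in Go follows; the struct fields mirror the published OCI image spec, while manifest.json is a hypothetical local copy of a manifest pulled from a registry:

package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Descriptor points at one content-addressed blob in a registry.
type Descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
}

// Manifest mirrors the core fields of an OCI image manifest.
type Manifest struct {
	SchemaVersion int          `json:"schemaVersion"`
	MediaType     string       `json:"mediaType,omitempty"`
	Config        Descriptor   `json:"config"`
	Layers        []Descriptor `json:"layers"`
}

func main() {
	raw, err := os.ReadFile("manifest.json") // hypothetical local copy
	if err != nil {
		panic(err)
	}
	var m Manifest
	if err := json.Unmarshal(raw, &m); err != nil {
		panic(err)
	}
	fmt.Printf("config blob %s (%d bytes), %d layers\n",
		m.Config.Digest, m.Config.Size, len(m.Layers))
}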
[0:42:58.6] MG: Good. I have one last question from my end. Totally different topic, not Docker at all. Or maybe it is; that depends on your answer. The question is: you're a very technical person. What is the technology, or the stuff, that your brain is currently spinning on, if you can disclose it? Obviously, no secrets. What keeps you awake at night, in your brain? [0:43:20.1] JB: A couple of things. Stuff that's just completely different from our world, I think, is interesting. We've entered a place where programming computers is so specialized. Again, if you made me be a front-end developer, I would flail for several months trying to figure out how to even be productive, right? I think it's similar when we look at something like machine learning; there's a lot of stuff happening there really fast. I understand the broad strokes, but I can't say I understand it to any deep degree. The amount of diversity in this world, and of stuff to learn, is fascinating and exciting. Bryan has asked me in the past, "Hey, if you were going to quit and start a new career and do something different, what would it be?" I think I would probably do something like generative art. There are folks out there writing programs to generate art, a little bit of a moral descendant of the demoscene. That was, I don't know; when did the demoscene happen, Bryan? When was that? [0:44:19.4] BL: Oh, mid 90s, or early 90s. [0:44:22.4] JB: That's right. I was never super into that. I don't think I was smart enough. It's crazy stuff. [0:44:27.6] MG: I actually used to write demos. [0:44:28.8] JB: I know you did. I know you did. Okay, for those not familiar, the demoscene was essentially this: you wrote x86 assembly code to do something cool on screen. It was all generated, so the amount of code was vanishingly small. It was this puzzle/art/technical tour de force type of thing. [0:44:50.8] BL: We wrote trigonometry in a similar way. That's literally what we did. [0:44:56.2] JB: I think a lot of that stuff ends up being fun. For stuff that's related to our world, I think about how we move up the stack. A lot of folks are focused on the developer experience and how we make that easier. One of the things, through the lens of VMware and Tanzu, is looking at how this stuff starts to interface with organizational mechanics. How does the typical enterprise work? How do we make sure we deliver a toolset that works with the organization, versus against it? That, I think, is an interesting area, and it's hard because it involves people. Back-end people like programmers love it when they don't have to deal with those pesky people, right? They get to define their interfaces, and their interfaces are pure and logical. UI work, UX work, anytime you deal with people, that's the hardest thing, because you don't get to tell them how to think. They tell you how to think, and you have to adapt to it, which is a real difference from how a lot of back-end, purely logical folks operate. There's an aspect of that which is user experience at the consumer level, there's developer experience, and then there's a whole class of things which is maybe organizational experience: how do you interface with the organization, versus just with individuals, whether from the developer or the end-user point of view? I don't know that, as an industry, we have our heads wrapped around that organizational experience yet. [0:46:16.6] CC: Well, we have arrived at the end. It makes me so sad, because we could easily talk for two more hours. [0:46:24.8] JB: Yeah, we could definitely keep going. [0:46:26.4] CC: We're going to bring you back, Joe. Don't worry. [0:46:28.6] JB: For sure. Anytime. [0:46:29.9] CC: Or do worry. All right, we are going to release these episodes right after KubeCon. Glad everybody could be here today. Thank you. Make sure to subscribe and follow us on Twitter. Follow us everywhere, and suggest episode topics for us. Bye, and until next time. [0:46:52.3] JB: Thank you so much. [0:46:52.9] MG: Bye. [0:46:54.1] BL: Bye. Thank you. [END OF EPISODE] [0:46:55.1] ANNOUNCER: Thank you for listening to The Podlets Cloud Native Podcast. Find us on Twitter at https://twitter.com/ThePodlets and on the http://thepodlets.io/ website, where you'll find transcripts and show notes. We'll be back next week. Stay tuned by subscribing. [END] See omnystudio.com/listener for privacy information.

RWpod - a podcast about the world of Ruby and Web technologies
Episode 22 of season 06. Microsoft + GitHub, Http.rb is Great, The Cult of the Complex, Nuclide, ProppyJS, Minipack, and more

RWpod - a podcast about the world of Ruby and Web technologies

Play Episode Listen Later Jun 3, 2018 42:59


Good day, dear listeners. We present a new episode of the RWpod podcast. In this episode: Ruby: Rails 5.2 introduces allow_other_host option to redirect_back method; Deprecating Paperclip; Faster JSON generation using PostgreSQL JSON function; Simple, Async, Map/Reduce queue for Ruby; Setup ELK for NGINX logs with Elasticsearch, Logstash, and Kibana; Is Your Rails Team Slowing Down? Here's Why, And What You Can Do About It; Http.rb is Great; and Ruby has Character. JavaScript: Microsoft + GitHub; The Cult of the Complex; Introduction to the Headless CMS; Firebase Cloud Functions: the great, the meh, and the ugly; 11 Javascript Utility Libraries You Should Know In 2018; Nuclide, an open IDE for web and native mobile development built on top of Atom; ProppyJS, functional props composition for components; Critters-webpack-plugin, a Webpack plugin that inlines your app's critical CSS and lazy-loads the rest; Minipack, a simplified example of a modern module bundler written in JavaScript; and Hiper, a statistical analysis tool for performance testing

The NoSQL Database Podcast
NDP005: Big Data and Where it Fits with NoSQL

The NoSQL Database Podcast

Play Episode Listen Later Jun 24, 2016 31:44


In this episode I'm joined by Anand Iyer from Cloudera, and we chat about widely used Big Data platforms and technologies. For example, we discuss Hadoop and its underlying MapReduce technology. We also talk about Apache Spark and what it brings to the table in comparison to Hadoop. Anand does a great job of getting into the fine technical details of all the technologies discussed in the episode. If you have any questions regarding this episode, send them to advocates@couchbase.com.
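For listeners new to the model discussed here: MapReduce splits a job into a map phase that turns each input record into key/value pairs and a reduce phase that aggregates all values sharing a key. The sketch below compresses that idea into a single Go process as a word count; it illustrates the programming model only, not Hadoop's actual (Java) API, and the sample input is made up:

package main

import (
	"fmt"
	"strings"
)

// mapPhase mimics a MapReduce mapper: one input record (a line of text)
// becomes a set of (word, count) pairs.
func mapPhase(line string) map[string]int {
	counts := make(map[string]int)
	for _, w := range strings.Fields(strings.ToLower(line)) {
		counts[w]++
	}
	return counts
}

// reducePhase mimics the reducer: merge every mapper's output by key.
func reducePhase(partials []map[string]int) map[string]int {
	total := make(map[string]int)
	for _, p := range partials {
		for w, c := range p {
			total[w] += c
		}
	}
	return total
}

func main() {
	// Made-up input; in Hadoop, each line could be mapped on a different node.
	lines := []string{"the quick brown fox", "the lazy dog", "the fox"}
	partials := make([]map[string]int, 0, len(lines))
	for _, l := range lines {
		partials = append(partials, mapPhase(l))
	}
	fmt.Println(reducePhase(partials)) // map[brown:1 dog:1 fox:2 lazy:1 quick:1 the:3]
}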

The NoSQL Database Podcast
NDP004: Querying with NoSQL

The NoSQL Database Podcast

Play Episode Listen Later May 16, 2016 32:45


In this podcast episode I'm joined by my colleagues Keshav Murthy, the Director of Query, and Prasad Varakur, the Product Manager for N1QL, at Couchbase. The goal of this episode is to discuss how querying works in a NoSQL database, what options are available to you, and how it differs from querying a relational database. For any questions related to this episode, direct them to advocates@couchbase.com.

CERIAS Security Seminar Podcast
Ariel Feldman, Verifying Computations with (Private) State

CERIAS Security Seminar Podcast

Play Episode Listen Later Nov 11, 2015 56:24


Is it possible for Alice to compute a result and for Bob to be convinced of its correctness without having to re-execute the computation? What if the computation is performed over sensitive data that Bob is not allowed to see due to privacy concerns? Recent work on proof-based verifiable computation has brought these goals much closer to practicality. In this talk, I will present two implemented systems that incorporate verifiable computation in order to build realistic applications. The first, Pantry, enables a user to outsource a general-purpose computation to a potentially faulty cloud provider and yet verify that the computation was performed correctly. Unlike prior efforts, Pantry allows verifiable computations to operate on remotely stored data, opening the way to a wide variety of uses such as MapReduce jobs and database queries. The second system, VerDP, aims to resolve the conflict in many research studies between the verifiability of the results and the privacy of the study participants. VerDP accepts queries over sensitive data that are written in a domain-specific language and processes them only if a) it can certify that the result will not compromise individuals' privacy, and if b) it can prove the integrity of the result to the public. Experimental evaluation shows that VerDP can successfully process several types of useful queries, and that the cost of generating and verifying the proofs is practical. About the speaker: Ariel Feldman is an Assistant Professor of Computer Science at the University of Chicago. His research lies at the intersection of computer security and distributed systems. He is presently focused on finding new ways to protect the security and privacy of users of "cloud-hosted" services. His interests also include software and network security, data privacy, anonymity, and electronic voting, as well as the interaction between computer security, law, and public policy. Previously, he was a postdoctoral researcher at the CIS department at the University of Pennsylvania, and he received his Ph.D. in Computer Science from Princeton University in 2012.

CERIAS Security Seminar Podcast
Savvas Savvides, Practical Confidentiality Preserving Big Data Analysis in Untrusted Clouds

CERIAS Security Seminar Podcast

Play Episode Listen Later Jan 28, 2015 48:11


The "pay-as-you-go" cloud computing model has strong potential for efficiently supporting big data analysis jobs expressed via data-flow languages such as Pig Latin. Due to security concerns — in particular leakage of data — government and enterprise institutions are however reluctant to moving data and corresponding computations to public clouds. In this talk we will discuss Crypsis, a system that allows execution of MapReduce-style data analysis jobs directly on encrypted data. Crypsis transforms data analysis scripts written in Pig Latin so that they can be executed on encrypted data. Crypsis to that end employs existing practical partially homomorphic encryption (PHE) schemes, and adopts a global perspective in that it can perform partial computations on the client side when PHE alone would fail. About the speaker: Savvas Savvides is a PhD student in Computer Science at Purdue University. He earned his Master's degree in Computer Science from New York University and his Bachelor's in Computer Science from the University of Manchester. His primary research interests include Information Security, Distributed Systems and Cloud Computing. His current research focus is on devising practical solutions for confidentiality preserving big data analysis jobs.

.NET Rocks!
Andrew Brust Processes Big Data

.NET Rocks!

Play Episode Listen Later Jan 22, 2013 61:50


Carl and Richard talk to Andrew Brust about Big Data. Andrew starts off connecting the definitions of business intelligence, data analytics, OLAP, data warehousing, and big data. They're all related, even though they've come at the problem of understanding data from different directions. The conversation digs deeply into Hadoop, the Linux-centric MapReduce technology that has come to define the idea of Big Data, as well as Microsoft's implementation, once called Project Isotope and now known as HDInsight. How big is Big Data? That's up to you! Support this podcast at https://redcircle.com/net-rocks/donations