POPULARITY
An airhacks.fm conversation with Colt McNealy (@coltmcnealy) about: first computing experience with Sun workstations and network computing, background in hockey and other sports, using System76 Linux laptops for development, starting programming in high school with Java and later learning C, Fortran, assembly, C++ and Python, working at a real estate company with Kubernetes and Kafka, the genesis of LittleHorse from experiencing challenges with distributed microservices and workflow management, LittleHorse as an open source workflow orchestration engine using Kafka as a commit log rather than a message queue, building a custom distributed database optimized for workflow orchestration, the recent move to fully open source licensing, comparison with AWS Step Functions but with more capabilities and open source benefits, using RocksDB and Kafka Streams for the underlying implementation, performance metrics of 12-40ms latency between tasks and hundreds of tasks per second, the multi-tenant architecture allowing for serverless offerings, integration with Kafka for event-driven architectures, the distinction between orchestration and choreography in distributed systems, using Java 21 with benefits from virtual threads and generational garbage collection, plans for Java 25 adoption, the naming story behind "Little Horse" and its competition with MuleSoft, the Sun Microsystems legacy and innovation culture, recent adoption of Quarkus for some components, the "Know Your Customer" flow as the Hello World example for LittleHorse, the importance of observability and durability in workflow management, plans for serverless offerings and multi-tenant architecture, the balance between open source core and commercial offerings. Colt McNealy on Twitter: @coltmcnealy
Data streaming and stream processing with Apache Kafka and its surrounding ecosystem. Quite a few processes in software development, and in data processing in general, don't have to happen at request time; they can be handled asynchronously or decentrally. Terms like batch processing or message queueing / pub-sub are well known for this. But there is a third player in this game: stream processing. Apache Kafka is the flagship here, the distributed event streaming platform that is usually the first one mentioned. But what actually is stream processing, and how does it differ from batch processing or message queuing? How does Kafka work, and why is it so successful and performant? What are brokers, topics, partitions, producers and consumers? What does change data capture mean, and what is a sliding window? What do you have to watch out for, and what can go wrong, when you want to write and read a message? Our guest Stefan Sprenger delivers the answers and much more. Bonus: how to explain stream processing using a breakfast table for five-year-olds. You can find our current advertising partners at https://engineeringkiosk.dev/partners
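For listeners who want to see the producer and consumer roles discussed in the episode in code, here is a minimal Java sketch using the Apache Kafka client. The broker address ("localhost:9092"), topic name ("orders"), and group id are placeholders chosen for illustration, not anything from the episode.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaRoundTrip {
    public static void main(String[] args) {
        // Producer: appends a record to a partition of the "orders" topic on the broker.
        Properties producerProps = new Properties();
        producerProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        producerProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        producerProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            // The record key determines which partition the record lands on.
            producer.send(new ProducerRecord<>("orders", "order-42", "{\"amount\": 19.99}"));
        }

        // Consumer: reads the same topic as an ordered stream, tracked by offsets per partition.
        Properties consumerProps = new Properties();
        consumerProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        consumerProps.put(ConsumerConfig.GROUP_ID_CONFIG, "order-readers");
        consumerProps.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        consumerProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        consumerProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("orders"));
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                        record.partition(), record.offset(), record.key(), record.value());
            }
        }
    }
}
```

The point of the sketch is the division of roles the episode asks about: brokers store partitioned, append-only topics; producers write to them; consumers in a group read them back by offset.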
In this video I speak with Felix GV, who is a Principal Staff Engineer at LinkedIn and has made major contributions to LinkedIn's data infrastructure, including VeniceDB. This episode will give you a good understanding of why we need a new database for storing "derived data" in a low-latency, high-performance manner, which is very important for machine learning workloads. Chapters: 00:00 Introduction 01:42 The Evolution of LinkedIn's Databases 03:15 Challenges with Voldemort and the Birth of VeniceDB 08:42 Understanding Derived Data 13:33 Planet-Scale Applications and Multi-Region Support 17:40 Writing Data into VeniceDB 22:53 Merging Data in VeniceDB 40:31 Understanding the Architecture 40:47 Components of the Write Path 41:56 Leader and Follower Architecture 43:58 Partitioning and DaVinci Client 47:57 Read Patterns and Client Options 54:25 Fault Tolerance and Recommender Systems 01:01:19 Kafka Integration and Deployment 01:06:56 Roadmap and Future Improvements Important links: VeniceDB blog: https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform VeniceDB docs: https://venicedb.org/ QCon: https://youtu.be/pJeg4V3JgYo?si=vblGUxp5fNdKPHoC Follow me on LinkedIn and Twitter: https://www.linkedin.com/in/kaivalyaapte/ and https://twitter.com/thegeeknarrator If you like this episode, please hit the like button and share it with your network. Also please subscribe if you haven't yet. Database internals series: https://youtu.be/yV_Zp0Mi3xs Popular playlists: Realtime streaming systems: https://www.youtube.com/playlist?list=PLL7QpTxsA4se-mAKKoVOs3VcaP71X_LA- Software Engineering: https://www.youtube.com/playlist?list=PLL7QpTxsA4sf6By03bot5BhKoMgxDUU17 Distributed systems and databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4sfLDUnjBJXJGFhhz94jDd_d Modern databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4scSeZAsCUXijtnfW5ARlrsN Stay curious! Keep learning! #kafka #linkedin #venicedb #Rocksdb
Summary Databases come in a variety of formats for different use cases. The default association with the term "database" is relational engines, but non-relational engines are also used quite widely. In this episode Oren Eini, CEO and creator of RavenDB, explores the nuances of relational vs. non-relational engines, and the strategies for designing a non-relational database. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is brought to you by Datafold – a testing automation platform for data engineers that prevents data quality issues from entering every part of your data workflow, from migration to dbt deployment. Datafold has recently launched data replication testing, providing ongoing validation for source-to-target replication. Leverage Datafold's fast cross-database data diffing and Monitoring to test your replication pipelines automatically and continuously. Validate consistency between source and target at any scale, and receive alerts about any discrepancies. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold). Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Oren Eini about the work of designing and building a NoSQL database engine Interview Introduction How did you get involved in the area of data management? Can you describe what constitutes a NoSQL database? How have the requirements and applications of NoSQL engines changed since they first became popular ~15 years ago? What are the factors that convince teams to use a NoSQL vs. SQL database? NoSQL is a generalized term that encompasses a number of different data models. How does the underlying representation (e.g. document, K/V, graph) change that calculus? How have the evolution in data formats (e.g. N-dimensional vectors, point clouds, etc.) changed the landscape for NoSQL engines? 
When designing and building a database, what are the initial set of questions that need to be answered? How many "core capabilities" can you reasonably design around before they conflict with each other? How have you approached the evolution of RavenDB as you add new capabilities and mature the project? What are some of the early decisions that had to be unwound to enable new capabilities? If you were to start from scratch today, what database would you build? What are the most interesting, innovative, or unexpected ways that you have seen RavenDB/NoSQL databases used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on RavenDB? When is a NoSQL database/RavenDB the wrong choice? What do you have planned for the future of RavenDB? Contact Info Blog (https://ayende.com/blog/) LinkedIn (https://www.linkedin.com/in/ravendb/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links RavenDB (https://ravendb.net/) RSS (https://en.wikipedia.org/wiki/RSS) Object Relational Mapper (ORM) (https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping) Relational Database (https://en.wikipedia.org/wiki/Relational_database) NoSQL (https://en.wikipedia.org/wiki/NoSQL) CouchDB (https://couchdb.apache.org/) Navigational Database (https://en.wikipedia.org/wiki/Navigational_database) MongoDB (https://www.mongodb.com/) Redis (https://redis.io/) Neo4J (https://neo4j.com/) Cassandra (https://cassandra.apache.org/_/index.html) Column-Family (https://en.wikipedia.org/wiki/Column_family) SQLite (https://www.sqlite.org/) LevelDB (https://github.com/google/leveldb) Firebird DB (https://firebirdsql.org/) fsync (https://man7.org/linux/man-pages/man2/fsync.2.html) Esent DB? (https://learn.microsoft.com/en-us/windows/win32/extensible-storage-engine/extensible-storage-engine-managed-reference) KNN == K-Nearest Neighbors (https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm) RocksDB (https://rocksdb.org/) C# Language (https://en.wikipedia.org/wiki/C_Sharp_(programming_language)) ASP.NET (https://en.wikipedia.org/wiki/ASP.NET) QUIC (https://en.wikipedia.org/wiki/QUIC) Dynamo Paper (https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf) Database Internals (https://amzn.to/49A5wjF) book (affiliate link) Designing Data Intensive Applications (https://amzn.to/3JgCZFh) book (affiliate link) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Highlights from this week's conversation include: Apurva's background in streaming technology (0:48) Developer experience and Kafka Streams (2:47) Motivation to bootstrap a startup (4:09) Meeting the Confluent founders and early work at Confluent (6:59) Projects at Confluent and transition to engineering management (10:34) Overview of Responsive and event-driven applications (12:55) Defining event-driven applications (15:33) Importance of latency and state in event-driven applications (18:54) Low Latency and Stateful Processing (21:52) In-Memory Storage and Evolution of Kafka (25:02) Motivation for KSQL and Kafka Streams (29:46) Category Creation and Database-like Interface (34:33) Developer Experience with Kafka and Kafka Streams (38:50) Kafka Streams Functionality and Operational Challenges (41:44) Metrics and Tuning Configurations (43:33) Architecture and Decoupling in Kafka Streams (45:39) State Storage and Transition from RocksDB (47:48) Future of Event-Driven Architectures (56:30) Final thoughts and takeaways (57:36) The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
Summary Data lakehouse architectures are gaining popularity due to the flexibility and cost effectiveness that they offer. The link that bridges the gap between data lake and warehouse capabilities is the catalog. The primary purpose of the catalog is to inform the query engine of what data exists and where, but the Nessie project aims to go beyond that simple utility. In this episode Alex Merced explains how the branching and merging functionality in Nessie allows you to use the same versioning semantics for your data lakehouse that you are used to from Git. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free! Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular priced and late bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today! Your host is Tobias Macey and today I'm interviewing Alex Merced, developer advocate at Dremio and co-author of the upcoming book from O'reilly, "Apache Iceberg, The definitive Guide", about Nessie, a git-like versioned catalog for data lakes using Apache Iceberg Interview Introduction How did you get involved in the area of data management? Can you describe what Nessie is and the story behind it? What are the core problems/complexities that Nessie is designed to solve? 
The closest analogue to Nessie that I've seen in the ecosystem is LakeFS. What are the features that would lead someone to choose one or the other for a given use case? Why would someone choose Nessie over native table-level branching in the Apache Iceberg spec? How do the versioning capabilities compare to/augment the data versioning in Iceberg? What are some of the sources of, and challenges in resolving, merge conflicts between table branches? Can you describe the architecture of Nessie? How have the design and goals of the project changed since it was first created? What is involved in integrating Nessie into a given data stack? For cases where a given query/compute engine doesn't natively support Nessie, what are the options for using it effectively? How does the inclusion of Nessie in a data lake influence the overall workflow of developing/deploying/evolving processing flows? What are the most interesting, innovative, or unexpected ways that you have seen Nessie used? What are the most interesting, unexpected, or challenging lessons that you have learned while working with Nessie? When is Nessie the wrong choice? What have you heard is planned for the future of Nessie? Contact Info LinkedIn (https://www.linkedin.com/in/alexmerced) Twitter (https://www.twitter.com/amdatalakehouse) Alex's Article on Dremio's Blog (https://www.dremio.com/authors/alex-merced/) Alex's Substack (https://amdatalakehouse.substack.com/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. Links Project Nessie (https://projectnessie.org/) Article: What is Nessie, Catalog Versioning and Git-for-Data? 
(https://www.dremio.com/blog/what-is-nessie-catalog-versioning-and-git-for-data/) Article: What is Lakehouse Management?: Git-for-Data, Automated Apache Iceberg Table Maintenance and more (https://www.dremio.com/blog/what-is-lakehouse-management-git-for-data-automated-apache-iceberg-table-maintenance-and-more/) Free Early Release Copy of "Apache Iceberg: The Definitive Guide" (https://hello.dremio.com/wp-apache-iceberg-the-definitive-guide-reg.html) Iceberg (https://iceberg.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/) Arrow (https://arrow.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/) Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=6cc46c8c6088) LakeFS (https://lakefs.io/) Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157) AWS Glue (https://aws.amazon.com/glue/) Tabular (https://tabular.io/) Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363) Trino (https://trino.io/) Presto (https://prestodb.io/) Dremio (https://www.dremio.com/) Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58) RocksDB (https://rocksdb.org/) Delta Lake (https://delta.io/) Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/) Hive Metastore (https://cwiki.apache.org/confluence/display/hive/design#Design-Metastore) PyIceberg (https://py.iceberg.apache.org/) Optimistic Concurrency Control (https://en.wikipedia.org/wiki/Optimistic_concurrency_control) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Follow: https://stree.ai/podcast | Sub: https://stree.ai/sub | Looking back at our favorite episodes from 2023, Tim Berglund chats with Anna McDonald about the fascinating world of Kafka Streams. Anna, a customer success technical architect at Confluent, shares her insights on the core concepts of Kafka Streams, including the all-important table and stream abstractions. They delve into the benefits of statefulness and durability, such as active and standby tasks, which ensure seamless failover, and how Kafka Streams stores state in RocksDB and in Kafka itself. New episodes every Monday resume on January 8, 2024!
In this episode, we spoke with Dhruba Borthakur. Dhruba is the CTO and co-founder at Rockset, a search and analytics database hosted on the cloud. He was the founding engineer of the RocksDB project at Facebook and was for a while the Principal Architect of HDFS. In this episode, we discuss RocksDB and compare it with LevelDB. We also discuss in detail the Aggregator Leaf Tailer architecture, which started at Facebook and now powers Rockset. Follow Dhruba: https://twitter.com/dhruba_rocks Follow Alex: https://twitter.com/alexbdebrie
Louis Brandy is VP of Engineering at Rockset, the real-time search and analytics database startup formed by the creators of the popular open source project, RocksDB. Subscribe to the Gradient Flow Newsletter: https://gradientflow.substack.com/ Subscribe: Apple • Spotify • Stitcher • Google • AntennaPod • Podcast Addict • Amazon • RSS. Detailed show notes can be found on The Data Exchange web site.
Follow: https://stree.ai/podcast | Sub: https://stree.ai/sub | New episodes every Monday! Join Tim Berglund as he chats with Anna McDonald about the fascinating world of Kafka Streams. Anna, a customer success technical architect at Confluent, shares her insights on the core concepts of Kafka Streams, including the all-important table and stream abstractions. They delve into the benefits of statefulness and durability, such as active and standby tasks, which ensure seamless failover, and how Kafka Streams stores state in RocksDB and in Kafka itself. With a teaser for the next episode, this conversation promises an exciting exploration of data ingestion and time management in Kafka Streams. Don't miss out on this insightful discussion! Starting with Apache Kafka: https://developer.confluent.io/learn-kafka/apache-kafka/events/ KIP-392 information: https://cwiki.apache.org/confluence/display/KAFKA/KIP-392%3A+Allow+consumers+to+fetch+from+closest+replica
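To make the table and stream abstractions, and the RocksDB-backed state the episode describes, a bit more concrete, here is a minimal Kafka Streams sketch. The topic names, the store name, and the standby-replica setting are illustrative choices, not taken from the episode: it counts clicks per user into a KTable whose state lives in a local RocksDB store and is also written to a Kafka changelog topic, which is what makes standby tasks and failover possible.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class ClickCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        // The application id also prefixes the internal changelog topics backing the state store.
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        // Keep one warm standby copy of the local state so failover does not require a full restore.
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        // Stream abstraction: an unbounded sequence of (user, click) events.
        KStream<String, String> clicks =
                builder.stream("clicks", Consumed.with(Serdes.String(), Serdes.String()));
        // Table abstraction: the latest count per user, materialized in a local RocksDB
        // store and replicated to a Kafka changelog topic for durability and recovery.
        KTable<String, Long> countsPerUser = clicks
                .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
                .count(Materialized.as("clicks-per-user-store"));
        countsPerUser.toStream().to("click-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```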
Highlights from this week's conversation include: Dhruba's journey into the data space (2:02) The impact of Hadoop on the industry (3:37) Dhruba's work in the early days of the Facebook team (7:54) Building and implementing RocksDB (14:33) Stories with Mark Zuckerberg at Facebook (24:25) The next evolution in storage hardware (26:14) How Rockset is different from other real-time platforms (33:13) Going from a key value store to an index (37:15) Where does Rockset go from here? (44:59) The success of RocksDB as an open source project (49:11) How do we properly steward real-time technology for impact (51:17) Final thoughts and takeaways (56:18) The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.
How much of a MySQL drop-in replacement is MariaDB, really? MariaDB is a fork of the popular MySQL database, created to be a drop-in replacement and a direct alternative to MySQL. But how much of that is true? Is MariaDB MySQL-compatible? Where are the similarities and differences? What was the actual reason for the fork? In which areas are the two databases evolving in completely different directions? And what has been happening with storage engines? In this episode we shed some light on the direct comparison between MySQL and MariaDB. Bonus: what a Weber grill has to do with MySQL and MariaDB.
We discuss what databases are made of, where the bottlenecks are, and which products are coming to replace the established Postgres, MongoDB, Redis and Neo4j. For example, did you know that most modern databases store data in a log-structured merge-tree, or more precisely in one particular implementation of it, RocksDB from Facebook? Or that Postgres cannot use Linux's asynchronous I/O interfaces, and at that level alone is roughly 10x slower than newer alternatives built on io_uring and SPDK? Or that data can be sent from disk straight to the GPU, bypassing the CPU? The team at Unum has its own alternative to RocksDB, up to 7x faster and functionally compatible. Around it they have built a series of open source projects, including UKV, a universal database that can store documents, graphs and binary data, offers end-to-end ACID transactions, and comes with a set of intuitive libraries for users of higher-level languages. It is already available on AWS Marketplace, and the documentation and source code are on GitHub as well as on the official site https://www.unum.cloud/
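For context on why RocksDB keeps coming up across these episodes: it is an embeddable key-value library built around a log-structured merge-tree, not a standalone server. A minimal sketch using the RocksJava binding, with an arbitrary path and keys chosen purely for illustration:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbHello {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();
        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/rocksdb-demo")) {
            // Writes land in an in-memory memtable plus a write-ahead log; the memtable is
            // later flushed to immutable SST files, and compaction merges those files over
            // time. That is the log-structured merge-tree behaviour mentioned above.
            db.put("user:42".getBytes(), "{\"name\":\"Ada\"}".getBytes());
            byte[] value = db.get("user:42".getBytes());
            System.out.println(new String(value));
            db.delete("user:42".getBytes());
        }
    }
}
```

Because the whole engine runs inside the calling process, systems such as Kafka Streams, MyRocks, and the other projects listed in these show notes embed it as their local storage layer rather than talking to it over a network.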
Summary The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it's all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today. Join in with the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit: dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications Interview Introduction How did you get involved in the area of data management? Can you describe what Grainite is and the story behind it? What are the personas that you are focused on addressing with Grainite? 
What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite? How does Grainite work to reduce that complexity? What are some of the commonalities that you see in the teams/organizations that find their way to Grainite? What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture? Can you describe how Grainite is architected? How have the design and goals of the platform changed/evolved since you first started working on it? What does your internal build vs. buy process look like for identifying where to spend your engineering resources? What is the process for getting Grainite set up and integrated into an organizations technical environment? What is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product? Once Grainite is running, can you describe the day 0 workflow of building an application or data flow? What are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications? What are the most interesting, innovative, or unexpected ways that you have seen Grainite used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite? When is Grainite the wrong choice? What do you have planned for the future of Grainite? Contact Info Ashish LinkedIn (https://www.linkedin.com/in/ashishkumarprofile/) Abhishek LinkedIn (https://www.linkedin.com/in/abhishekchauhan/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Grainite (https://www.grainite.com/) Blog about the challenges of streaming architectures (https://www.grainite.com/blog/there-was-an-old-lady-who-swallowed-a-fly) Getting Started Docs (https://gitbook.grainite.com/developers/getting-started) BigTable (https://research.google/pubs/pub27898/) Spanner (https://research.google/pubs/pub39966/) Firestore (https://cloud.google.com/firestore) OpenCensus (https://opencensus.io/) Citrix (https://www.citrix.com/) NetScaler (https://www.citrix.com/blogs/2022/10/03/netscaler-is-back/) J2EE (https://www.oracle.com/java/technologies/appmodel.html) RocksDB (https://rocksdb.org/) Pulsar (https://pulsar.apache.org/) SQL Server (https://en.wikipedia.org/wiki/Microsoft_SQL_Server) MySQL (https://www.mysql.com/) RAFT Protocol (https://raft.github.io/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
The second database deep dive on the Engineering Kiosk. We indirectly pick up where episode 8 on databases left off, but this time we start at the very beginning: with hierarchical databases, then object-oriented databases, on to SQL, and finally the NoSQL and column-oriented database era. Along the way we answer questions such as: what is the difference between databases and files, are object-oriented databases still a hype, what are indexes and how do they work, why can migrating away from Oracle be difficult, is Lucene a database, and much more. Bonus: what pumpkin seeds have to do with databases, and why MySQL is a better address book with a SQL interface. Feedback to stehtisch@engineeringkiosk.dev or via Twitter to https://twitter.com/EngKiosk Links: IBM mainframes: https://www.ibm.com/de-de/it-infrastructure/z ClickHouse: https://github.com/ClickHouse/ClickHouse / https://clickhouse.com/ Oracle Cloud Free Tier: https://www.oracle.com/de/cloud/free/ Apache Lucene: https://lucene.apache.org/ Apache Solr: https://solr.apache.org/ ElasticSearch: https://github.com/elastic/elasticsearch List of database management systems: https://de.wikipedia.org/wiki/Liste_der_Datenbankmanagementsysteme IBM Go fork for mainframes: https://github.com/linux-on-ibm-z/go DB4O: https://de.wikipedia.org/wiki/Db4o Michael Stonebraker / The End of an Architectural Era (It's Time for a Complete Rewrite): http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf Percona: https://www.percona.com/ 2ndQuadrant: https://www.2ndquadrant.com/ OSS Names: https://github.com/EngineeringKiosk/OSS-Names Redis: https://github.com/redis/redis RedisLabs: https://redis.com/ antirez: http://antirez.com/ RocksDB: http://rocksdb.org/ ElasticSearch: https://github.com/elastic/elasticsearch LevelDB: https://github.com/google/leveldb MyRocks: http://myrocks.io/ Chapters: (00:00:00) Intro (00:00:55) Math professors and pumpkin-seed rolls (00:02:27) Why databases are a topic close to Wolfgang's heart (00:04:08) What is a database, and when do you use one? (00:06:34) Are plain files also a database? (00:07:25) Hierarchical database systems: IBM IMS (00:09:30) IBM mainframes, Go, Docker and horizontal scaling (00:11:30) What would be a use case for hierarchical and object-oriented databases? (00:16:15) Have you ever used an object-oriented database in a project? (00:16:52) Separating data and application logic, and SQL as basic database knowledge (00:18:55) What is the difference between SQL databases and files? (00:19:32) Database indexes: data duplication, read and write access (00:23:54) Is an Excel file a database? (00:24:58) Difference between files and databases: use by multiple users (00:28:03) Recovery, persistent and consistent storage with files and databases (00:31:01) Relational databases are the actual classic databases (00:34:31) Proprietary databases: migrating from Oracle to PostgreSQL (00:37:06) Oracle Cloud and the free tier (00:38:29) MySQL was taken over by Oracle, and MariaDB as an alternative (00:39:48) Logic in the database, Oracle migrations and application servers (00:41:10) Is there a killer argument for proprietary databases? (00:43:57) Where do the names MySQL and MariaDB come from? (00:45:19) Is Elasticsearch a database in the classic sense? (00:46:38) Are Redis and other key-value stores databases? (00:48:42) NoSQL is for kids, feature-itis, simple databases and LevelDB / RocksDB and MyRocks (00:53:19) What are columnar databases and when should they be used? Analytical databases and ClickHouse from Yandex (00:58:15) Which questions matter when choosing the right database? (01:01:43) Feedback on databases and outro Hosts: Wolfgang Gassler (https://twitter.com/schafele), Andy Grunwald (https://twitter.com/andygrunwald) Engineering Kiosk Podcast: inquiries to stehtisch@engineeringkiosk.dev or via Twitter to https://twitter.com/EngKiosk
We've been talking a lot about K8s of late, so we thought it was time to get back down to earth and spend some time with Adi Gelvan (@speedb_io), co-founder and CEO of Speedb, an embedded key-value store and drop-in replacement for RocksDB that significantly improves on its IO performance for large metadata databases. At Adi's last … Continue reading "132: GreyBeards talk fast embedded k-v stores with Speedb's Co-Founder&CEO Adi Gelvan"
About VenkatVenkat Venkataramani is CEO and co-founder of Rockset. In his role, Venkat helps organizations build, grow and compete with data by making real-time analytics accessible to developers and data teams everywhere. Prior to founding Rockset in 2016, he was an Engineering Director for the Facebook infrastructure team that managed online data services for 1.5 billion users. These systems scaled 1000x during Venkat's eight years at Facebook, serving five billion queries per second at single-digit millisecond latency and five 9's of reliability. Venkat and his team also created and contributed to many noted data technologies and open-source projects, including Facebook's TAO distributed data store, RocksDB, Memcached, MySQL, MongoRocks, and others. Prior to Facebook, Venkat worked on tools to make the Oracle database easier to manage. He has a master's in computer science from the University of Wisconsin-Madison, and bachelor's in computer science from the National Institute of Technology, Tiruchirappalli.Links Referenced: Company website: https://rockset.com Company blog: https://rockset.com/blog TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored by our friends at Revelo. Revelo is the Spanish word of the day, and its spelled R-E-V-E-L-O. It means “I reveal.” Now, have you tried to hire an engineer lately? I assure you it is significantly harder than it sounds. One of the things that Revelo has recognized is something I've been talking about for a while, specifically that while talent is evenly distributed, opportunity is absolutely not. They're exposing a new talent pool to, basically, those of us without a presence in Latin America via their platform. It's the largest tech talent marketplace in Latin America with over a million engineers in their network, which includes—but isn't limited to—talent in Mexico, Costa Rica, Brazil, and Argentina. Now, not only do they wind up spreading all of their talent on English ability, as well as you know, their engineering skills, but they go significantly beyond that. Some of the folks on their platform are hands down the most talented engineers that I've ever spoken to. Let's also not forget that Latin America has high time zone overlap with what we have here in the United States, so you can hire full-time remote engineers who share most of the workday as your team. It's an end-to-end talent service, so you can find and hire engineers in Central and South America without having to worry about, frankly, the colossal pain of cross-border payroll and benefits and compliance because Revelo handles all of it. If you're hiring engineers, check out revelo.io/screaming to get 20% off your first three months. That's R-E-V-E-L-O dot I-O slash screaming.Corey: This episode is sponsored in part by LaunchDarkly. Take a look at what it takes to get your code into production. I'm going to just guess that it's awful because it's always awful. No one loves their deployment process. What if launching new features didn't require you to do a full-on code and possibly infrastructure deploy? 
What if you could test on a small subset of users and then roll it back immediately if results aren't what you expect? LaunchDarkly does exactly this. To learn more, visit launchdarkly.com and tell them Corey sent you, and watch for the wince.Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. Today's promoted guest episode is one of those questions I really like to ask because it can often come across as incredibly, well, direct, which is one of the things I love doing. In this case, the question that I am asking is, when you look around at the list of colossal blunders that people make in the course of careers in technology and the rest, it's one of the most common is, “Oh, yeah. I don't like the way that this thing works, so I'm going to build my own database.” That is the siren call to engineers, and it is often the prelude to horrifying disasters. Today, my guest is Venkat Venkataramani, co-founder and CEO at Rockset. Venkat, thank you for joining me.Venkat: Thanks for having me, Corey. It's a pleasure to be here.Corey: So, it is easy for me to sit here in my beautiful ivory tower that is crumbling down around me and use my favorite slash the best database imaginable, which is TXT records shoved into Route 53. Now, there are certainly better databases than that for most use cases. Almost anything really, to be honest with you, because that is a terrifying pattern; good joke, terrible practice. What is Rockset as we look at the broad landscape of things that store data?Venkat: Rockset is a real-time analytics platform built for the cloud. Let me break that down a little bit, right? I think it's a very good question when you say does the world really need another database? Don't we have enough already? SQL databases, NoSQL databases, warehouses, and lake houses now.So, if you really break it down, the first digital transformation that happened in the '80s was when people actually retired pen and paper records and started using a relational database to actually manage their business records and what have you instead of ledgers and books and what have you. And that was the first digital transformation. That was—and Oracle called the rows in a table ‘records' for a reason. They're called records to this date. And then, you know, 20 years later, when all businesses were doing system of record and transactions and transactional databases, then analytics was born, right?This was, like, the whole reason why I wanted to make better data-driven business decisions, and BI was born, warehouses and data lakes started becoming more and more mainstream. And there was really a second category of database management systems because the first category it was very good at to be a system of record, but not really good at complex analytics that businesses are asking to be able to guide their decisions. Fast-forward 20 years from then, the nature of applications are changing. The world is going from batch to real-time, your data never stops coming, advent of Apache Kafka and technologies like that, 5G, IoTs, data is coming from all sorts of nooks and corners within an enterprise, and now customers in enterprises are acquiring the data in real-time at a scale that the world has never seen before.Now, how do you get analytics out of that? And then if you look at the database market—entire market—there are still only two large categories of databases: OLTP databases for transaction processing, and warehouses and data lakes for batch analytics. 
Now suddenly, you need the speed of OLTP at the scale of batch, right, in terms of, like, complexity of compute, complexity of storage. So, that is really why we thought the data management space needs that third leg, and we call it real-time analytics platform or real-time analytics processing. And this is where the data never stops coming; the queries never stopped coming.You need the speed and the scale, and it's about time we innovate and solve the problem well because in 2015, 2016, when I was researching for this, every company that was looking to solve build applications that were real-time applications was building a custom Rube Goldberg machine of sorts. And it was insanely complex, it was insanely expensive. Fast-forward now, you can build a real-time application in a matter of hours with the simplicity of the cloud using Rockset.Corey: There's a lot to be said that the way we used to do things after the first transformation and we got into the world of batch processing, where—in the days of punch cards, which was a bit before my time and I believe yours as well—where they would drop them off and then the next day, or two days, they would come back later after the run, they would get the results only to figure out syntax error because you put the wrong card first or something like that. And it was maddening. In time, that got better, but still, nightly runs have become a thing to the point where even now, by default, if you wind up looking at the typical timing of a default Linux install, for example, you see that in the middle of the night is when a bunch of things will rotate when various cleanup jobs get done, et cetera, et cetera. And that seemed like a weird direction to go in. One of the most famous Google April Fools Day jokes was when they put out their white paper on MapReduce.And then Yahoo fell for it hook, line, and sinker, built out Hadoop, and we've been stuck with this idea of performing these big query jobs on top of existing giant piles of data, where ideally, you can measure it with a wall clock; in practice, you often measure the calendar in some cases. And as the world continues to evolve, being able to do streaming processing and understand in real-time what is going on, is unlocking different approaches, at least by all accounts. Do you have an example you can give me of a problem that real-time analytics solves for a customer? Because I can sit here and talk all day about how things might theoretically work, but I have to get out of my Route 53-based ivory tower over here, what are customers seeing?Venkat: That's a great question. And I want one hundred percent agree. I think Google did build MapReduce, and I think it's a very nice continuation of what happened there and what is happening in the world now. And built MapReduce and they quickly realized re-indexing the whole world [laugh] every night, as the size of the internet is exploding is a bad idea. And you know how Google index is now? They do real-time indexing.That is how they index the wor—you know, web. And they look for the changes that are happening in the internet, and they only index the changes. And that is exactly the same principle behind—one of the core principles behind Rockset's real-time analytics platform. So, what is the customer story? 
So, let me give you one of my favorite ones.So, the world's number one or number two buy now, pay later company, they have hundreds of millions of users, they have 300,000-plus merchants, they operate in, like, maybe 100-plus countries, so many different payment methods, you can imagine the complexity. At any given point in time, some part of the product is broken, well, Apple Pay stopped working in Switzerland for this e-commerce merchant. Oh God, like, we got to first detect that. Forget even debugging and figuring out what happened and having an incident response team. So, what did they do as they scale the number of payments processed in the system across the world—it's, like, in millions; first, it was millions in the day, and there was millions in an hour—so like everybody else, they built a batch-based system.So, they would accumulate all these payment records, and every six hours—so initially, it was a day, and then afterwards, you know, you try to see how far I can push it, and they couldn't push it beyond every six hours. Every six hours, some batch job would come and process through all the payments that happened, have some statistical models to detect, hey, here are some of the things that you might want to double-click and follow up on. And as they were scaling, the batch job that they will kick off every six hours was starting to take more than six hours. So, you can see how the story goes. Now, fast-forward, they came to us and say—it's almost like Rockset has, like, a big red button that says, “Real-time this.”And then they kind of like, “Can you make this real-time? Because not only that we are losing millions of potential revenue dollars in a year because something stops working and we're not processing payments, and we don't find out about that up to, like, three hours later, five hours later, six hours later, but our merchants are also very unhappy. We are also not able to protect our customers' business because that is all we are about.” And so fast-forward, they use Rockset, and simply using SQL now they have all the metrics and statistical computation that they want to do, happens in real-time, that are accurate up to the second. All of their anomaly detectors run every minute and the anomaly detectors take, like, hundreds of milliseconds to run.And so, now they've cut down the business observability, I would say. It's not metrics and machine observability is actually the—you know, they have now business observability in real-time. And that not only actually saves them a lot of potential revenue loss from downtimes, that's also allowing them to build a better product and give their customers a better experience because they are now telling their merchants and their customers that something is not working in some part of your e-commerce footprint before even the customers notice that something is wrong. And that allows them to build a better product and a better customer experience than their competitors. So, this is a very real-world example of why companies and enterprises are moving from batch to real-time.Corey: With the stories that you, and frankly, a lot of other data analytics companies tend to fall back on all the time has been stories of the ones you're telling, where you're talking about the largest buy now, pay later lender, for example. These are companies operating at massive scale who have tremendous existing transaction volume, and they're built out already. That's great, but then I wanted to try to cut to the truth of some of these things. 
And when I visit your pricing page at Rockset, it doesn't have what I would expect if that were the only use case. And what that would be is, “Great. Call here to conta—open up a sales quote, and we'll talk to you et cetera, et cetera, et cetera.”And the answer then is, “Okay, I know it's going to have at least two commas in it, ideally, not three, but okay, great.” Instead, you have a free tier where it's, “Hey, we'll give you a pile of credits, here's some limits on our free account, et cetera, et cetera.” Great. That is awesome. So, it tells me that there is a use case here for folks who have not already, on some level, made a good show of starting the process of conquering the world.Rather, someone with an idea some evening at two in the morning can wind up diving in and getting started. What is the Twitter for Pets, in my garage, spare-time side project story for using something like Rockset? What problem will I have as I wind up building those things out, when I don't have any user traffic or data yet, but I want to, you know for once in my life, do the smart thing in advance rather than building an impressive tower of technical debt?Venkat: That is the first thing we built, by the way. When we finish our product, the first thing we built was self-service. The first thing we built was a free forever tier, which has certain limits because somebody has to pay the bill, right? And then we also have compute instances that are very, very affordable that cost you, like, approximately $1 a day. And so, we built all of that because real-time analytics is not a need that only, like, the large-scale companies have. And I'll give you a very, very simple example.Let's say you're building a game, it's a mobile game. You can use Amazon DynamoDB and use AWS Lambdas and have a serverless stack and, like, you're really only paying… you're kind of keeping your footprint very, very small, and you're able to build a very lively game and see if it gets [wider 00:12:16], and it's growing. And once it grows, you can have all the big company scaling problems. But in the early days, you're just getting started. Now, if you think about DynamoDB and Lambdas and whatnot, you can build almost every part of the game except probably the leaderboard.So, how do I build a leaderboard when thousands of people are playing and all of their individual gameplays and scores and everything is just another simple record in DynamoDB. It's all serverless. But DynamoDB doesn't give me a SQL SELECT *, order by score, limit 100, distinct by the same player. No, this is a analytical question, and it has to be updated in real-time, otherwise, you really don't have this thing where I just finished playing. I go to the leaderboard, and within a second or two, if it doesn't update, you kind of lose people along the way. 
So, this is one of actually a very popular use case, when the scale is much smaller, which is, like, Rockset augments NoSQL database like a Dynamo or a Mongo where you can continue to use that for—or even a Postgres or MySQL for that case where you can use that as your system of record and keep it small, but cover all of your compute-heavy and analytical parts of your application with Rockset.So, it's almost like kind of a CQRS pattern where you use your OLTP database as your system of record, you connect Rockset to it, and so—Rockset comes in with built-in connectors, by the way, so you don't have to write a single line of code for your inserts and updates and deletes in your transactional database to get reflected in Rockset within one to two seconds. And so now, all of a sudden you have a fully indexed, fast SQL replica of your transactional database that on which you can do all sorts of analytical queries and that's fully isolated with your transactional database. So, this is the pattern that I'm talking about. The mobile leaderboard is an example of that pattern where it comes in very handy. But you can imagine almost everybody building some kind of an application has certain parts of it that is very analytical in nature. And by augmenting your transactional database with Rockset, you can have your cake and eat it too.Corey: One of the challenges I think that at least I've run into when it comes to working with data—and let's be clear, I tend to deal with data in relatively small volumes, mostly. The stuff that's significantly large, like, oh, I don't know, AWS bills from large organizations, the format of those is mostly predefined. When I'm building something out, we're using, I don't know, DynamoDB or being dangerous with SQLite or whatnot, invariably I find that even at small-scale, I paint myself into a corner by data model design or how I wind up structuring access or the rest, and the thing that I'm doing that makes perfect sense today winds up being incredibly challenging to change later. And I still, in production and have a DynamoDB table that has the word ‘test' in its name because of course I do.It's not a great place to find yourself in some cases. And I'm curious as to what you've seen, as you've been building this out and watching customers, especially ones who already had significant datasets as they move to you. Do you have any guidance around how to avoid falling down that particular well?Venkat: I will say a lot of the complexity in this world is by solving the right problems using the wrong tool, or by solving the right problem on the wrong part of the stack. I'll unpack this a little bit, right? So, when your patterns change, your application is getting more complex, it is demanding more things, that doesn't necessarily mean the first part of the application you build—and let's say DynamoDB was your solution for that—was the wrong choice. That is the right choice, but now you're expanded the scope of your application and the demand that you have on your backend transactional database. And now you have to ask the question, now in the expanded scope, which ones are still more of the same category of things on why I chose Dynamo and which ones are actually not at all?And so, instead of going and abusing the GSIs and other really complex and expensive indexing options and whatnot, that Dynamo, you know, has built, and has all sorts of limitations, instead of that, what do I really need and what is the best tool for the job, right? What is the best system for that? 
And how do I augment? And how do I manage these things? And this goes to the first thing I said, which is, like, this tremendous complexity when you start to build a Rube Goldberg machine of sorts.Okay, now, I'm going to start making changes to Dynamo. Oh, God, like, how do I pick up all of those things and not miss a single record? Now, replicate that to another second system that is going to be search-centric or reporting-centric, and do I have to rethink this once in a while? Do I have to build and manage these pipelines? And suddenly, instead of going from one system to two system, you actually end up going from one system to, like, four different things that with all the pipes and tubes going into the middle.And so, this is what we really observed. And so, when you come in to Rockset and you point us at your DynamoDB table, you don't write a single line of code, and Rockset will automatically scan your Dynamo tables, move that into Rockset, and in real-time, your changes, insert, updates, deletes to Dynamo will be reflected in Rockset. And this is all using Dynamo Streams API, Dynamo Scan API, and whatnot, behind the scenes. And this just gives you an example of if you use the right tool for the job here, when suddenly your application is demanding analytical queries on Dynamo, and you do the right research and find the right tool, your complexity doesn't explode at all, and you can still, again, continue to use Dynamo for what it is very, very good at while augmenting that with a system built for analytics with full-featured SQL and other capabilities that I can talk about, for the parts of your application for which Dynamo is not a good fit. And so, if you use the right tool for the job, you should be in very good place.The other thing is part about this wrong part of the stack. I'll give a very kind of naive example, and then maybe you can extrapolate that to, like, other patterns on how people could—you know, accidental complexities the worst. So, let's just say you need to implement access control on your data. Let's say the best place to implement access control is at the database level, just happens to be that is the right thing. But this database that I picked, doesn't really have role-based access control or what have you, it doesn't really give me all the security features to be able to protect the data the way I want it.So, then what I'm going to do is, I'm going to go look at all the places that is actually having business logic and querying the database and I'm going to put a whole bunch of permission management and roles and privileges, and you can just see how that will be so error-prone, so hard to maintain, and it will be impossible to scale. And this is what is the worst form of accidental complexity because if you had just looked at it that one week or two weeks, how do I get something out, or the database I picked doesn't have it, and then the two weeks, you feel like you made some progress by, kind of like, putting some duct tape if conditions on all the access paths. But now, [laugh] you've just painted yourself into a really, really bad corner.And so, this is another variation of the same problem where you end up solving the right problems in the wrong part of the stack, and that just introduces tremendous amount of accidental complexity. And so, I think yeah, both of these are the common pitfalls that I think people make. I think it's easy to avoid them. 
I would say there's so much research, there's so much content, and if you know how to search for these things, they're available on the internet. It's a beautiful place. [laugh]. But I guess you have to know how to search for these things. But in my experience, these are the two common pitfalls a lot of people fall into and paint themselves into a corner. Corey: Couchbase Capella Database-as-a-Service is flexible, full-featured and fully managed with built-in access via key-value, SQL, and full-text search. Flexible JSON documents aligned to your applications and workloads. Build faster with blazing-fast in-memory performance and automated replication and scaling while reducing cost. Capella has the best price performance of any fully managed document database. Visit couchbase.com/screaminginthecloud to try Capella today for free and be up and running in three minutes with no credit card required. Couchbase Capella: make your data sing. Corey: A question I have, though, that is an extension is this—and I want to give some flavor to it—but why is there a market for real-time analytics? And what I mean by that is, early on in my tenure of fixing horrifying AWS bills, I saw a giant pile of money being hurled over at effectively a MapReduce cluster for Elastic MapReduce. Great. Okay, well, stream-processing is kind of a thing; what about migrating to that? Well, that was a complete non-starter because it wasn't just the job running on those things; there were downstream jobs with their own downstream jobs. There were thousands of business processes tied to that thing. And similarly, the idea of real-time analytics, we don't have any use for that because of, oh I don't know, I only wind up pulling these reports on a once-a-week basis, and that's fine, so what do I need that updated for in real-time if I'm looking at them once a week? In practice, the answer is often something aligned with the, “Well, yeah, but if you had a real-time updating dashboard, you would find that more useful than those reports.” But people's expectations and business processes have shaped themselves around constraints that now can be removed, but how do you get them to see that? How do you get them to buy in on that? And then how do you untangle that enormous pile of previous constraints into something that leverages the technology that's now available for a brighter future? Venkat: I think [unintelligible 00:21:40] a really good question: who are the people moving to real-time analytics? What do they see? And why can't they do it with other tech? Like, you know, as you say… EMR, you know, it's just MapReduce; can't I just run it, sort of, every twenty-four hours, every six hours, every hour? How about every five minutes? It doesn't work that way. Corey: How about I spin up a whole bunch of parallel clusters on different timescales so I constantly— Venkat: [laugh]. Corey: Have a new report coming in. It's real-time, except— Venkat: Exactly. Corey: You're constantly putting out new ones, but they're just six hours delayed every time. Venkat: Exactly. So, you don't really want to do this. And so, let me unpack it one at a time, right? I mean, we talked about a very good example of a business team which is building business observability at the buy now, pay later company.
That's a very clear value-prop on why they want to go from batch to real-time because it saves their company tremendous losses—potential losses—and also allows them to build a better product. So, it could be a marketing operations team looking to get more real-time observability to see what campaigns are working well today and how do I double down and make sure my ad budget for the day is put to good use? I don't have to mention security operations, you know, needing real-time. Don't tell me I got owned three days ago. Tell me—[laugh] somebody is, you know, breaking glass and might be, you know, entering into your house right now. And tell me then and not three days later, you know— Corey: “Yeah, what alert system do you have for security intrusion?” “So, I read the front page of The New York Times every morning, waiting to see my company's name.” Yeah, there probably are better ways to reduce that cycle time. Venkat: Exactly, right. And so, that is really the need, right? Like, I think more and more business teams are saying, “I need operational intelligence and not business intelligence.” Don't make me play Monday morning quarterback. My favorite analogy is it's the middle of the third quarter. I'm six points down. A couple of people, star players on my team and my opponent's team, are injured, but there's some in offense, some in defense. What plays do I call and how do I play the game slightly differently to change the outcome of the game and win this game as opposed to losing by six points? So, that I think is kind of really what is driving businesses. You know, I want to be more agile, I want to be more nimble, and take, kind of, data-driven decision-making to another level. So that, I think, is the real force in play. So, now the real question is, why can't they do it already? Because if you go ask a hundred people, “Do you want fast analytics on real-time data or slow analytics on stale data?” How many people are going to say give me slow and stale? Zero, right? Exactly zero people. So, but then why hasn't it happened yet? I think it goes back to the fact that the world has only seen two kinds of databases: transaction processing systems, built for system of record, don't-lose-my-data kind of systems; and then batch analytics, you know, all these warehouses and data lakes. And so, in real-time analytics use cases, the data never stops coming, so you actually need a system that is running 24/7. And then what happens is, as soon as you build a real-time dashboard, like this example that you gave, which is, like, I just want all of these dashboards to automatically update all the time, immediately people respond, saying, “But I'm not going to be like Clockwork Orange, you know, toothpicks in my eyelids, and be staring at this 24/7. Can you do something to alert or detect some anomalies and tap on my shoulder when something off is going on?” And so, now what happens is somebody—a program more than a person—is actually actively monitoring all of these metrics and graphs and doing some analysis, and only bringing this to your attention when you really need to because something is off, right? So, then suddenly what happens is you went from accumulate all the data and run a batch report to [unintelligible 00:25:16], like, the data never stops coming, the queries never stop coming, I never stop asking questions; it's just a programmatic way of asking those things. And at that point, you have a data app. This is not an analytics dashboard report anymore.
You have a full-fledged application. In fact, that application is harder to build and scale than any application you've ever built before [laugh] because in those situations, again, you don't have this torrent of data coming in all the time and complex analytical questions you're asking on the data 24/7, you know? And so, that I think is really why a real-time analytics platform has to be built as almost a third leg. So, this is what we call data apps, which is when your data never stops coming and your queries never stop coming. So, this is really, I think, what is pushing all the expensive EMR clusters, the misuse of your warehouse, the misuse of your data lakes. At the end of the day, that is what is, I think, blowing up your Snowflake bills, what is blowing up your warehouse bills, because you somehow accidentally use the wrong tool for the job [laugh] going back to the one that we just talked about. You accidentally say, “Oh, God, like, I just need some real-time.” With enough thrust, pigs can fly. Is that a good idea? Probably not, right? And so, I don't want to be building a data app on my warehouse just because I can. You should probably use the best tool for the job, and really use something that was built from the ground up for it. And I'll give you one technical insight about how real-time analytics platforms are different than warehouses. Corey: Please. I'm here for this. Venkat: Yes. So really, if you think about warehouses and data lakes, I call them storage-optimized systems. I've been building databases all my life, so if I have to really build a database that is for batch analytics, you just break down all of your expenses in terms of, let's say, compute and storage. What I'm burning 24/7 is storage. Compute comes and goes when I'm doing a batch data load, or I'm running—an analyst who logs in and tries to run some queries. But what I'm actually burning 24/7 is storage, so I want to compress the heck out of the data, and I want to store it in very cheap media. I want to store it—and I want to make the storage as cheap as possible, so I want to optimize the heck out of the storage use. And I want to make computation on that possible but not efficient. I can shuffle things around and make the analysis possible, but I'm not trying to be compute-efficient. And we just talked about how, as soon as you get into real-time analytics, you very quickly get into the data app business. You're not building a real-time dashboard anymore, you're actually building your application. So, as soon as you get into that, what happens is you start burning both storage and compute 24/7. And we all know, relatively, [laugh] compute and RAM are about a hundred to a thousand times more expensive than storage in the grand scheme of things. And so, if you actually go and look at your Snowflake bill, if you go look at your warehouse bill—BigQuery, no matter what—I bet the computational part of it is about 90 to 95% of the bill and not the storage. And then, if you again break down, okay, who's spending all the compute, you'll very quickly narrow down all these real-time-y and data app-y use cases where you can never turn off the compute on your warehouse or your BigQuery, and those are the ones that are blowing up your costs and complexity. And on the Rockset side, we are actually not storage-optimized; we're compute-optimized. So, we index all the data as it comes in.
And so, the storage actually goes slightly higher because the, you know, we stored the data and also the indexes of those data automatically, but we usually fold the computational cost to a quarter of what a typical warehouse needs. So, the TCO for our customers goes down by two to four folds, you know? It goes down by half or even to a quarter of what they used to spend. Even though their storage cost goes up in net, that is a very, very small fraction of their spend.And so really, I think, good real-time analytics platforms are all compute-optimized and not storage-optimized, and that is what allows them to be a lot more efficient at being the backend for these data applications.Corey: As someone who spends a lot of time staring into the depths of AWS bills, I think that people also lose sight of the reality that it doesn't matter what you're spending on AWS; it invariably pales in comparison to what you're spending on people to work with these things. The reason to go to cloud is not because it is the cheapest possible way to get computers to do things; it's because it's a capability story. It's about unlocking capacity and capabilities you do not have otherwise. And that dramatically increases your feature velocity and it lets you to achieve things faster, sooner, with better results. And unlocking a capability is always going to be more interesting to a company than saving money on it. When a company cares first, last, and always about just save money, make the bill lower, the end, it's usually a company in decline. Or alternately, something very strange is going on over there.Venkat: I agree with that. One of our favorite customers told us that Rockset took their six-month roadmap and shrunk it to a single afternoon. And their supply chain SaaS backend for heavy construction, 80% of concrete that are being delivered and tracked in North America follows through their platform, and Rockset powers all of their real-time analytics and reporting. And before Rockset, what did they have? They had built a beautiful serverless stack using DynamoDB, even have AWS Lambdas and what-have-you.And why did they have to do all serverless? Because the entire team was two people. [laugh]. And maybe a third person once in a while, they'll get, so 2.5. Brilliant people, like, you know, really pioneers of building an entire data stack on AWS in a serverless fashion; no pipes, no ETL.And then they were like, oh God, finally, I have to do something because my business demands and my customers are demanding real-time reporting on all of these concrete trucks and aggregate trucks delivering stuff. And real-time reporting is the name of the game for them, and so how do I power this? So, I have to build a whole bunch of pipes, deliver it to, like, some Elasticsearch or some kind of like a cluster that I had to keep up in real-time. And this will take me a couple of months, that will take me a couple of months. They came into Rockset on a Thursday, built their MVP over the weekend, and they had the first working version of their product the following Tuesday.So—and then, you know, there was no turning back at that point, not a single line of code was written. You know, you just go and create an account with Rockset, point us at your Dynamo, and then off you go. You know, you can use start using SQL and go start building your real-time application. So again, I think the tremendous value, I think a lot of customers like us, and a lot of customers love us. 
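A quick back-of-envelope check of those numbers, assuming a bill that is 90% compute and 10% storage: cutting compute to a quarter while storage roughly doubles lands squarely in the quoted two-to-four-fold TCO range.

```latex
\text{relative TCO} \approx \underbrace{0.90 \times \tfrac{1}{4}}_{\text{compute}}
                   + \underbrace{0.10 \times 2}_{\text{storage}}
                   = 0.225 + 0.20 = 0.425
\qquad \Rightarrow \qquad \text{roughly a } 2.4\times \text{ reduction}
```

With a 95/5 split, the same arithmetic gives about 0.34 of the original bill, or roughly a 3x reduction.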
And if you really ask them what is one thing about Rockset that you really like, I think it'll come back to the same thing, which is, you gave me a lot of time back. What I thought would take six months is now a week. What I thought would be three weeks, we got that in a day. And that allows me to focus on my business. I want to spend more time with my stakeholders, you know, my CPO, my sales teams, and see what they need to grow our business and succeed, and not build yet another data pipeline and have data pipelines and other things coming out of my nose, you know? So, at the end of the day, the simplicity aspect of it is very, very important for real-time analytics because, you know, we can't really realize our vision for real-time being the new default in every enterprise wherever analytics is concerned without making it very, very simple and accessible to everybody. And so, that continues to be one of our core things. And I think you're absolutely right when you say the biggest expense is actually the people and the time and the energy they have to spend. And not having to stand up a huge data ops team that is building and managing all of these things is probably the number one reason why customers really, really like working with our product. Corey: I want to thank you for taking so much time to talk me through what you're working on these days. If people want to learn more, where's the best place to find you? Venkat: We are Rockset; I'll spell it out for your listeners: ROCKSET—rock set—rockset.com. You can go there, you can start a free trial. There is a blog; rockset.com/blog is a prolific blog that is very active. We have all sorts of stories there, and, you know, engineers talking about how they implemented certain things, to customer case studies. So, if you're really interested in this space, that's one place to follow and watch. If you're interested in giving this a spin, you know, you can go to rockset.com and start a free trial. If you want to talk to someone, there is, like, a ‘Request Demo' button there; you click it and one of our solutions people or somebody that is more familiar with Rockset would get in touch with you and you can have a conversation with them. Corey: Excellent. And links to that will of course go in the [show notes 00:34:20]. Thank you so much for your time today. I appreciate it. Venkat: Thanks, Corey. It was great. Corey: Venkat Venkataramani, co-founder and CEO at Rockset. I'm Cloud Economist Corey Quinn and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice along with an insulting crappy comment that I will immediately see show up on my real-time dashboard. Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started. Announcer: This has been a HumblePod production. Stay humble.
Dhruba Borthakur is CTO at Rockset and a passionate Data Engineer. Before co-founding Rockset, he played a big role in the development of Hadoop HDFS at Yahoo as well as HBase and RocksDB at Facebook. His current project is the serverless Rockset platform, where you can gain real-time analytics insights into your data. I tried it out before our talk and really liked it.
Angelo begins this episode with the predictions of Moore's Law. In the early years, systems were restricted by the CPU's ability to keep up. As CPUs continued to advance, the bottlenecks ended up around data movement. Moving information from disk to memory and from memory to cache became the big bottleneck. Then, of course, disks got faster, and eventually you'd have so much RAM on a machine that it was just memory movement inside of RAM. Eventually, we believe the bottleneck will return to the CPU. Quantum computing is on the rise—that, we believe, is a game changer for Moore's Law, because we're no longer talking about conventional computer chips and transistors; instead we're talking about something completely different. Additionally, as machine learning and artificial intelligence systems make advancements, like Angelo's thesis on using AI to tune data systems, the gains in speed will be orders of magnitude beyond traditional systems. Talent and staffing will also change as we adapt to the future. Angelo admires Google's practice of hiring ability over experience because the problems we face tomorrow are different from today's. The key thing is to be able to independently make progress because there isn't much room for babysitting. It's too hard to predict where the next fire will be. Angelo explains further why he hires ability over experience every single time: someone who has ability, someone who's brilliant and has the hunger to learn new things, can be programmed like a stem cell. They can just inject themselves into whatever problem they might have. Angelo transitions into his own personal story and his quest for fulfillment and happiness. He introduces a personal story of a boy who was dying on the Nazi-occupied island of Chios, Greece. A doctor took pity on this boy and secretly nursed him to health. We later learn that this boy is Angelo's father. Angelo shares, “My father grew up in a world much different than mine. His siblings related stories of famine and suffering, but he never ever spoke of those things. What he chose to relate were accounts of human triumph, perseverance, hope, aspiration. The sea was his salvation, carrying him from Chios as a sailor, eventually to the United States.” So, what is our true potential? Intellectual achievements can be ignored or forgotten. But to be a successful family person, a husband, a father, a human, Angelo needed to be something more, something enduring. Education builds the qualities of perseverance, hard work, and accomplishment. There is no doubt you'll accomplish many things, but think about what it is that you're really trying to do. You see, building technical solutions isn't just about doing interesting stuff. Ultimately, we're building these things for a reason. We're building technology. For example, if you're doing a healthcare application, it's going to touch somebody's life. That's the point of this breakthrough, right? You want to increase throughput, for example, in decision support, something Angelo spends a lot of time on. We want to, say, increase throughput: build a system that can compute faster on bigger sets of data. Why are we doing that? Just because of the challenge of the data? No, we want to find out if a clinical intervention is working so that we can feed that information forward to those making the guidelines. You see, that's the real reason behind doing this.
The great resignation has shown us that people care more about what it is they're doing and why they're doing it than about the work simply being interesting. We owe it to our families to use our gifts, talents, and opportunities to the best of our ability, but to use them on something that matters. Angelo is really excited that we're going to have interesting conversations around things like the universe, data centers, energy, and how they work. There's a reason the hard problem exists. Don't fixate on the fact that it's a problem. Although there is joy in having a problem and solving it. We're trying something a little bit new this season and we would love to hear which kinds of episodes you like most. Do you like interviews or do you like some of the educational discussion episodes? We're going to start a YouTube channel to help deep dive on topics like LSM trees or RocksDB, which are better served with diagrams than with just voice. Seeing the math for yourself or seeing the way that they operate for yourself on video is much more helpful. We're going to have supplementary content, bonus material that you can find on our YouTube channel, and we'll also have some bonus podcast episodes. We look forward to your feedback. Tell us what you like about the show, which topics you prefer, and what you wish we would dive a little deeper on. And we'll really try to do that.
Citations
Gordon Moore, Co-Founder of Intel
Heisenberg, Uncertainty Principle
Powell, James. (2008). The Quantum Limit to Moore's Law. Proceedings of the IEEE. 96. 1247-1248. 10.1109/JPROC.2008.925411.
Merritt, Rick. (2013). Moore's Law Dead by 2022, Expert Says. EETimes.
Atomic Hire (2019)
Further Reading
Moore's Law Ending
Work and Culture at Google
Google Strategy to Hire
About the Host
Angelo Kastroulis is an award-winning technologist, inventor, entrepreneur, speaker, data scientist, and author best known for his high-performance computing and Health IT experience. He is the principal consultant, lead architect, and owner of Carrera Group, a consulting firm specializing in software modernization, event streaming (Kafka), big data, analytics (Spark, Elasticsearch, and Graph), and high-performance software development on many technical stacks (Java, .NET, Scala, C++, and Rust). A Data Scientist at heart, trained at the Harvard Data Systems Lab, Angelo enjoys a research-driven approach to creating powerful, massively scalable applications and innovating new methods for superior performance. He loves to educate, discover, then see the knowledge through to practical implementation.
Host: Angelo Kastroulis
Executive Producer: Náture Kastroulis
Producer: Albert Perrotta
Communications Strategist: Albert Perrotta
Audio Engineer: Ryan Thompson
Music: All Things Grow by Oliver Worth
Data is the most important element of artificial intelligence, but how is that data managed and stored? In this episode of Utilizing AI, Adi Gelvan of Speedb goes deep under the hood to take a look at the data engine along with Frederic Van Haren and Stephen Foskett. Facebook's RocksDB provides the basic storage for many webscale projects, managing metadata at massive scale. Because of the inherent limits of RocksDB, most cloud applications shard data across many data engines. But Speedb takes a different approach, bringing more advanced storage technology to build a compatible data engine. A good data engine can massively improve overall performance, and data scientists and AI engineers would be wise to consider the storage engine, not just the processing components and models. Three Questions: Frederic Van Haren: In what areas will AI have little to no impact? Stephen: Is AI just a new aspect of data science or is it truly a unique field? Rob Telson of BrainChip: Where do you see AI having the most beneficial impact on our society? Guests and Hosts: Adi Gelvan, Co-Founder and CEO of Speedb. Find out more at www.speedb.io or reach out to Adi at adi@speedb.io. Frederic Van Haren, Founder at HighFens Inc., Consultancy & Services. Connect with Frederic on Highfens.com or on Twitter at @FredericVHaren. Stephen Foskett, Publisher of Gestalt IT and Organizer of Tech Field Day. Find Stephen's writing at GestaltIT.com and on Twitter at @SFoskett. Date: 2/08/2022 Tags: @SFoskett, @FredericVHaren, @speedb_io
About ABAB Periasamy is the co-founder and CEO of MinIO, an open source provider of high performance, object storage software. In addition to this role, AB is an active investor and advisor to a wide range of technology companies, from H2O.ai and Manetu where he serves on the board to advisor or investor roles with Humio, Isovalent, Starburst, Yugabyte, Tetrate, Postman, Storj, Procurify, and Helpshift. Successful exits include Gitter.im (Gitlab), Treasure Data (ARM) and Fastor (SMART).AB co-founded Gluster in 2005 to commoditize scalable storage systems. As CTO, he was the primary architect and strategist for the development of the Gluster file system, a pioneer in software defined storage. After the company was acquired by Red Hat in 2011, AB joined Red Hat's Office of the CTO. Prior to Gluster, AB was CTO of California Digital Corporation, where his work led to scaling of the commodity cluster computing to supercomputing class performance. His work there resulted in the development of Lawrence Livermore Laboratory's “Thunder” code, which, at the time was the second fastest in the world. AB holds a Computer Science Engineering degree from Annamalai University, Tamil Nadu, India.AB is one of the leading proponents and thinkers on the subject of open source software - articulating the difference between the philosophy and business model. An active contributor to a number of open source projects, he is a board member of India's Free Software Foundation.Links: MinIO: https://min.io/ Twitter: https://twitter.com/abperiasamy MinIO Slack channel: https://minio.slack.com/join/shared_invite/zt-11qsphhj7-HpmNOaIh14LHGrmndrhocA LinkedIn: https://www.linkedin.com/in/abperiasamy/ TranscriptAnnouncer: Hello, and welcome to Screaming in the Cloud with your host, Chief Cloud Economist at The Duckbill Group, Corey Quinn. This weekly show features conversations with people doing interesting work in the world of cloud, thoughtful commentary on the state of the technical world, and ridiculous titles for which Corey refuses to apologize. This is Screaming in the Cloud.Corey: This episode is sponsored in part by our friends at Sysdig. Sysdig is the solution for securing DevOps. They have a blog post that went up recently about how an insecure AWS Lambda function could be used as a pivot point to get access into your environment. They've also gone deep in-depth with a bunch of other approaches to how DevOps and security are inextricably linked. To learn more, visit sysdig.com and tell them I sent you. That's S-Y-S-D-I-G dot com. My thanks to them for their continued support of this ridiculous nonsense.Corey: This episode is sponsored in part by our friends at Rising Cloud, which I hadn't heard of before, but they're doing something vaguely interesting here. They are using AI, which is usually where my eyes glaze over and I lose attention, but they're using it to help developers be more efficient by reducing repetitive tasks. So, the idea being that you can run stateless things without having to worry about scaling, placement, et cetera, and the rest. They claim significant cost savings, and they're able to wind up taking what you're running as it is, in AWS, with no changes, and run it inside of their data centers that span multiple regions. 
I'm somewhat skeptical, but their customers seem to really like them, so that's one of those areas where I really have a hard time being too snarky about it because when you solve a customer's problem, and they get out there in public and say, “We're solving a problem,” it's very hard to snark about that. Multus Medical, Construx.ai, and Stax have seen significant results by using them, and it's worth exploring. So, if you're looking for a smarter, faster, cheaper alternative to EC2, Lambda, or batch, consider checking them out. Visit risingcloud.com/benefits. That's risingcloud.com/benefits, and be sure to tell them that I sent you because watching people wince when you mention my name is one of the guilty pleasures of listening to this podcast. Corey: Welcome to Screaming in the Cloud. I'm Corey Quinn. I'm joined this week by someone who's doing something a bit off the beaten path when we talk about cloud. I've often said that S3 is sort of a modern wonder of the world. It was the first AWS service brought into general availability. Today's promoted guest is the co-founder and CEO of MinIO, Anand Babu Periasamy, or AB as he often goes, depending upon who's talking to him. Thank you so much for taking the time to speak with me today. AB: It's wonderful to be here, Corey. Thank you for having me. Corey: So, I want to start with the obvious thing, where you take a look at what is the cloud and you can talk about AWS's ridiculous high-level managed services, like Amazon Chime. Great, we all see how that plays out. And those are the higher-level offerings, ideally aimed at problems customers have, but then they also have the baseline building blocks services, and it's hard to think of a more baseline building block than an object store. That's something every cloud provider has, regardless of how many scare quotes there are around the word cloud; everyone offers the object store. And your solution is to look at this and say, “Ah, that's a market ripe for disruption. We're going to build, through an open-source community, software that emulates an object store.” I would be sitting here, more or less poking fun at the idea except for the fact that you're a billion-dollar company now. AB: Yeah. Corey: How did you get here? AB: So, when we started, right, we did not actually think about cloud that way, right? “Cloud, it's a hot trend, and let's go disrupt is like that. It will lead to a lot of opportunity.” Certainly, it's true, it led to the M&A, right, but that's not how we looked at it, right? It's a bad idea to build startups for M&A. When we looked at the problem, when we got back into this—my previous background, some may not know that it's actually a distributed file system background in the open-source space. Corey: Yeah, you were one of the co-founders of Gluster— AB: Yeah. Corey: —which I have only begrudgingly forgiven you for. But please continue. AB: [laugh]. And back then we got the idea right, but the timing was wrong. And I had—while the data was beginning to grow at a crazy rate, at the end of the day, GlusterFS still had to look like an FS, it had to look like a file system like NetApp or EMC, and it was hugely limiting what we could do with it. The biggest problem for me was legacy systems. If I have to build a modern system that is compatible with a legacy architecture, I cannot innovate. And that is where, when Amazon introduced S3, back then, like, when S3 came, cloud was not big at all, right?
When I look at it, the most important message of the cloud was Amazon basically threw everything that is legacy. It's not [iSCSI 00:03:21] as a Service; it's not even FTP as a Service, right? They came up with a simple, RESTful API to store your blobs, whether it's JavaScript, Android, iOS, or [AAML 00:03:30] application, or even Snowflake-type application.Corey: Oh, we spent ten years rewriting our apps to speak object store, and then they released EFS, which is NFS in the cloud. It's—AB: Yeah.Corey: —I didn't realize I could have just been stubborn and waited, and the whole problem would solve itself. But here we are. You're quite right.AB: Yeah. And even EFS and EBS are more for legacy stock can come in, buy some time, but that's not how you should stay on AWS, right? When Amazon did that, for me, that was the opportunity. I saw that… while world is going to continue to produce lots and lots of data, if I built a brand around that, I'm not going to go wrong.The problem is data at scale. And what do I do there? The opportunity I saw was, Amazon solved one of the largest problems for a long time. All the legacy systems, legacy protocols, they convinced the industry, throw them away and then start all over from scratch with the new API. While it's not compatible, it's not standard, it is ridiculously simple compared to anything else.No fstabs, no [unintelligible 00:04:27], no [root 00:04:28], nothing, right? From any application anywhere you can access was a big deal. When I saw that, I was like, “Thank you Amazon.” And I also knew Amazon would convince the industry that rewriting their application is going to be better and faster and cheaper than retrofitting legacy applications.Corey: I wonder how much that's retconned because talking to some of the people involved in the early days, they were not at all convinced they [laugh] would be able to convince the industry to do this.AB: Actually, if you talk to the analyst reporters, the IDC's, Gartner's of the world to the enterprise IT, the VMware community, they would say, “Hell no.” But if you talk to the actual application developers, data infrastructure, data architects, the actual consumers of data, for them, it was so obvious. They actually did not know how to write an fstab. The iSCSI and NFS, you can't even access across the internet, and the modern applications, they ran across the globe, in JavaScript, and all kinds of apps on the device. From [Snap 00:05:21] to Snowflake, today is built on object store. It was more natural for the applications team, but not from the infrastructure team. So, who you asked that mattered.But nevertheless, Amazon convinced the rest of the world, and our bet was that if this is going to be the future, then this is also our opportunity. S3 is going to be limited because it only runs inside AWS. Bulk of the world's data is produced everywhere and only a tiny fraction will go to AWS. And where will the rest of the data go? Not SAN, NAS, HDFS, or other blob store, Azure Blob, or GCS; it's not going to be fragmented. And if we built a better object store, lightweight, faster, simpler, but fully compatible with S3 API, we can sweep and consolidate the market. And that's what happened.Corey: And there is a lot of validity to that. We take a look across the industry, when we look at various standards—I mean, one of the big problems with multi-cloud in many respects is the APIs are not quite similar enough. 
And worse, the failure patterns are very different, of I don't just need to know how the load balancer works, I need to know how it breaks so I can detect and plan for that. And then you've got the whole identity problem as well, where you're trying to manage across different frames of reference as you go between providers, and leads to a bit of a mess. What is it that makes MinIO something that has been not just something that has endured since it was created, but clearly been thriving?AB: The real reason, actually is not the multi-cloud compatibility, all that, right? Like, while today, it is a big deal for the users because the deployments have grown into 10-plus petabytes, and now the infrastructure team is taking it over and consolidating across the enterprise, so now they are talking about which key management server for storing the encrypted keys, which key management server should I talk to? Look at AWS, Google, or Azure, everyone has their own proprietary API. Outside they, have [YAML2 00:07:18], HashiCorp Vault, and, like, there is no standard here. It is supposed to be a [KMIP 00:07:23] standard, but in reality, it is not. Even different versions of Vault, there are incompatibilities for us.That is where—like from Key Management Server, Identity Management Server, right, like, everything that you speak around, how do you talk to different ecosystem? That, actually, MinIO provides connectors; having the large ecosystem support and large community, we are able to address all that. Once you bring MinIO into your application stack like you would bring Elasticsearch or MongoDB or anything else as a container, your application stack is just a Kubernetes YAML file, and you roll it out on any cloud, it becomes easier for them, they're able to go to any cloud they want. But the real reason why it succeeded was not that. They actually wrote their applications as containers on Minikube, then they will push it on a CI/CD environment.They never wrote code on EC2 or ECS writing objects on S3, and they don't like the idea of [past 00:08:15], where someone is telling you just—like you saw Google App Engine never took off, right? They liked the idea, here are my building blocks. And then I would stitch them together and build my application. We were part of their application development since early days, and when the application matured, it was hard to remove. It is very much like Microsoft Windows when it grew, even though the desktop was Microsoft Windows Server was NetWare, NetWare lost the game, right?We got the ecosystem, and it was actually developer productivity, convenience, that really helped. The simplicity of MinIO, today, they are arguing that deploying MinIO inside AWS is easier through their YAML and containers than going to AWS Console and figuring out how to do it.Corey: As you take a look at how customers are adopting this, it's clear that there is some shift in this because I could see the story for something like MinIO making an awful lot of sense in a data center environment because otherwise, it's, “Great. I need to make this app work with my SAN as well as an object store.” And that's sort of a non-starter for obvious reasons. But now you're available through cloud marketplaces directly.AB: Yeah.Corey: How are you seeing adoption patterns and interactions from customers changing as the industry continues to evolve?AB: Yeah, actually, that is how my thinking was when I started. If you are inside AWS, I would myself tell them that why don't use AWS S3? 
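As a small illustration of that drop-in compatibility, and not something taken from the conversation itself, the same boto3 S3 client an application already uses can simply be pointed at a MinIO endpoint; the endpoint, credentials, and bucket name below are placeholders.

```python
import boto3

# Point the standard S3 client at a MinIO deployment instead of AWS.
# Endpoint URL and credentials are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.example.internal:9000",
    aws_access_key_id="MINIO_ACCESS_KEY",
    aws_secret_access_key="MINIO_SECRET_KEY",
)

# The application code itself is unchanged from what it would run against AWS S3.
s3.create_bucket(Bucket="app-data")
s3.put_object(Bucket="app-data", Key="reports/2022-01.json", Body=b'{"ok": true}')
print(s3.get_object(Bucket="app-data", Key="reports/2022-01.json")["Body"].read())
```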
And it made a lot of sense if it's on a colo or your own infrastructure, then there is an object store. It even made a lot of sense if you are deploying on Google Cloud, Azure, Alibaba Cloud, Oracle Cloud, it made a lot of sense because you wanted an S3 compatible object store. Inside AWS, why would you do it, if there is AWS S3?Nowadays, I hear funny arguments, too. They like, “Oh, I didn't know that I could use S3. Is S3 MinIO compatible?” Because they will be like, “It came along with the GitLab or GitHub Enterprise, a part of the application stack.” They didn't even know that they could actually switch it over.And otherwise, most of the time, they developed it on MinIO, now they are too lazy to switch over. That also happens. But the real reason that why it became serious for me—I ignored that the public cloud commercialization; I encouraged the community adoption. And it grew to more than a million instances, like across the cloud, like small and large, but when they start talking about paying us serious dollars, then I took it seriously. And then when I start asking them, why would you guys do it, then I got to know the real reason why they wanted to do was they want to be detached from the cloud infrastructure provider.They want to look at cloud as CPU network and drive as a service. And running their own enterprise IT was more expensive than adopting public cloud, it was productivity for them, reducing the infrastructure, people cost was a lot. It made economic sense.Corey: Oh, people always cost more the infrastructure itself does.AB: Exactly right. 70, 80%, like, goes into people, right? And enterprise IT is too slow. They cannot innovate fast, and all of those problems. But what I found was for us, while we actually build the community and customers, if you're on AWS, if you're running MinIO on EBS, EBS is three times more expensive than S3.Corey: Or a single copy of it, too, where if you're trying to go multi-AZ and you have the replication traffic, and not to mention you have to over-provision it, which is a bit of a different story as well. So, like, it winds up being something on the order of 30 times more expensive, in many cases, to do it right. So, I'm looking at this going, the economics of running this purely by itself in AWS don't make sense to me—long experience teaches me the next question of, “What am I missing?” Not, “That's ridiculous and you're doing it wrong.” There's clearly something I'm not getting. What am I missing?AB: I was telling them until we made some changes, right—because we saw a couple of things happen. I was initially like, [unintelligible 00:12:00] does not make 30 copies. It makes, like, 1.4x, 1.6x.But still, the underlying block storage is not only three times more expensive than S3, it's also slow. It's a network storage. Trying to put an object store on top of it, another, like, software-defined SAN, like EBS made no sense to me. Smaller deployments, it's okay, but you should never scale that on EBS. So, it did not make economic sense. I would never take it seriously because it would never help them grow to scale.But what changed in recent times? Amazon saw that this was not only a problem for MinIO-type players. Every database out there today, every modern database, even the message queues like Kafka, they all have gone scale-out. And they all depend on local block store and putting a scale-out distributed database, data processing engines on top of EBS would not scale. And Amazon introduced storage optimized instances. 
Essentially, that reduced to bet—the data infrastructure guy, data engineer, or application developer asking IT, “I want a SuperMicro, or Dell server, or even virtual machines.” That's too slow, too inefficient.They can provision these storage machines on demand, and then I can do it through Kubernetes. These two changes, all the public cloud players now adopted Kubernetes as the standard, and they have to stick to the Kubernetes API standard. If they are incompatible, they won't get adopted. And storage optimized that is local drives, these are machines, like, [I3 EN 00:13:23], like, 24 drives, they have SSDs, and fast network—like, 25-gigabit 200-gigabit type network—availability of these machines, like, what typically would run any database, HDFS cluster, MinIO, all of them, those machines are now available just like any other EC2 instance.They are efficient. You can actually put MinIO side by side to S3 and still be price competitive. And Amazon wants to—like, just like their retail marketplace, they want to compete and be open. They have enabled it. In that sense, Amazon is actually helping us. And it turned out that now I can help customers build multiple petabyte infrastructure on Amazon and still stay efficient, still stay price competitive.Corey: I would have said for a long time that if you were to ask me to build out the lingua franca of all the different cloud providers into a common API, the S3 API would be one of them. Now, you are building this out, multi-cloud, you're in all three of the major cloud marketplaces, and the way that you do that and do those deployments seems like it is the modern multi-cloud API of Kubernetes. When you first started building this, Kubernetes was very early on. What was the evolution of getting there? Or were you one of the first early-adoption customers in a Kubernetes space?AB: So, when we started, there was no Kubernetes. But we saw the problem was very clear. And there was containers, and then came Docker Compose and Swarm. Then there was Mesos, Cloud Foundry, you name it, right? Like, there was many solutions all the way up to even VMware trying to get into that space.And what did we do? Early on, I couldn't choose. I couldn't—it's not in our hands, right, who is going to be the winner, so we just simply embrace everybody. It was also tiring that to allow implement native connectors to all of them different orchestration, like Pivotal Cloud Foundry alone, they have their own standard open service broker that's only popular inside their system. Go outside elsewhere, everybody was incompatible.And outside that, even, Chef Ansible Puppet scripts, too. We just simply embraced everybody until the dust settle down. When it settled down, clearly a declarative model of Kubernetes became easier. Also Kubernetes developers understood the community well. And coming from Borg, I think they understood the right architecture. And also written in Go, unlike Java, right?It actually matters, these minute new details resonating with the infrastructure community. It took off, and then that helped us immensely. Now, it's not only Kubernetes is popular, it has become the standard, from VMware to OpenShift to all the public cloud providers, GKS, AKS, EKS, whatever, right—GKE. All of them now are basically Kubernetes standard. It made not only our life easier, it made every other [ISV 00:16:11], other open-source project, everybody now can finally write one code that can be operated portably.It is a big shift. 
It is not because we chose; we just watched all this, we were riding along the way. And then because we resonated with the infrastructure community, modern infrastructure is dominated by open-source. We were also the leading open-source object store, and as Kubernetes community adopted us, we were naturally embraced by the community.Corey: Back when AWS first launched with S3 as its first offering, there were a bunch of folks who were super excited, but object stores didn't make a lot of sense to them intrinsically, so they looked into this and, “Ah, I can build a file system and users base on top of S3.” And the reaction was, “Holy God don't do that.” And the way that AWS decided to discourage that behavior is a per request charge, which for most workloads is fine, whatever, but there are some that causes a significant burden. With running something like MinIO in a self-hosted way, suddenly that costing doesn't exist in the same way. Does that open the door again to so now I can use it as a file system again, in which case that just seems like using the local file system, only with extra steps?AB: Yeah.Corey: Do you see patterns that are emerging with customers' use of MinIO that you would not see with the quote-unquote, “Provider's” quote-unquote, “Native” object storage option, or do the patterns mostly look the same?AB: Yeah, if you took an application that ran on file and block and brought it over to object storage, that makes sense. But something that is competing with object store or a layer below object store, that is—end of the day that drives our block devices, you have a block interface, right—trying to bring SAN or NAS on top of object store is actually a step backwards. They completely missed the message that Amazon told that if you brought a file system interface on top of object store, you missed the point, that you are now bringing the legacy things that Amazon intentionally removed from the infrastructure. Trying to bring them on top doesn't make it any better. If you are arguing from a compatibility some legacy applications, sure, but writing a file system on top of object store will never be better than NetApp, EMC, like EMC Isilon, or anything else. Or even GlusterFS, right?But if you want a file system, I always tell the community, they ask us, “Why don't you add an FS option and do a multi-protocol system?” I tell them that the whole point of S3 is to remove all those legacy APIs. If I added POSIX, then I'll be a mediocre object storage and a terrible file system. I would never do that. But why not write a FUSE file system, right? Like, S3Fs is there.In fact, initially, for legacy compatibility, we wrote MinFS and I had to hide it. We actually archived the repository because immediately people started using it. Even simple things like end of the day, can I use Unix [Coreutils 00:19:03] like [cp, ls 00:19:04], like, all these tools I'm familiar with? If it's not file system object storage that S3 [CMD 00:19:08] or AWS CLI is, like, to bloatware. And it's not really Unix-like feeling.Then what I told them, “I'll give you a BusyBox like a single static binary, and it will give you all the Unix tools that works for local filesystem as well as object store.” That's where the [MC tool 00:19:23] came; it gives you all the Unix-like programmability, all the core tool that's object storage compatible, speaks native object store. 
But if I have to make object store look like a file system so UNIX tools would run, it would not only be inefficient, Unix tools never scaled for this kind of capacity. So, it would be a bad idea to take a step backwards and bring legacy stuff back inside. For some very small cases, if there are simple POSIX calls, using [ObjectiveFs 00:19:49], S3Fs, and a few others for legacy compatibility reasons makes sense, but in general, I would tell the community don't bring file and block. If you want file and block, leave those on virtual machines and leave that infrastructure in a silo and gradually phase them out. Corey: This episode is sponsored in part by our friends at Vultr. Spelled V-U-L-T-R because they're all about helping save money, including on things like, you know, vowels. So, what they do is they are a cloud provider that provides surprisingly high-performance cloud compute at a price that—while sure they claim it's better than AWS pricing—and when they say that they mean it is less money. Sure, I don't dispute that, but what I find interesting is that it's predictable. They tell you in advance on a monthly basis what it's going to cost. They have a bunch of advanced networking features. They have nineteen global locations and scale things elastically. Not to be confused with openly, because apparently elastic and open can mean the same thing sometimes. They have had over a million users. Deployments take less than sixty seconds across twelve pre-selected operating systems. Or, if you're one of those nutters like me, you can bring your own ISO and install basically any operating system you want. Starting with pricing as low as $2.50 a month for Vultr cloud compute, they have plans for developers and businesses of all sizes, except maybe Amazon, who stubbornly insists on having something to scale all on their own. Try Vultr today for free by visiting: vultr.com/screaming, and you'll receive $100 in credit. That's v-u-l-t-r.com slash screaming. Corey: So, my big problem, when I look at what S3 has done, is in its name because, of course, naming is hard. It's, “Simple Storage Service.” The problem I have is with the word simple because over time, S3 has gotten more and more complex under the hood. It automatically tiers data the way that customers want. And integrated with things like Athena, you can now query it directly; whenever an object appears, you can wind up automatically firing off Lambda functions and the rest. And this is increasingly looking a lot less like a place to just dump my unstructured data, and increasingly, a lot like this is sort of a database, in some respects. Now, understand my favorite database is Route 53; I have a long and storied history of misusing services as databases. Is this one of those scenarios, or is there some legitimacy to the idea of turning this into a database? AB: Actually, there is now the S3 Select API: if you're storing unstructured data like CSV, JSON, or Parquet, without downloading even a compressed CSV, you can actually send a SQL query into the system. In MinIO particularly, the S3 Select is [CMD 00:21:16] optimized. We can load, like, every 64k worth of CSV lines into registers and do CMD operations. It's the fastest SQL filter out there. Now, bringing these kinds of capabilities, we are just a little bit away from a database; should we do a database? I would say definitely no. The very strength of the S3 API is to actually limit all the mutations, right?
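The S3 Select capability described a moment ago can be sketched with the standard boto3 call, which works against S3 and S3-compatible stores; bucket, key, and column names here are assumptions made for the example.

```python
import boto3

s3 = boto3.client("s3")  # or an S3-compatible endpoint, as in the earlier sketch

# Push a SQL filter down to the object store instead of downloading the whole CSV.
# Bucket, key, and column names are illustrative.
response = s3.select_object_content(
    Bucket="app-data",
    Key="events/2022-01.csv.gz",
    ExpressionType="SQL",
    Expression="SELECT s.player_id, s.score FROM s3object s WHERE CAST(s.score AS INT) > 9000",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}, "CompressionType": "GZIP"},
    OutputSerialization={"JSON": {}},
)

# Results stream back as events; only the filtered rows cross the wire.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"), end="")
```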
Particularly if you look at database, they're dealing with metadata, and querying; the biggest value they bring is indexing the metadata. But if I'm dealing with that, then I'm dealing with really small block lots of mutations, the separation of objects storage should be dealing with persistence and not mutations. Mutations are [AWS 00:21:57] problem. Separation of database work function and persistence function is where object storage got the storage right.Otherwise, it will, they will make the mistake of doing POSIX-like behavior, and then not only bringing back all those capabilities, doing IOPS intensive workloads across the HTTP, it wouldn't make sense, right? So, object storage got the API right. But now should it be a database? So, it definitely should not be a database. In fact, I actually hate the idea of Amazon yielding to the file system developers and giving a [file three 00:22:29] hierarchical namespace so they can write nice file managers.That was a terrible idea. Writing a hierarchical namespace that's also sorted, now puts tax on how the metadata is indexed and organized. The Amazon should have left the core API very simple and told them to solve these problems outside the object store. Many application developers don't need. Amazon was trying to satisfy everybody's need. Saying no to some of these file system-type, file manager-type users, what should have been the right way.But nevertheless, adding those capabilities, eventually, now you can see, S3 is no longer simple. And we had to keep that compatibility, and I hate that part. I actually don't mind compatibility, but then doing all the wrong things that Amazon is adding, now I have to add because it's compatible. I kind of hate that, right?But now going to a database would be pushing it to the whole new level. Here is the simple reason why that's a bad idea. The right way to do database—in fact, the database industry is already going in the right direction. Unstructured data, the key-value or graph, different types of data, you cannot possibly solve all that even in a single database. They are trying to be multimodal database; even they are struggling with it.You can never be a Redis, Cassandra, like, a SQL all-in-one. They tried to say that but in reality, that you will never be better than any one of those focused database solutions out there. Trying to bring that into object store will be a mistake. Instead, let the databases focus on query language implementation and query computation, and leave the persistence to object store. So, object store can still focus on storing your database segments, the table segments, but the index is still in the memory of the database.Even the index can be snapshotted once in a while to object store, but use objects store for persistence and database for query is the right architecture. And almost all the modern databases now, from Elasticsearch to [unintelligible 00:24:21] to even Kafka, like, message queue. They all have gone that route. Even Microsoft SQL Server, Teradata, Vertica, name it, Splunk, they all have gone object storage route, too. Snowflake itself is a prime example, BigQuery and all of them.That's the right way. Databases can never be consolidated. There will be many different kinds of databases. Let them specialize on GraphQL or Graph API, or key-value, or SQL. Let them handle the indexing and persistence, they cannot handle petabytes of data. 
That [unintelligible 00:24:51] to the object store is how the industry is shaping up, and it is going in the right direction.

Corey: One of the ways I learn the most about various services is by talking to customers. Every time I think I've seen everything, this is amazing, this service is something I completely understand, all I have to do is talk to one more customer. And when I was doing a bill analysis project a couple of years ago, I looked into a customer's account and saw a bucket with, okay, that has 280 billion objects in it—and wait, was that billion with a B?

And I asked them, "So, what's going on over there?" And they're, "Well, we built our own columnar database on top of S3. This may not have been the best approach." It's, "I'm going to stop you there. With no further context, it was not, but please continue."

It's the sort of thing that would never have occurred to me to even try. Do you tend to see similar—I would say they're anti-patterns, except somehow they're made to work—things in some of your customer environments, where they are using the service in ways that are very different from the ways encouraged or even allowed by the native object store options?

AB: Yeah, when I first started seeing the database-type workloads coming onto MinIO, I was surprised, too. That was exactly my reaction. In fact, they were storing these 256k, sometimes 64k table segments because they need to index them, right, and the table segments were anywhere between 64k and 2MB. And when they started writing table segments, it was more often an [IOPS-type 00:26:22] I/O pattern than a throughput-type pattern. Throughput is an easier problem to solve, and MinIO always saturated these 100-gigabyte NVMe-type drives; it was throughput optimized, while these workloads were I/O intensive.

When I started seeing the database workloads, I had to optimize for small-object workloads, too. We actually did all that because eventually I got convinced the right way to build a database was to actually leave the persistence out of the database; they actually made a compelling argument. Historically, I thought of metadata and data separately: data would be very big, so coming to the object store makes sense; metadata should be stored in a database, and that's only the index pages. Take any book, the index pages are only a few. The database can continue to run adjacent to the object store; it's a clean architecture.

But why would you put the database itself on the object store? When I saw a transactional database like MySQL changing [InnoDB 00:27:14] to [RocksDB 00:27:15], and making changes at that layer to write the SSTables [unintelligible 00:27:19] to MinIO, I was like, where do you store the memory, the journal? They said, "That will go to Kafka." And I was like—I thought that was insane when it started. But it continued to grow and grow.

Nowadays, I see most of the databases have gone to the object store, but their argument is, the databases also saw explosive growth in data, and they couldn't scale the persistence part. That is where they realized that they are still very good at the indexing part, which object storage would never give. There is no API to do sophisticated queries of the data. You cannot peek inside the data; you can just do streaming reads and writes.

And that is where the databases were still necessary. But databases were also growing in data. One thing that triggered this was that the use case moved from data generated by people to data generated by machines. Machines means applications, all kinds of devices.
Now, it's like going from seven billion people to a trillion devices is how the industry is changing. And this led to lots of machine-generated, semi-structured and structured data at giant scale coming into databases. The databases need to handle scale. There was no other way to solve this problem other than leaving the—[unintelligible 00:28:31] if you're looking at columnar data, most of it is machine-generated data; where else would you store it? If they tried to build their own object storage embedded into the database, it would make the database monumentally complicated. Let them focus on what they are good at: indexing and mutations. Pull the table segments, which are immutable, mutate them in memory, and then commit them back; that gives the right mix. That was the fastest shift that happened, and we saw it consistently across the board. Now, it is actually the standard.

Corey: So, you started working on this in 2014, and here we are—what is it—eight years later now, and you've just announced a Series B of $100 million on a billion-dollar valuation. So, it turns out this is not just one of those things people are using for test labs; there is significant momentum behind using this. How did you get there from—because everything you're saying makes an awful lot of sense, but it feels, at least from where I sit, to be a little bit of a niche. It's a bit of an edge case that is not the common case. Obviously, I'm missing something, because your investors are not the types of sophisticated investors who see something ridiculous and go, "Yep. That's the thing we're going to go for." They're right more than they're not.

AB: Yeah. The reason for that was they saw what we had set out to do. In fact, these are—if you see the lead investor, Intel, they watched us grow. They came into the Series A and they saw, every day, how we operated and grew. They believed in our message.

And it was actually not about the object store, right? Object storage was a means for us to get into the market. When we started, our idea was, ten years from now, what will be a big problem? A lot of times, it's hard to see the future, but if you zoom out, it's hidden in plain sight.

These are simple trends. Every major trend pointed to the world producing more data. No one would argue with that. If I solved one important problem that everybody is suffering from, I wouldn't go wrong. And when you solve the problem, it's about building a product with fine craftsmanship, attention to detail, connecting with the user, all of that standard stuff.

But I picked object storage as the problem because the industry was fragmented across many different data stores, and I knew that wouldn't be the case ten years from now. Applications are not going to adopt different APIs across different clouds, from S3 to GCS to Azure Blob to HDFS, where everything is incompatible. I saw that if I built a data store for persistence, the industry would consolidate around the S3 API. When we started, Amazon S3 looked like the giant; there was only one cloud, and the industry believed in mono-cloud. Almost everyone was talking to me like AWS would be the world's data center.

I certainly see that possibility, Amazon is capable of doing it, but my bet was the other way: that AWS S3 would be one of many solutions, but not—if it's all incompatible, it's not going to work; the industry will consolidate. Our bet was, if the world is producing so much data and you build an object store that is S3 compatible, and it ends up as the leading data store of the world and owns the application ecosystem, you cannot go wrong.
We kept our heads down and focused for the first six years on massive adoption, building the ecosystem to a scale where we could say our ecosystem is equal to or larger than Amazon's; then we are in business. We didn't focus on commercialization; we focused on convincing the industry that this is the right technology for them to use. Once they are convinced, once you solve business problems, making money is not hard because they are already sold, they are in love with the product. Then convincing them to pay is not a big deal, because data is such a critical, central part of their business.

We didn't worry about commercialization; we worried about adoption. And once we got the adoption, customers started coming to us and they're like, "I don't want an open-source license violation. I don't want a data breach or data loss." They are trying to sell to me, and it's an easy relationship game. And it's about long-term partnership with customers.

And so the business started growing and accelerating. That was the reason that now is the time to fill up the gas tank, and investors were quite excited about the commercial traction as well. And all the intangibles, right, how big we grew in the last few years.

Corey: It really is an interesting segment that has always been something I've mostly ignored, like, "Oh, you want to run your own? Okay, great." I get it; some people want to cosplay as cloud providers themselves. Awesome. There's clearly a lot more to it than that, and I'm really interested to see what the future holds for you folks.

AB: Yeah, I'm excited. I think, at the end of the day, if I solve real problems, every organization is moving from being compute-technology-centric to data-centric, and they're all looking at data warehouses, data lakes, or whatever name they give data infrastructure. Data is now the centerpiece. Software is a commodity. That's how they are looking at it. And it is translating to each of these large organizations—actually, even the mid-size, even startups nowadays have petabytes of data—and I see a huge potential here. The timing is perfect for us.

Corey: I'm really excited to see this continue to grow. And I want to thank you for taking so much time to speak with me today. If people want to learn more, where can they find you?

AB: I'm always in the community, right. Twitter and, like, I think the Slack channel; it's quite easy to reach out to me. LinkedIn. I'm always excited to talk to our users and community.

Corey: And we will of course put links to this in the [show notes 00:33:58]. Thank you so much for your time. I really appreciate it.

AB: Again, wonderful to be here, Corey.

Corey: Anand Babu Periasamy, CEO and co-founder of MinIO. I'm Cloud Economist Corey Quinn, and this is Screaming in the Cloud. If you've enjoyed this podcast, please leave a five-star review on your podcast platform of choice, whereas if you've hated this podcast, please leave a five-star review on your podcast platform of choice, along with what starts out as an angry comment but eventually turns into you, in your position on the S3 product team, writing a thank-you note to MinIO for helping validate your market.

Corey: If your AWS bill keeps rising and your blood pressure is doing the same, then you need The Duckbill Group. We help companies fix their AWS bill by making it smaller and less horrifying. The Duckbill Group works for you, not AWS. We tailor recommendations to your business and we get to the point. Visit duckbillgroup.com to get started.

Announcer: This has been a HumblePod production. Stay humble.
Orchestrate all the Things podcast: Connecting the Dots with George Anadiotis
RocksDB is the secret sauce underlying many data management systems. Speedb is a drop-in replacement for RocksDB that offers a significant boost in performance and now powers Redis on Flash. Article published on ZDNet
Matt Yonkovit, The HOSS at Percona, sits down with Nagavamsi (Vamsi) Ponnekanti, Software Engineer at Quora. During the show we dive into the details on how and why Quora moved from HBase to RocksDB via MyRocks. We also learn about the need to reduce latency and improve predictability in performance in large infrastructure and database systems. Vamsi highlights some of his favorite features and tools and gives us tips and tricks on database migrations.
Step with Venkat into a world where data is always fresh, queries run in 1ms, and analytics engineers build web-scale, real-time data apps. As Engineering Director at Facebook, Venkat helped build the RocksDB real-time database that powered growth to 5 billion queries per second(!)—and now with his colleagues at Rockset, he's bringing that real-time database infrastructure to the rest of us. In this conversation, Tristan, Julia and Venkat explore the fundamental technological advances that are empowering analytics engineers to enter the real-time future. For full show notes and to read 6+ years of back issues of the podcast's companion newsletter, head to https://roundup.getdbt.com. The Analytics Engineering Podcast is sponsored by dbt Labs.
We had a very fun and engaging chat with Matt Yonkovit who is the Chief Experience Officer at Percona, a service provider for open source databases like MySQL, PostgreSQL, MariaDB and RocksDB. In total, Matt's been working with databases and open source for nearly 25 years. Link to transcript: https://openteams.com/landing/podcast-ep-11/ Percona’s Database Performance Blog: https://www.percona.com/blog/ Percona’s Open Source Data Management Survey: https://www.percona.com/open-source-data-management-software-survey Follow OpenTeams on: Twitter: https://twitter.com/openteamsinc LinkedIn: https://www.linkedin.com/company/open... Facebook: https://www.facebook.com/openteamsinc Instagram: https://www.instagram.com/openteams/ Support this podcast by liking this video and subscribing to OpenTeams’ YouTube channel: https://bit.ly/2ZBPGnt You can also show support for this podcast by leaving a rating and review for the podcast on Apple Podcasts. Link to podcast channel: https://apple.co/3itAzne Thanks for listening!
All the show notes are available at https://www.clever-cloud.com/fr/podcast/episode16 With, in order of appearance: @ldoguin @noisegrrrl @blackyoup @judu Twitter is looking into why its photo preview appears to favor white faces over Black faces https://www.theverge.com/2020/9/20/21447998/twitter-photo-preview-white-black-faces Millions of black people affected by racial bias in health-care algorithms https://www.nature.com/articles/d41586-019-03228-6 Predictive policing algorithms are racist. They need to be dismantled. https://www.technologyreview.com/2020/07/17/1005396/predictive-policing-algorithms-racist-dismantled-machine-learning-bias-criminal-justice/ Mickael Roger's segment on ITAR @MickaelRoger https://twitter.com/MickaelRoger/status/1307374729666461697 RFC: "Captive portals are junk, but we have to live with them". Captive Portal API + DHCP extensions https://www.bortzmeyer.org/8910.html https://www.bortzmeyer.org/8908.html The DNS is not responding https://www.bortzmeyer.org/8906.html Percona added RocksDB as an alternative to InnoDB in MySQL https://www.percona.com/blog/2020/02/20/when-to-use-myrocks-in-mysql/ CockroachDB replaces RocksDB with Pebble https://www.cockroachlabs.com/blog/pebble-rocksdb-kv-store New Warp10 release https://blog.senx.io/warp-10-release-2-7-0-flows New Firefox version https://www.mozilla.org/en-US/firefox/81.0/releasenotes/ https://support.mozilla.org/en-US/kb/view-pdf-files-firefox-or-choose-another-viewer Seeing like the worst case of color blindness: what is it like? https://www.youtube.com/watch?v=bZRD9_waays Visualization in the upcoming Wireshark release https://twitter.com/chrissanders88/status/1306262108665982976?s=20 IP Fragmentation: https://www.bortzmeyer.org/8900.html Microsoft brings its underwater data center back up https://www.bbc.com/news/technology-54146718 Judu's musical pick: https://www.youtube.com/watch?v=xy1vUfk3Nqk
Hiếu Phạm was previously a Senior Software Engineer at Facebook and is currently a Senior Software Engineer at Rockset, a database startup built on RocksDB. In this episode, Hiếu shares what life is like in Silicon Valley, what to grow, what to raise and what to eat, as well as other interesting things about this hottest place in the world for technology.
Welcome to our 5th episode. This is the second part of a two-part series where we go deep into the internals of Yugabyte with Karthik and Kannan. Yugabyte is a highly scalable and developer-friendly open source distributed SQL database. Yugabyte is built by an ex-Facebook team that wanted to bring what they learnt running one of the largest databases on the planet out into the open source world. Learn more about how the shared-nothing architecture used by Yugabyte works and how the team built Postgres and other API layers on top of a highly-scalable document DB powered by their own fork of RocksDB. Our guests for this episode are: Kannan Muthukkaruppan, Founder & President, Product Dev. @ Yugabyte Karthik Ranganathan, Founder & CTO @ YugaByte Links: Kudu: Storage for Fast Analytics on Fast Data - https://kudu.apache.org/kudu.pdf Under the Hood: Building and open-sourcing RocksDB - https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-and-open-sourcing-rocksdb/10151822347683920/ The Log-Structured Merge-Tree (LSM-Tree) - https://www.cs.umb.edu/~poneil/lsmtree.pdf Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases - https://dl.acm.org/doi/epdf/10.1145/3035918.3056101
Welcome to our 5th episode. This is the second part of a two-part series where we go deep into the internals of Yugabyte with Karthik and Kannan. Yugabyte is a highly scalable and developer-friendly open source distributed SQL database. Yugabyte is built by an ex-Facebook team that wanted to bring what they learnt running one of the largest databases on the planet out into the open source world. One thing I find really fascinating about Yugabyte is that they are fully compatible with Postgres, Redis and Apache Cassandra, which makes it easy to replace a lot of infrastructure with just Yugabyte. Hope you enjoy the listen, and remember to subscribe for many more of these deep technical discussions. Our guests for this episode are: Kannan Muthukkaruppan, Founder & President, Product Dev. @ Yugabyte Karthik Ranganathan, Founder & CTO @ YugaByte Links: Kudu: Storage for Fast Analytics on Fast Data - https://kudu.apache.org/kudu.pdf Under the Hood: Building and open-sourcing RocksDB - https://www.facebook.com/notes/facebook-engineering/under-the-hood-building-and-open-sourcing-rocksdb/10151822347683920/ The Log-Structured Merge-Tree (LSM-Tree) - https://www.cs.umb.edu/~poneil/lsmtree.pdf Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases - https://dl.acm.org/doi/epdf/10.1145/3035918.3056101
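Since the show notes link the LSM-tree paper that underpins RocksDB (and Yugabyte's DocDB fork of it), here is a deliberately tiny Java sketch of the core idea: writes go into a sorted in-memory memtable and are flushed as immutable sorted files, and reads check the memtable first, then the files from newest to oldest. All names and the flush threshold are invented for illustration; real LSM engines add write-ahead logs, compaction, and bloom filters on top.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Minimal LSM-tree idea: sorted memtable in memory, immutable sorted files on disk. */
public class TinyLsm {
    private static final int MEMTABLE_LIMIT = 4;            // tiny on purpose
    private final TreeMap<String, String> memtable = new TreeMap<>();
    private final List<Path> sstables = new ArrayList<>();  // newest last
    private final Path dir;

    public TinyLsm(Path dir) throws IOException {
        this.dir = dir;
        Files.createDirectories(dir);
    }

    public void put(String key, String value) throws IOException {
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) {
            flush();                                         // sequential write, no in-place mutation
        }
    }

    public String get(String key) throws IOException {
        if (memtable.containsKey(key)) {
            return memtable.get(key);
        }
        for (int i = sstables.size() - 1; i >= 0; i--) {     // newest file wins
            for (String line : Files.readAllLines(sstables.get(i))) {
                int tab = line.indexOf('\t');                // assumes keys and values contain no tabs
                if (line.substring(0, tab).equals(key)) {
                    return line.substring(tab + 1);
                }
            }
        }
        return null;
    }

    private void flush() throws IOException {
        Path sst = dir.resolve("sstable-" + sstables.size() + ".txt");
        List<String> lines = new ArrayList<>();
        memtable.forEach((k, v) -> lines.add(k + "\t" + v)); // already sorted by the TreeMap
        Files.write(sst, lines);
        sstables.add(sst);
        memtable.clear();
    }
}
```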
An airhacks.fm conversation with Bela Ban belaban.blogspot.com about: C64 wasn't real, Atari was the way to go, Atari ST vs. Amiga wars, Pascal, Modula-2 and Modula 3, Atari had a nice IDE with 1MB RAM, War Games movie, contact list application as "hello, world", fixing Epson printer hexcodes, chess and tennis over programming, learning C was a step down from Modula, system programming and the fascination with immediate feedback, writing CORBA to CMIP bridges in GDMO, C++ templates are their own language, "C++ is crap", Java at the first World Wide Web conference in 1995 in ...Darmstadt, starting with oak, applets and NCSA Mosaic, Netscape server, extracting data from mainframes with Java over JNI, Cornell University research with Sun's Java 1.0, working with Ken Birman, Robbert van Renesse, Werner Vogels, Ensemble in Ocaml, replacing Ocaml with Java in the "Java Groups", Jim Waldo was leading the JINI project, Sun Microsystems and Cornell worked together to make Java Intelligent Network Infrastructure (JINI) reliable using Java Groups, leasing in JINI was revolutionary, the JINI message was changed several times, there was no elevator pitch for JINI, Sun tried to keep the JINI / Java Groups cooperation secret, A Note on Distributed Computing by Jim Waldo, the Eight Fallacies of Distributed Computing, JGroups on Sourceforge in 2000 (and still available), revival of JGroups at Fujitsu's Network Management System, the Sacha Labourey and Marc Fleury contact, writing JBoss Cache on unpaid vacation in 6 weeks, the Blue and Red Papers from Marc Fleury, the EJB Open Source System, Marc Fleury and paratroopers, JBoss Cache started as a tree and became a distributed map, meeting Manik Surtani in a taxi, JBoss Cache became Infinispan, JGroups is the communication layer of Infinispan, the interest in the CP of CAP resulted in RAFT, JGroups RAFT is used in production, there are many Paxos implementations and RAFT is a Paxos simplification, RAFT for kids in JBoss Distributed Singletons, useless but consistent systems, vector clocks are an inconvenient reconciliation system, JGroups is using RocksDB and MapDB, JGroups makes UDP and other protocols like RDMA reliable, JGroups is particularly efficient with many nodes, JGroups and the Sun Cluster Lab in Switzerland, running JGroups on 2000+ nodes at Gcloud, Project Loom and Fibers, mini sabbaticals for hype chasing, back to easy request-response with Java's Project Loom and Fibers, injecting JChannel in Quarkus, JGroups runs on Quarkus in native mode, KISS and JGroups - No Dependencies in JGroups, Bela's blog: belaban.blogspot.com
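For readers who have never seen JGroups, the episode's recurring subject, a minimal cluster "hello" looks roughly like the sketch below. It assumes the JGroups 4.x API (ReceiverAdapter and the Message constructor changed in 5.x), and the cluster name and payload are placeholders of mine.

```java
import org.jgroups.JChannel;
import org.jgroups.Message;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class JGroupsHello {
    public static void main(String[] args) throws Exception {
        // Default protocol stack (UDP-based); JGroups adds reliability, ordering and membership on top.
        JChannel channel = new JChannel();
        channel.setReceiver(new ReceiverAdapter() {
            @Override
            public void viewAccepted(View view) {
                System.out.println("Members: " + view.getMembers());
            }

            @Override
            public void receive(Message msg) {
                System.out.println("From " + msg.getSrc() + ": " + msg.getObject());
            }
        });
        channel.connect("demo-cluster");            // hypothetical cluster name
        channel.send(new Message(null, "hello"));   // null destination means multicast to all members
        Thread.sleep(2000);                         // give the message time to arrive
        channel.close();
    }
}
```

Run the same program in two terminals and each instance should print the other's membership view and the "hello" message.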
Apache Kafka 2.4 includes new Kafka Core developments and improvements to Kafka Streams and Kafka Connect, including MirrorMaker 2.0, RocksDB metrics, and more.
EPISODE LINKS
Read about what's new in Apache Kafka 2.4
Check out the Apache Kafka 2.4 release notes
Watch the video version of this podcast
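The RocksDB metrics mentioned above (KIP-471, if I recall correctly) are only recorded once the Streams metrics recording level is raised to DEBUG. A minimal configuration sketch follows; the application id, topic, and bootstrap servers are placeholders, not values from the episode.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

import java.util.Properties;

public class RocksDbMetricsConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "metrics-demo");        // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
        // RocksDB state-store metrics are only recorded at the DEBUG metrics level.
        props.put(StreamsConfig.METRICS_RECORDING_LEVEL_CONFIG, "DEBUG");

        StreamsBuilder builder = new StreamsBuilder();
        builder.table("some-input-topic");   // any stateful operation creates a RocksDB-backed store

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // The per-store RocksDB metrics then appear via JMX alongside the other Streams metrics.
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```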
Instagram engineer Michaël Figuière talks with Jeff Carpenter about the Cassandra abstraction layer his team maintains, how it helps development teams move faster, and when companies should consider creating their own abstractions. Highlights: 0:00 - Welcoming Michaël to the show and recapping past conversations on Cassandra usage at Instagram, including the RocksDB storage engine we discussed with Dikang Gu and the geographic replication approach that Andrew Whang presented at the 2019 DataStax Accelerate conference.
In this episode, Jeff Carpenter talks with Dikang Gu about the origins of Cassandra at Instagram, an update on how the adoption of RocksDB as a storage engine for Cassandra is progressing, geographic data partitioning, and how his team is providing Cassandra as a Service inside Instagram.
In this episode of the Data Show, I spoke with Dhruba Borthakur (co-founder and CTO) and Shruti Bhat (SVP of Product) of Rockset, a startup focused on building solutions for interactive data science and live applications. Borthakur was the founding engineer of HDFS and creator of RocksDB, while Bhat is an experienced product and marketing […]
Infrastructure Week, Episode 2! Charity and Jay sit down for a discussion on her career and a deep dive into a database incident. You'll get some interesting thoughts on how monitoring has changed in operations. Charity is cofounder and CEO of Honeycomb.io, a startup aimed at debugging complex systems. (“It’s like strace for systems!”) Previously, Charity ran infrastructure at Parse and was an engineering manager at Facebook. She also worked with the RocksDB team to build and deploy the world’s first Mongo + Rocks in production. She likes single malt scotch. https://honeycomb.io https://twitter.com/mipsytipsy
Mike Perham is back for his 4th appearance to talk about his new project Faktory, a new background job system that’s aiming to bring the best practices developed over the last five years in Sidekiq to every programming language. We catch up with Mike on the continued success and model of Sidekiq, the future of background jobs, his thoughts on RocksDB in Faktory vs BoltDB, Redis, or SQLite, how he plans to support Sidekiq for the next 10 years, and his thoughts on Faktory being a SaaS option in the future.
My guest is Ivan Kruglov, a programmer at Booking.com (http://booking.com/). For some time Ivan worked on the hotel search subsystem, and that is exactly what this episode is about. Booking.com's (http://booking.com/) development office is located in Amsterdam, and Ivan started by telling how he ended up there. He shared his experience of going through interviews and preparing for them at various Russian, European, and American companies. We discussed the architecture of hotel search: how it works overall, which components and services it consists of, and which languages/platforms it is written in. Ivan described how the system evolved, how it was structured in the beginning, which bottlenecks appeared, and how they were resolved over time. Ivan talked about how they moved away from MySQL databases to an embedded solution, namely RocksDB. He explained materializing data for search, sharding the data, and processing queries efficiently. Ivan also talked a bit about how they roll out changes to production, and about monitoring and maintaining the system. Links to resources on the topics of this episode: * Ivan's talk "Search architecture at Booking.com" from Highload++ (video (https://www.youtube.com/watch?v=VdRmsOAvv0A&list=PLiBYz3OQubLryzLMpCrSZ7rS87sxo5ZO_&index=15), slides (https://www.slideshare.net/profyclub_ru/bookingcom-bookingcom)) * RocksDB (http://rocksdb.org/). A persistent key-value store for fast storage environments Enjoyed the episode? Support the podcast on patreon.com/KSDaemon (https://www.patreon.com/KSDaemon), as well as with a retweet, a post, or simply by telling your friends!
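Since the episode's key move is replacing MySQL with an embedded store, a minimal RocksJava sketch shows what "embedded" means in practice: the store runs inside the application process, with no network hop. The path, key, and value below are placeholders, not Booking.com's actual schema.

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

import java.nio.charset.StandardCharsets;

public class EmbeddedKvSketch {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();  // loads the native library bundled with the rocksdbjni artifact

        try (Options options = new Options().setCreateIfMissing(true);
             RocksDB db = RocksDB.open(options, "/tmp/hotel-search-db")) {   // placeholder path

            // No network round trip: reads and writes go straight to the local LSM tree.
            db.put("hotel:42".getBytes(StandardCharsets.UTF_8),
                   "{\"name\":\"Example Hotel\",\"city\":\"Amsterdam\"}".getBytes(StandardCharsets.UTF_8));

            byte[] value = db.get("hotel:42".getBytes(StandardCharsets.UTF_8));
            System.out.println(value == null ? "miss" : new String(value, StandardCharsets.UTF_8));
        }
    }
}
```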