Podcasts about stream processing

  • 49 PODCASTS
  • 113 EPISODES
  • 48m AVG DURATION
  • 1 MONTHLY NEW EPISODE
  • Apr 11, 2025 LATEST

POPULARITY (chart: 2017-2024)


Best podcasts about stream processing

Latest podcast episodes about stream processing

IBS Intelligence Podcasts
EP860: The value of stream processing and real-time analytics in financial services

IBS Intelligence Podcasts

Play Episode Listen Later Apr 11, 2025 13:34


Ben Gamble, Field CTO, Ververica. Investigating the use cases for stream processing technology, such as fraud and anomaly detection, we quantify the value of real-time data analytics for business intelligence and decision-making. Ben Gamble, Field CTO of Ververica, talks to Robin Amlôt of IBS Intelligence about the convergence of technologies such as agentic AI and stream processing, and how it can help in the drive towards personalisation.

In Numbers We Trust - Der Data Science Podcast
#66: Developer vs. Data Scientist mit Andy Grunwald und Wolfgang Gassler

In Numbers We Trust - Der Data Science Podcast

Play Episode Listen Later Feb 20, 2025 63:42


Why is there constant friction between data scientists and developers? In this episode we bring in reinforcements from Andy and Wolfi of the Engineering Kiosk podcast to get to the bottom of this question. We talk about typical clichés and why they lead to conflict. Together we discuss which skills help both camps work together harmoniously in the end, instead of slowing each other down. Summary: Clichés and conflicts: stereotypes about data scientists (Jupyter fans, doctorates) and developers (perfectionism, fear of black boxes). Team organization: cross-functional teams vs. separate departments (pros and cons, the agency model). Typical challenges: handing prototypes over to development, understanding SLAs/response times, database selection. Skill set and collaboration: generalist grounding in DevOps and software architecture, an open mindset. Links: Engineering Kiosk Podcast: https://engineeringkiosk.dev/ Andy Grunwald on LinkedIn: https://www.linkedin.com/in/andy-grunwald-09aa265a/ Wolfgang Gassler on LinkedIn: https://www.linkedin.com/in/wolfganggassler/ [Engineering Kiosk] #179 MLOps: Machine Learning in die Produktion bringen mit Michelle Golchert und Sebastian Warnholz https://engineeringkiosk.dev/podcast/episode/179-mlops-machine-learning-in-die-produktion-bringen-mit-michelle-golchert-und-sebastian-warnholz/ [Engineering Kiosk] #178 Code der bewegt: Infotainmentsysteme auf Kreuzfahrtschiffen mit Sebastian Hammerl https://engineeringkiosk.dev/podcast/episode/178-code-der-bewegt-infotainmentsysteme-auf-kreuzfahrtschiffen-mit-sebastian-hammerl/ [Engineering Kiosk] #177 Stream Processing & Kafka: Die Basis moderner Datenpipelines mit Stefan Sprenger https://engineeringkiosk.dev/podcast/episode/177-stream-processing-kafka-die-basis-moderner-datenpipelines-mit-stefan-sprenger/ [Data Science Deep Dive] #30: Agile Softwareentwicklung im Data-Science-Kontext https://www.podbean.com/ew/pb-mvspn-1482ea4 [Data Science Deep Dive] #23: Unsexy aber wichtig: Tests und Monitoring https://www.podbean.com/ew/pb-vxp58-13f311a [Data Science Deep Dive] #20: Ist Continuous Integration (CI) ein Muss für Data Scientists? https://www.podbean.com/ew/pb-4mkqh-13bb3b3 Questions, feedback, and topic requests are welcome at podcast@inwt-statistics.de

Engineering Kiosk
#177 Stream Processing & Kafka: Die Basis moderner Datenpipelines mit Stefan Sprenger

Engineering Kiosk

Play Episode Listen Later Jan 7, 2025 67:40


Data streaming and stream processing with Apache Kafka and the surrounding ecosystem. A great many processes in software development, and in data processing in particular, don't have to run at request time; they can be handled asynchronously or in a decentralized way. Terms like batch processing or message queueing / pub-sub are familiar here. But there is a third player in this game: stream processing. Apache Kafka is its flagship, the distributed event streaming platform that is usually the first one named. But what actually is stream processing and how does it differ from batch processing or message queuing? How does Kafka work and why is it so successful and performant? What are brokers, topics, partitions, producers and consumers? What does change data capture mean and what is a sliding window? What do you have to watch out for, and what can go wrong, when you want to write and read a message? The answers, and much more, come from our guest Stefan Sprenger. Bonus: how to describe stream processing with a breakfast table for five-year-olds. You can find our current advertising partners at https://engineeringkiosk.dev/partners Quick feedback on the episode:
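The episode's core vocabulary (brokers, topics, partitions, producers, consumers) maps directly onto a few lines of client code. Below is a minimal sketch using the kafka-python client; the broker address, topic name, and consumer group are placeholder values for illustration, not anything taken from the episode.

```python
# Minimal producer/consumer sketch with kafka-python (pip install kafka-python).
# Broker address, topic name, and group id are illustrative placeholders.
import json
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"   # a single broker for the sketch
TOPIC = "orders"            # a topic is split into partitions spread across brokers

# Producer: writes events (messages) to a topic.
producer = KafkaProducer(
    bootstrap_servers=BROKER,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"order_id": 42, "amount": 9.99})
producer.flush()

# Consumer: reads the topic from the beginning as part of a consumer group.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKER,
    group_id="demo-group",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```

The consumer group id is what lets several consumer instances share the partitions of a topic between them, which is the basis of Kafka's horizontal scaling discussed in the episode.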

The MongoDB Podcast
EP. 248 Stream Processing Simplified: Joe Niemiec on MongoDB's Latest Innovations

The MongoDB Podcast

Play Episode Listen Later Nov 29, 2024 10:27


Join us for an insightful conversation with Joe Niemiec, Senior Product Manager for Streaming at MongoDB, recorded live at MongoDB Local London. In this video, Joe explains the fundamentals of stream processing and how it empowers developers to run continuous aggregation queries on real-time data. Discover practical use cases, including monitoring oil well pumps and smart grid applications, that showcase the power of stream processing in various industries. Joe also discusses the latest enhancements, such as expanded regional support, VPC peering for secure connections, and improved Kafka integration. Whether you're new to MongoDB or looking to enhance your data processing capabilities, this video is packed with valuable information to help you get started with stream processing!
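The "continuous aggregation queries on real-time data" Joe describes can be pictured as a windowed aggregation that updates as each event arrives. The sketch below shows that idea in plain Python with made-up pump readings; it is a conceptual illustration of the technique, not MongoDB's Atlas Stream Processing syntax.

```python
# Conceptual sketch of a continuous aggregation (tumbling window) in plain Python.
# Readings, field names, and the 10-second window are invented for illustration.
from collections import defaultdict

WINDOW_SECONDS = 10

def window_key(ts: float) -> int:
    """Assign a timestamp to its tumbling window (returns the window start time)."""
    return int(ts // WINDOW_SECONDS) * WINDOW_SECONDS

def aggregate(stream):
    """Continuously compute the average pressure per pump per window."""
    sums = defaultdict(lambda: [0.0, 0])   # (pump, window) -> [sum, count]
    for reading in stream:                 # reading: {"pump", "ts", "pressure"}
        key = (reading["pump"], window_key(reading["ts"]))
        sums[key][0] += reading["pressure"]
        sums[key][1] += 1
        total, count = sums[key]
        yield {"pump": key[0], "window_start": key[1], "avg_pressure": total / count}

events = [
    {"pump": "well-7", "ts": 0.5, "pressure": 101.2},
    {"pump": "well-7", "ts": 3.1, "pressure": 99.8},
    {"pump": "well-7", "ts": 12.0, "pressure": 102.4},  # falls into the next window
]
for update in aggregate(events):
    print(update)
```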

Data Engineering Podcast
X-Ray Vision For Your Flink Stream Processing With Datorios

Data Engineering Podcast

Play Episode Listen Later Jun 9, 2024 42:22


Summary Streaming data processing enables new categories of data products and analytics. Unfortunately, reasoning about stream processing engines is complex and lacks sufficient tooling. To address this shortcoming, Datorios created an observability platform for Flink that brings visibility to the internals of this popular stream processing system. In this episode Ronen Korman and Stav Elkayam discuss how the increased understanding provided by purpose-built observability improves the usefulness of Flink. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management This episode is supported by Code Comments, an original podcast from Red Hat. As someone who listens to the Data Engineering Podcast, you know that the road from tool selection to production readiness is anything but smooth or straight. In Code Comments, host Jamie Parker, Red Hatter and experienced engineer, shares the journey of technologists from across the industry and their hard-won lessons in implementing new technologies. I listened to the recent episode "Transforming Your Database" and appreciated the valuable advice on how to approach the selection and integration of new databases in applications and the impact on team dynamics. There are 3 seasons of great episodes and new ones landing everywhere you listen to podcasts. Search for "Code Comments" in your podcast player or go to dataengineeringpodcast.com/codecomments (https://www.dataengineeringpodcast.com/codecomments) today to subscribe. My thanks to the team at Code Comments for their support. Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst is an end-to-end data lakehouse platform built on Trino, the query engine Apache Iceberg was designed for, with complete support for all table formats including Apache Iceberg, Hive, and Delta Lake. Trusted by teams of all sizes, including Comcast and Doordash. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino. Your host is Tobias Macey and today I'm interviewing Ronen Korman and Stav Elkayam about pulling back the curtain on your real-time data streams by bringing intuitive observability to Flink streams Interview Introduction How did you get involved in the area of data management? Can you describe what Datorios is and the story behind it? Data observability has been gaining adoption for a number of years now, with a large focus on data warehouses. What are some of the unique challenges posed by Flink? How much of the complexity is due to the nature of streaming data vs. the architectural realities of Flink? How has the lack of visibility into the flow of data in Flink impacted the ways that teams think about where/when/how to apply it? How have the requirements of generative AI shifted the demand for streaming data systems? What role does Flink play in the architecture of generative AI systems? Can you describe how Datorios is implemented? How has the design and goals of Datorios changed since you first started working on it? How much of the Datorios architecture and functionality is specific to Flink and how are you thinking about its potential application to other streaming platforms? Can you describe how Datorios is used in a day-to-day workflow for someone building streaming applications on Flink?
What are the most interesting, innovative, or unexpected ways that you have seen Datorios used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datorios? When is Datorios the wrong choice? What do you have planned for the future of Datorios? Contact Info Ronen LinkedIn (https://www.linkedin.com/in/ronen-korman/) Stav LinkedIn (https://www.linkedin.com/in/stav-elkayam-118a2795/?originalSubdomain=il) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story. Links Datorios (https://datorios.com/) Apache Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57) ChatGPT-4o (https://openai.com/index/hello-gpt-4o/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

The GeekNarrator
Volt Active Data: Low Latency Stream processing

The GeekNarrator

Play Episode Listen Later Mar 8, 2024 48:08


In this episode of The GeekNarrator podcast, our host Kaivalya talks to Seeta Somagani from Volt Active Data, a low latency stream processing platform. They discuss fascinating topics about what low latency stream processing means, the different guarantees that Volt Active Data provides, and the various problems it can solve. They delve into the evolution of VoltDB to Volt Active Data, real-time data processing use cases, the high-level architecture, and how the platform effectively addresses high-concurrency challenges. This is a must-listen for anyone interested in understanding latency-critical applications, data processing, and high performance computing. Chapters: 00:00 Welcome to The GeekNarrator Podcast with Special Guest from Volt Active Data 00:41 Introduction 01:45 The Evolution of VoltDB to Volt Active Data 06:13 Exploring Real-Time Data Processing and Use Cases 08:25 Addressing High-Concurrency Challenges in Various Industries 12:57 High-Level Architecture of Volt Active Data 19:26 Understanding Stored Procedures and Data Processing in Volt 22:48 Practical Application: Tracking Data Usage with Volt Active Data 25:16 Diving into Replicated and Partitioned Tables 25:44 Exploring Event Processing and Exporting 26:57 Understanding Stored Procedures and Performance 29:03 Partitioning Strategies and Recommendations 31:39 Ensuring Determinism in Stored Procedures 35:02 Handling Complex Requirements with Compound Procedures 37:25 Fault Tolerance and Data Replication Strategies 40:44 Exploring Use Cases for VoltActiveData 43:30 The Future of Streaming and VoltActiveData's Role 47:05 Closing Remarks and How to Learn More Volt Active Data: https://www.voltactivedata.com/use-cases/activesd-streaming-data/ =============================================================================== For discount on the below courses: Appsync: https://appsyncmasterclass.com/?affiliateId=41c07a65-24c8-4499-af3c-b853a3495003 Testing serverless: https://testserverlessapps.com/?affiliateId=41c07a65-24c8-4499-af3c-b853a3495003 Production-Ready Serverless: https://productionreadyserverless.com/?affiliateId=41c07a65-24c8-4499-af3c-b853a3495003 Use the button, Add Discount and enter "geeknarrator" discount code to get 20% discount. =============================================================================== Follow me on Linkedin and Twitter: https://www.linkedin.com/in/kaivalyaapte/ and https://twitter.com/thegeeknarrator If you like this episode, please hit the like button and share it with your network. Also please subscribe if you haven't yet. Database internals series: https://youtu.be/yV_Zp0Mi3xs Popular playlists: Realtime streaming systems: https://www.youtube.com/playlist?list=PLL7QpTxsA4se-mAKKoVOs3VcaP71X_LA- Software Engineering: https://www.youtube.com/playlist?list=PLL7QpTxsA4sf6By03bot5BhKoMgxDUU17 Distributed systems and databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4sfLDUnjBJXJGFhhz94jDd_d Modern databases: https://www.youtube.com/playlist?list=PLL7QpTxsA4scSeZAsCUXijtnfW5ARlrsN Stay Curious! Keep Learning! #sql #streamprocessing #java #acid

CZPodcast
CZ Podcast 314 - Stream processing a Apache Flink

CZPodcast

Play Episode Listen Later Mar 2, 2024 68:00


Data processing is starting to shift from batch processing to stream processing, and Apache Flink is becoming the go-to solution for this type of data processing. In this episode we met with David Moravek and Jan Svoboda from Confluent.io, who have hands-on experience with Apache Flink. Alongside Dagi, the questions were asked by Vaclav Brodec, who is responsible at Ataccama for the next-generation engine for ensuring data quality.

The MongoDB Podcast
Ep. 204 Streamlining Data: Inside MongoDB's Atlas Stream Processing with Kenny Gorman

The MongoDB Podcast

Play Episode Listen Later Feb 13, 2024 17:26


In this episode of the MongoDB Podcast, host Michael Lynn is thrilled to welcome Kenny Gorman, Head of Streaming Products at MongoDB, for an in-depth discussion on the advancements in Atlas Stream Processing. Gorman shares the journey of developing MongoDB's stream processing capabilities, focusing on the transformative impact of real-time data processing across various industries, from IoT to marketing. With an emphasis on the seamless integration with MongoDB's aggregation framework, this episode illuminates how developers can effortlessly transition their existing aggregation statements into powerful streaming pipelines. Read the blog: https://mdb.link/asp-blog | Read the docs: https://mdb.link/asp-docs

Real-Time Analytics with Tim Berglund
Revolutionizing Stream Processing with Arroyo's Co-Founder Micah Wylde | Ep. 40

Real-Time Analytics with Tim Berglund

Play Episode Listen Later Feb 12, 2024 34:51


Follow: https://stree.ai/podcast | Sub: https://stree.ai/sub | New episodes every Monday! Join us as we dive into the world of stream processing with Micah Wylde, CEO and co-founder of Arroyo. Discover how Arroyo, a cloud-first SQL native stream processing framework, addresses the challenges of previous generations of stream processing technologies. Learn about its unique approach to making stream processing accessible to non-experts and how it aims to revolutionize real-time data analysis. Whether you're a developer, data scientist, or just curious about the future of stream processing, this episode is packed with insights into Arroyo's design, goals, and how it's changing the game.

Rust in Production
Rust in Production Ep 4 - Arroyo's Micah Wylde

Rust in Production

Play Episode Listen Later Jan 25, 2024 55:50


In this episode, we have Micah Wylde from Arroyo as our guest. Micah introduces us to Arroyo, a real-time data processing engine that simplifies stream processing for data engineers using Rust. They explain how Arroyo enables users to write SQL queries with Rust user-defined functions on top of streaming data, highlighting the advantages of real-time data processing and discussing the challenges posed by competitors like Apache Flink. Moving on, we dive into the use of Rust in Arroyo and its benefits in terms of performance and memory safety. We explore the complementarity of workflow engines and stream processors and examine Arroyo's approach to real-time SQL and its compatibility with Postgres. Micah delves into memory and lifetime concerns and elaborates on how Arroyo manages them in its storage layer. Shifting gears, we explore the use of the Tokio framework in the Arroyo system and how it has enhanced speed and efficiency. Micah shares insights into the challenges and advantages of utilizing Rust, drawing from their experiences with Arroyo projects. Looking ahead, we discuss the future of the Rust ecosystem, addressing the current state of the Rust core and standard library, as well as the challenges of interacting with other languages using FFI or dynamically loading code. We touch upon Rust's limitations regarding a stable ABI and explore potential solutions like WebAssembly. We also touch upon industry perceptions of Rust, investor perspectives, and the hiring process for Rust engineers. The conversation takes us through the crates used in the Arroyo system, our wishlist for Rust ecosystem improvements, and the cost-conscious nature of companies that make Rust an attractive choice in the current macroeconomic environment. As we wrap up, we discuss the challenges Rust faces in competing with slower Java systems and ponder the potential for new languages to disrupt the trend in the future. We touch upon efficiency challenges in application software and the potential for a new language to emerge in this space. We delve into the increasing interest in using Rust in data science and the promising prospects of combining Rust with higher-level languages. Finally, we discuss the importance of fostering a welcoming and drama-free Rust community. I would like to thank Micah for joining us today and sharing their insights. To find more resources related to today's discussion, please refer to the show notes. Stay tuned for our next episode, and thank you for listening!

London Tech Talk
アメリカ在住CTO のキャリアと人生 (Tomohisa)

London Tech Talk

Play Episode Listen Later Dec 23, 2023 61:00


Our guest is Tomohisa, who works at J-Tech Creations, Inc. in the United States as CTO and still writes application code. We talked about his career as an engineer, his work as a CTO, and how he came to the US. PC-98 / DDIA Chapter 6: Partitioning / DDIA Chapter 11: Stream Processing / www.sekiban.dev / Twitter / Linkedin

London Tech Talk
DDIA Ch11: Stream Processing (Tomohisa)

London Tech Talk

Play Episode Listen Later Dec 16, 2023 54:17


"Designing Data-Intensive Applications"、通称 ”DDIA" 本の Ch11 を読んで感想を語りました。 ⁠⁠⁠⁠⁠⁠⁠⁠Amazon.co.jp (英語版)⁠⁠⁠⁠⁠⁠⁠⁠ ⁠⁠Amazon.co.jp (日本語版)⁠⁠⁠⁠⁠⁠⁠⁠ ⁠⁠⁠⁠⁠⁠⁠⁠Designing Data-Intensive Applications⁠⁠⁠ Youtube - Greg Young CQRS and Event Sourcing What is Change Data Capture? - Confluent Beamery Hacking Talent - Kafka ITエンジニアの読書について。- 株式会社ジェイテックジャパン Tomohisa Takaoka Linkedin Twitter 関数型で表現するイベントソーシングの実装とその教育 C#とCosmosDBによる自作イベントソーシングフレームワークの設計とコンセプト Ken さんの関連実績 IoT デバイスのログ基盤 (Kinesis Stream) 広告配信のリアルタイムログ (Kinesis Stream) Platform Engineer: アプリケーションログ (Apache Kafka)

Data Engineering Podcast
Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable

Data Engineering Podcast

Play Episode Listen Later Oct 15, 2023 68:28


Summary Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It's the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it's real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free! As more people start using AI for projects, two things are clear: It's a rapidly advancing field, but it's tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. . Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES (https://Neo4j.com/NODES). Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable Interview Introduction How did you get involved in the area of data management? 
Can you describe what Decodable is and the story behind it? What are the notable changes to the Decodable platform since we last spoke? (October 2021) What are the industry shifts that have influenced the product direction? What are the problems that customers are trying to solve when they come to Decodable? When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL? What are the developer experience challenges that are particular to working with streaming data? How have you worked to address that in the Decodable platform and interfaces? As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced? What are the most interesting, innovative, or unexpected ways that you have seen Decodable used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable? When is Decodable the wrong choice? What do you have planned for the future of Decodable? Contact Info esammer (https://github.com/esammer) on GitHub LinkedIn (https://www.linkedin.com/in/esammer/) Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com)) with your story. 
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers Links Decodable (https://www.decodable.co/) Podcast Episode (https://www.dataengineeringpodcast.com/decodable-streaming-data-pipelines-sql-episode-233/) Flink (https://flink.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/) Debezium (https://debezium.io/) Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114/) Kafka (https://kafka.apache.org/) Redpanda (https://redpanda.com/) Podcast Episode (https://www.dataengineeringpodcast.com/vectorized-red-panda-streaming-data-episode-152/) Kinesis (https://aws.amazon.com/kinesis/) PostgreSQL (https://www.postgresql.org/) Podcast Episode (https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42/) Snowflake (https://www.snowflake.com/en/) Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/) Databricks (https://www.databricks.com/) Startree (https://startree.ai/) Pinot (https://pinot.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/pinot-embedded-analytics-episode-273/) Rockset (https://rockset.com/) Podcast Episode (https://www.dataengineeringpodcast.com/rockset-serverless-analytics-episode-101/) Druid (https://druid.apache.org/) InfluxDB (https://www.influxdata.com/) Samza (https://samza.apache.org/) Storm (https://storm.apache.org/) Pulsar (https://pulsar.apache.org/) Podcast Episode (https://www.dataengineeringpodcast.com/pulsar-fast-and-scalable-messaging-with-rajan-dhabalia-and-matteo-merli-episode-17) ksqlDB (https://ksqldb.io/) Podcast Episode (https://www.dataengineeringpodcast.com/ksqldb-kafka-stream-processing-episode-122/) dbt (https://www.getdbt.com/) GitHub Actions (https://github.com/features/actions) Airbyte (https://airbyte.com/) Singer (https://www.singer.io/) Splunk (https://www.splunk.com/) Outbox Pattern (https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/) The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)

The New Stack Podcast
Kinesis, Kafka and Amazon Managed Service for Apache Flink

The New Stack Podcast

Play Episode Listen Later Sep 12, 2023 27:07


Apache Flink is an open-source framework and distributed processing engine designed for data analytics. It excels at handling tasks such as data joins, aggregations, and ETL (Extract, Transform, Load) operations. Moreover, it supports advanced real-time techniques like complex event processing. In this episode, Deepthi Mohan and Nagesh Honnalii from AWS discussed Apache Flink and the Amazon Managed Service for Apache Flink (MSF) with our host, Alex Williams. MSF is a service that caters to customers with varying infrastructure preferences. Some prefer complete control, while others want AWS to handle all infrastructure-related aspects. Use cases for MSF can be grouped into three categories. First, there's streaming ETL, which involves tasks like log aggregation for later auditing. Second, it supports real-time analytics, enabling customers to create dashboards for tasks like fraud detection. Third, it handles complex event processing, where data from multiple sources is joined and aggregated to extract meaningful insights. The origins of MSF trace back to the evolution of real-time data services within AWS. In 2013, AWS introduced Amazon Kinesis, while the open-source community developed Apache Kafka. These services paved the way for MSF by highlighting the need for real-time data processing. To provide more flexibility, AWS launched Kinesis Data Analytics in 2016, allowing customers to write code in JVM-based languages like Java and Scala. In 2018, AWS decided to incorporate Apache Flink into its Kinesis Data Analytics offering, leading to the birth of MSF. Today, thousands of customers use MSF, and AWS continues to enhance its offerings in the real-time data processing space, including the launch of Amazon MSK (Managed Streaming for Apache Kafka). To align with its foundation on Flink, AWS rebranded Kinesis Data Analytics for Apache Flink to Amazon Managed Service for Apache Flink, making it clearer for customers. Learn more from The New Stack about AWS and Apache Flink: Apache Flink for Real Time Data Analysis | Apache Flink for Unbounded Data Streams | 3 Reasons Why You Need Apache Flink for Stream Processing

Category Visionaries
DeVaris Brown, CEO and Co-Founder of Meroxa: Over $19 Million Raised to Empower Engineering Teams with Better Stream Processing

Category Visionaries

Play Episode Listen Later Sep 6, 2023 38:20


In today's episode of Category Visionaries, we speak with DeVaris Brown, CEO and Co-Founder of Meroxa, a stream processing data application platform that's raised over $19 Million in funding. Topics Discussed: DeVaris' background before launching Meroxa at some of technology's biggest names, including Twitter, Microsoft and Zendesk Growing up inspired by his grandparents' entrepreneurial endeavors, and the dot-com, to one day become a tech entrepreneur How Meroxa helps developers move data around with ease, removing the need for specialized knowledge across different data systems Working with Government Agencies including the Air Force, Space Force, and NASA, handling complex real-time data challenges. Deciding to work with government agencies while aiming to ensure any contributions align with their standards.   Favorite book:  Extreme Ownership: How U.S. Navy SEALs Lead and Win

Category Visionaries
Yingjun Wu, CEO and Co-Founder of RisingWave Labs: $40 Million Raised to Make Stream Processing Simple, Affordable, and Accessible

Category Visionaries

Play Episode Listen Later Jul 26, 2023 21:46


In today's episode of Category Visionaries, we speak with Yingjun Wu, CEO and Co-Founder of RisingWave Labs, a SQL stream processing platform that's raised $40 Million in funding, about why it's time for a better way to leverage the potential of their data. Offering real-time insights, simplified with advanced SQL processing, RisingWave Labs helps their clients make actionable decisions based on the most up-to-date information available. We also speak about Yingjun's background as a software engineer working at big-name brands like Amazon and IBM, the significance of SQL as a tool for data analysis and a way to garner attention for their product, the challenges RisingWave Labs have experienced in their business growth journey, and why they prioritize education and user-friendly experiences. Topics Discussed: Yingjun's background as a software engineer, and the lessons he brought from some of the sector's biggest names as RisingWave Labs' CEO RisingWave Labs product offering of SQL stream processing for real-time data, and what it means for their clients How using the ‘trendy' programming language ‘Rust' helped RisingWave Labs generate growth even before they deployed a real marketing campaign The challenge of dealing with a wide variety of potential customer priorities, and how to allocate resources in response Why RisingWave Labs maintains a focus on education and user-friendly experiences to drive sustained adoption   Favorite book:  Guns, Germs, and Steel: The Fates of Human Societies

Real-Time Analytics with Tim Berglund
Diving Deep into Apache Flink with Robert Metzger | Ep. 14

Real-Time Analytics with Tim Berglund

Play Episode Listen Later Jul 10, 2023 30:48


Follow: https://stree.ai/podcast | Sub: https://stree.ai/sub | New episodes every Monday! In part two of the "Real-Time Analytics" podcast, Robert Metzger, the PMC chair of Apache Flink, elaborates on using Flink as a developer. Metzger discusses the spectrum of APIs in Flink, ranging from expressive APIs to easy-to-use APIs. He mentions the process function, a low-level, flexible API that exposes basic building blocks of Flink, such as real-time events, state, and event time. Metzger also speaks about the windowing API of Flink and the Async I/O operator. He further details how Flink users can work with a combination of SQL and Java code in the data stream API. You won't want to miss this episode! Flink Deployments At Decodable: https://www.decodable.co/blog/flink-deployments-at-decodable | 3 Reasons Why You Need Apache Flink for Stream Processing: https://thenewstack.io/3-reasons-why-you-need-apache-flink-for-stream-processing/#:~:text=For%20example%2C%20Uber%20uses%20Flink,streaming%20data%20at%20massive%20scale.
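The episode covers Flink's Java APIs, from the low-level process function up to SQL. As a rough illustration of the SQL end of that spectrum, here is a small PyFlink Table API sketch (assuming the apache-flink package is installed); the table definition and the built-in datagen connector are stand-ins for a real source such as Kafka, not anything from the episode.

```python
# A small PyFlink sketch of the "SQL end" of Flink's API spectrum.
# Assumes `pip install apache-flink`; the datagen table is a stand-in source.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A bounded demo source; in practice this would be a Kafka-backed table.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        url     STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '10',
        'number-of-rows' = '50'
    )
""")

# A continuous query: the result is updated as new rows stream in.
t_env.execute_sql("""
    SELECT user_id, COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id
""").print()
```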

The GeekNarrator
Batch vs Realtime Stream Processing - A Deep Dive with Phil Fried from Estuary

The GeekNarrator

Play Episode Listen Later Jul 3, 2023 63:03


In this video I talk to Philip Fried from Estuary about Batch vs Realtime Stream Processing. Philip brings a ton of experience in the world of data processing and has shared some of the best practices in implementing these systems. We dive deep into the world of data processing, covering batch and streaming systems, their challenges, tradeoffs and use cases. Chapters: 00:00 Batch vs Realtime Stream Processing 03:25 What is Batch and Realtime processing? 18:29 How does Batch and Realtime compare in terms of Latency and Throughput? 27:24 Where is the cost saving coming from? Compute? Storage? or Network? 31:38 Moving from Batch to Stream processing 37:50 How is Idempotency implemented in Streaming systems? 48:50 How do we approach Schema evolution in Batch and Streaming systems? 57:16 Summary - key points to keep in mind Do check out Estuary if you deal with a ton of data, and don't want to deal with the painful operations, infrastructure management, schema migrations etc and only want to focus on building highly scalable and resilient applications. References: Estuary: https://estuary.dev/ Flow documentation: https://docs.estuary.dev If you like this video please hit the like button, share it with your network (whoever works with a ton of data) and subscribe to the channel. Feel free to watch related episodes in the playlist:    • Distributed Syste...   Modern Databases:    • Modern Databases   Software Engineering:    • Software Engineering   Distributed Systems:    • Distributed Systems   Cheers, The GeekNarrator
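One of the chapter topics, idempotency in streaming systems, usually comes down to making redelivered events harmless. A common pattern, shown here only as a generic sketch and not as how Estuary's Flow does it, is to deduplicate on a stable event id before applying a side effect.

```python
# One common way to make a streaming consumer idempotent: deduplicate on a
# stable event id before applying side effects. Generic sketch, not Estuary's API.
processed_ids = set()   # in production this would be a durable store (e.g. a keyed table)

def apply_once(event, sink):
    """Apply an event's side effect at most once, even if it is redelivered."""
    if event["event_id"] in processed_ids:
        return False                       # duplicate delivery: safely ignored
    sink.append(event["payload"])          # the actual side effect
    processed_ids.add(event["event_id"])   # record the id only after success
    return True

sink = []
deliveries = [
    {"event_id": "a1", "payload": "debit $10"},
    {"event_id": "a1", "payload": "debit $10"},   # retry of the same event
    {"event_id": "b2", "payload": "credit $10"},
]
for e in deliveries:
    apply_once(e, sink)
print(sink)   # ['debit $10', 'credit $10'] - the retry did not double-apply
```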

The Six Five with Patrick Moorhead and Daniel Newman
The Six Five On the Road — an Inside Look at MongoDB .local NYC

The Six Five with Patrick Moorhead and Daniel Newman

Play Episode Listen Later Jun 23, 2023 16:06


On this episode of The Six Five On The Road, hosts Daniel Newman and Patrick Moorhead provide their analysis and opinion on the Opening Keynote at the MongoDB .local NYC event. Analysis summary: The future of data is hybrid and multi-cloud. MongoDB is uniquely positioned to lead in this new era of AI & data. The company is investing heavily in innovation, including new products and services, to help customers succeed in the hybrid cloud world. MongoDB is committed to providing the best platform for the developer community and the C-suite, which is essential for success in the modern data landscape.

Engenharia de Dados [Cast]
Cloudera CDP & Stream Processing para Real-Time Analytics com André Araújo, Field Engineer, Data in Motion na Cloudera

Engenharia de Dados [Cast]

Play Episode Listen Later Jun 22, 2023 58:00


In today's episode, Luan Moreno & Mateus Oliveira interviewed André Araújo, currently Field Engineer, Data in Motion at Cloudera. CDP is Cloudera's enterprise data platform, focused on versatility across use cases such as a streaming platform, featuring technologies like Apache Kafka and Apache Flink. With CSP, you get the following benefits: Apache Kafka - the market-leading data streaming storage platform; Apache Flink - a data processing platform. In this conversation we talk about: the Cloudera data platform; the Cloudera streaming platform. Cloudera has always been one of the most widely used platforms in the market, and now it has a new version and use cases that cover many scenarios, such as CSP (Cloudera Stream Platform). André Araújo = Linkedin Cloudera = webpage Luan Moreno = https://www.linkedin.com/in/luanmoreno/

The Data Stack Show
140: Stream Processing for Machine Learning with Davor Bonaci of DataStax

The Data Stack Show

Play Episode Listen Later May 31, 2023 61:30


Highlights from this week's conversation include: Davor's journey from Google and what he was building there (3:32), How work in stream processing changed Davor's journey (5:10), Analytical predictive models and infrastructure (9:39), How Kaskada serves as a recommendation engine with data (14:05), Kaskada's user experience as an event processing platform (20:06), Enhancing typical feature store architecture to achieve better results (23:34), What is needed to improve stream and batch processes (27:39), Using another syntax instead of SQL (36:44), DataStax acquiring Kaskada and what will come from that merger (40:24), Operationalizing and democratizing ML (47:54), Final thoughts and takeaways (56:04). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The Data Stack Show
Data Council Week (Ep 1) - The Evolution of Stream Processing With Eric Sammer of Decodable

The Data Stack Show

Play Episode Listen Later Apr 23, 2023 42:40


Highlights from this week's conversation include: Eric's journey to becoming CEO of Decodable (0:20), Does real time matter? (2:12), Differences in stream processing systems (7:57), Processing in motion (13:04), Why haven't there been more open source projects around CDC? (20:34), The Decodable experience and future focuses for the company (24:31), Streaming processing and data lakes (32:54), Data flow processing technologies of today (39:01). The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data. RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Streaming Audio: a Confluent podcast about Apache Kafka
Migrate Your Kafka Cluster with Minimal Downtime

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Mar 1, 2023 61:30 Transcription Available


Migrating Apache Kafka® clusters can be challenging, especially when moving large amounts of data while minimizing downtime. Michael Dunn (Solutions Architect, Confluent) has worked in the data space for many years, designing and managing systems to support high-volume applications. He has helped many organizations strategize, design, and implement successful Kafka cluster migrations between different environments. In this episode, Michael shares some tips about Kafka cluster migration with Kris, including the pros and cons of the different tools he recommends. Michael explains that there are many reasons why companies migrate their Kafka clusters. For example, they may want to modernize their platforms, move to a self-hosted cloud server, or consolidate clusters. He tells Kris that creating a plan and selecting the right tool before getting started is critical for reducing downtime and minimizing migration risks. The good news is that a few tools can facilitate moving large amounts of data, topics, schemas, applications, connectors, and everything else from one Apache Kafka cluster to another. Kafka MirrorMaker/MirrorMaker2 (MM2) is a stand-alone tool for copying data between two Kafka clusters. It uses source and sink connectors to replicate topics from a source cluster into the destination cluster. Confluent Replicator allows you to replicate data from one Kafka cluster to another. Replicator is similar to MM2, but the difference is that it's been battle-tested. Cluster Linking is a powerful tool offered by Confluent that allows you to mirror topics from an Apache Kafka 2.4/Confluent Platform 5.4 source cluster to a Confluent Platform 7+ cluster in a read-only state, and is available as a fully-managed service in Confluent Cloud. At the end of the day, Michael stresses that coupled with a well-thought-out strategy and the right tool, Kafka cluster migration can be relatively painless. Following his advice, you should be able to keep your system healthy and stable before and after the migration is complete. EPISODE LINKS: MirrorMaker 2 | Replicator | Cluster Linking | Schema Migration | Multi-Cluster Apache Kafka with Cluster Linking | Watch the video version of this podcast | Kris Jenkins' Twitter | Streaming Audio Playlist | Join the Confluent Community | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Live demo: Intro to Event-Driven Microservices with Confluent | Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
Real-Time Data Transformation and Analytics with dbt Labs

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Feb 22, 2023 43:41 Transcription Available


dbt is known as being part of the Modern Data Stack for ELT processes. Being in the MDS, dbt Labs believes in having the best of breed for every part of the stack. Oftentimes folks are using an EL tool like Fivetran to pull data from the database into the warehouse, then using dbt to manage the transformations in the warehouse. Analysts can then build dashboards on top of that data, or execute tests. It's possible for an analyst to adapt this process for use with a microservice application using Apache Kafka® and the same method to pull batch data out of each and every database; however, in this episode, Amy Chen (Partner Engineering Manager, dbt Labs) tells Kris about a better way forward for analysts willing to adopt the streaming mindset: reusable pipelines using dbt models that immediately pull events into the warehouse and materialize as views by default. dbt Labs is the company that makes and maintains dbt. dbt Core is the open-source data transformation framework that allows data teams to operate with software engineering's best practices. dbt Cloud is the fastest and most reliable way to deploy dbt. Inside the world of event streaming, there is a push to expand data access beyond the programmers writing the code, and towards everyone involved in the business. Over at dbt Labs they're attempting something of the reverse: to get data analysts to adopt the best practices of software engineers, and more recently, of streaming programmers. They're improving the process of building data pipelines while empowering businesses to bring more contributors into the analytics process, with an easy to deploy, easy to maintain platform. It offers version control to analysts who traditionally don't have access to git, along with the ability to easily automate testing, all in the same place. In this episode, Kris and Amy explore: How to revolutionize testing for analysts with two of dbt's core functionalities | What streaming in a batch-based analytics world should look like | What can be done to improve workflows | How to democratize access to data for everyone in the business. EPISODE LINKS: Learn more about dbt labs | An Analytics Engineer's Guide to Streaming | Panel discussion: If Streaming Is the Answer, Why Are We Still Doing Batch? | All Current 2022 sessions and slides | Watch the video version of this podcast | Kris Jenkins' Twitter | Streaming Audio Playlist | Join the Confluent Community | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Live demo: Intro to Event-Driven Microservices with Confluent | Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
What can Apache Kafka Developers learn from Online Gaming?

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Feb 8, 2023 55:32 Transcription Available


What can online gaming teach us about making large-scale event management more collaborative in real-time? Ben Gamble (Developer Relations Manager, Aiven) has come to the world of real-time event streaming from an unusual source: the video games industry. And if you stop to think about it, modern online games are complex, distributed real-time data systems with decades of innovative techniques to teach us. In this episode, Ben talks with Kris about integrating gaming concepts with Apache Kafka®. Using Kafka's state management stream processing, Ben has built systems that can handle real-time event processing at a massive scale, including interesting approaches to conflict resolution and collaboration. Building latency into a system is one way to mask data processing time. Ben says that you can efficiently hide latency issues and prioritize performance improvements by setting an initial target and then optimizing from there. If you measure before optimizing, you can add an extra layer to manage user expectations better. Tricks like adding a visual progress bar give the appearance of progress but actually hide latency and improve the overall user experience. To effectively handle challenging activities, like resolving conflicts and atomic edits, Ben suggests “slicing” (or nano batching) to break down tasks into small, related chunks. Slicing allows each task to be evaluated separately, thus producing timely outcomes that resolve potential background conflicts without the user knowing. Ben also explains how he uses pooling to make collaboration seamless. Pooling is a process that links open requests with potential matches. Similar to booking seats on an airplane, seats are assigned when requests are made. As these types of connections are handled through a Kafka event stream, the initial open requests are eventually fulfilled when seats become available. According to Ben, real-world tools that facilitate collaboration (such as Google Docs and Slack) work similarly. Just like multi-player gaming systems, multiple users can comment or chat in real-time and users perceive instant responses because of the techniques ported over from the gaming world. As Ben sees it, the proliferation of these types of concepts across disciplines will also benefit a more significant number of collaborative systems. Despite being long established for gamers, these patterns can be implemented in more business applications to improve the user experience significantly. EPISODE LINKS: Going Multiplayer With Kafka - Current 2022 | Building a Dependable Real-Time Betting App with Confluent Cloud and Ably | Event Streaming Patterns | Watch the video version of this podcast | Kris Jenkins' Twitter | Streaming Audio Playlist | Join the Confluent Community | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Live demo: Intro to Event-Driven Microservices with Confluent | Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
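Ben's "slicing" (nano batching) idea, breaking work into small, related chunks that can be resolved independently, can be sketched in a few lines. The example below is a generic illustration with invented document-edit events; it is not an Aiven or Kafka API.

```python
# A tiny sketch of "slicing" / nano batching: group incoming events into small,
# related chunks (here: consecutive events for the same entity, up to a max size)
# and process each chunk as one unit. Generic illustration, not a library API.
from itertools import groupby

MAX_SLICE = 3

def slices(events):
    """Yield small batches of consecutive events that belong to the same entity."""
    for entity, group in groupby(events, key=lambda e: e["entity"]):
        batch = []
        for event in group:
            batch.append(event)
            if len(batch) == MAX_SLICE:       # keep slices small and cheap to resolve
                yield entity, batch
                batch = []
        if batch:
            yield entity, batch

stream = [
    {"entity": "doc-1", "op": "insert", "pos": 4},
    {"entity": "doc-1", "op": "delete", "pos": 9},
    {"entity": "doc-2", "op": "insert", "pos": 0},
]
for entity, batch in slices(stream):
    print(entity, [e["op"] for e in batch])   # each slice can be resolved independently
```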

Streaming Audio: a Confluent podcast about Apache Kafka
Real-Time Machine Learning and Smarter AI with Data Streaming

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Jan 5, 2023 38:56 Transcription Available


Are bad customer experiences really just data integration problems? Can real-time data streaming and machine learning be democratized in order to deliver a better customer experience? Airy, an open-source data-streaming platform, uses Apache Kafka® to help business teams deliver better results to their customers. In this episode, Airy CEO and co-founder Steffen Hoellinger explains how his company is expanding the reach of stream-processing tools and ideas beyond the world of programmers. Airy originally built Conversational AI (chatbot) software and other customer support products for companies to engage with their customers in conversational interfaces. Asynchronous messaging created a large amount of traffic, so the company adopted Kafka to ingest and process all messages & events in real time. In 2020, the co-founders decided to open source the technology, positioning Airy as an open source app framework for conversational teams at large enterprises to ingest and process conversational and customer data in real time. The decision was rooted in their belief that all bad customer experiences are really data integration problems, especially at large enterprises where data often is siloed and not accessible to machine learning models and human agents in real time. (Who hasn't had the experience of entering customer data into an automated system, only to have the same data requested eventually by a human agent?) Airy is making data streaming universally accessible by supplying its clients with real-time data and offering integrations with standard business software. For engineering teams, Airy can reduce development time and increase the robustness of solutions they build. Data is now the cornerstone of most successful businesses, and real-time use cases are becoming more and more important. Open-source app frameworks like Airy are poised to drive massive adoption of event streaming over the years to come, across companies of all sizes, and maybe, eventually, down to consumers. EPISODE LINKS: Learn how to deploy Airy Open Source - or sign up for an Airy Cloud test instance | Google Case Study about Airy & TEDi, a 2,000 store retailer | Become an Expert in Conversational Engineering | Supercharging conversational AI with human agent feedback loops | Integrating all Communication and Customer Data with Airy and Confluent | How to Build and Deploy Scalable Machine Learning in Production with Apache Kafka | Real-Time Threat Detection Using Machine Learning and Apache Kafka | Watch the video | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Live demo: Intro to Event-Driven Microservices with Confluent | Use PODCAST100 to get $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
The Present and Future of Stream Processing

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Dec 28, 2022 31:19 Transcription Available


The past year saw new trends emerge in the world of data streaming technologies, as well as some unexpected and novel use cases for Apache Kafka®. New reflections on the future of stream processing and when companies should adopt microservice architecture inspired several talks at this year's industry conferences. In this episode, Kris is joined by his colleagues Danica Fine, Senior Developer Advocate, and Robin Moffatt, Principal Developer Advocate, for an end-of-year roundtable on this year's developments and what they want to see in the year to come. Robin and Danica kick things off with a discussion of the year's memorable conferences. Talk submissions for Kafka Summit London and Current 2022 featured topics that were noticeably more varied than previous years, with fewer talks focused on the basics of Kafka implementation. Many abstracts featured interesting and unusual use cases, in addition to detailed explanations on what went wrong and how others could avoid the same issues. The conferences also made clear that a lot of companies are adopting or considering stream-processing solutions. Are we close to a future where streaming is a part of everything we do? Is there anything helping streaming become more mainstream? Will stream processing replace batch? On the other hand, a lot of in-demand talks focused on the importance of understanding the best practices supporting data mesh and understanding the nuances of the system and configurations. Danica identifies this as her big hope for next year: No more Kafka developers pursuing quick fixes. "No more band aid fixes. I want as many people as possible to understand the nuances of the levers that they're pulling for Kafka, whatever project they're building." Kris and Robin agree that what will make them happy in 2023 is seeing broader, more diverse client libraries for Kafka. "Getting away from this idea that Kafka is largely a Java shop, which is nonsense, but there is that perception." Streaming Audio returns in January 2023. EPISODE LINKS: Put Your Data To Work: Top 5 Data Technology Trends for 2023 | Write What You Know: Turning Your Apache Kafka Knowledge into a Technical Talk | Common Apache Kafka Mistakes to Avoid | Practical Data Pipeline: Build a Plant Monitoring System with ksqlDB | If Streaming Is the Answer, Why Are We Still Doing Batch? | View sessions and slides from Current 2022 | Watch the video version of this podcast | Kris Jenkins' Twitter | Streaming Audio Playlist | Join the Confluent Community | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Live demo: Intro to Event-Driven Microservices with Confluent | Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
Learn How Stream-Processing Works The Simplest Way Possible

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Dec 20, 2022 31:29 Transcription Available


Could you explain Apache Kafka® in ways that a small child could understand? When Mitch Seymour, author of Mastering Kafka Streams and ksqlDB, wanted a way to communicate the basics of Kafka and event-based stream processing, he decided to author a children's book on the subject, but it turned into something with a far broader appeal. Mitch conceived the idea while writing a traditional manuscript for engineers and technicians interested in building stream processing applications. He wished he could explain what he was writing about to his 2-year-old daughter, and contemplated the best way to introduce the concepts in a way anyone could grasp. Four months later, he had completed the illustration book: Gently Down the Stream: A Gentle Introduction to Apache Kafka. It tells the story of a family of forest-dwelling Otters, who discover that they can use a giant river to communicate with each other. When more Otter families move into the forest, they must learn to adapt their system to handle the increase in activity. This accessible metaphor for how streaming applications work is accompanied by Mitch's warm, painterly illustrations. For his second book, Seymour collaborated with the researcher and software developer Martin Kleppmann, author of Designing Data-Intensive Applications. Kleppmann admired the illustration book and proposed that the next book tackle a gentle introduction to cryptography. Specifically, it would introduce the concepts behind symmetric-key encryption, key exchange protocols, and the Diffie-Hellman algorithm, a method for exchanging secret information over a public channel. Secret Colors tells the story of a pair of Bunnies preparing to attend a school dance, who eagerly exchange notes on potential dates. They realize they need a way of keeping their messages secret, so they develop a technique that allows them to communicate without any chance of other Bunnies intercepting their messages. Mitch's latest illustration book is A Walk to the Cloud: A Gentle Introduction to Fully Managed Environments. In the episode, Seymour discusses his process of creating the books from concept to completion, the decision to create his own publishing company to distribute these books, and whether a fourth book is on the way. He also discusses the experience of illustrating the books side by side with his wife, shares his insights on how editing is similar to coding, and explains why a concise set of commands is equally desirable in SQL queries and children's literature. EPISODE LINKS: Minimizing Software Speciation with ksqlDB and Kafka Streams | Gently Down the Stream: A Gentle Introduction to Apache Kafka | Secret Colors | A Walk to the Cloud: A Gentle Introduction to Fully Managed Environments | Apache Kafka On the Go: Kafka Concepts for Beginners | Apache Kafka 101 course | Watch the video | Join the Confluent Community | Learn more with Kafka tutorials, resources, and guides at Confluent Developer | Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
Building and Designing Events and Event Streams with Apache Kafka

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Dec 15, 2022 53:06 Transcription Available


What are the key factors to consider when developing event-driven architecture? When properly designed, events can connect existing systems with a common language and allow data exchange in near real time. They also help reduce complexity by providing a single source of truth that eliminates the need to synchronize data between different services or applications. They enable dynamic behavior, allowing each service or application to respond quickly to changes in its environment. Using events, developers can create systems that are more reliable, responsive, and easier to maintain.

In this podcast, Adam Bellemare, Staff Technologist at Confluent, discusses the four dimensions of events and designing event streams, along with best practices and an overview of a new course he just authored. This course, called Introduction to Designing Events and Event Streams, walks you through the process of properly designing events and event streams in any event-driven architecture.

Adam explains that the goal of the course is to provide you with a foundation for designing events and event streams. Along with hands-on exercises and best practices, the course explores the four dimensions of events and event stream design and applies them to real-world problems. Most importantly, he talks to Kris about the key factors to consider when deciding what events to write, what events to publish, and how to structure and design them to trigger actions like broadcasting messages to other services or storing results in a database.

How you design and implement events and event streams significantly affects not only what you can do today, but how you scale in the future. Head over to Introduction to Designing Events and Event Streams to learn everything you need to know about building an event-driven architecture.

EPISODE LINKS
Introduction to Designing Events and Event Streams
Practical Data Mesh: Building Decentralized Data Architecture with Event Streams
The Data Dichotomy: Rethinking the Way We Treat Data and Services
Coding in Motion: Sound & Vision—Build a Data Streaming App with JavaScript and Confluent Cloud
Using Event-Driven Design with Apache Kafka Streaming Applications ft. Bobby Calderwood
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
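
As a concrete, purely hypothetical illustration of the kind of design decisions such a course covers, here is a minimal sketch of a domain event modeled as a Python dataclass. The "order placed" domain and all field names are assumptions for the example, not content from the course.

```python
# A minimal, hypothetical "fact"-style event: it records something that
# happened (an order was placed), carries its own identity and timestamp,
# and names a schema version so downstream consumers can evolve safely.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import uuid


@dataclass(frozen=True)
class OrderPlaced:
    event_id: str          # unique id, useful for deduplication
    occurred_at: str       # ISO-8601 event time, not processing time
    order_id: str          # the entity this event is about (good key candidate)
    customer_id: str
    total_cents: int
    schema_version: int = 1


event = OrderPlaced(
    event_id=str(uuid.uuid4()),
    occurred_at=datetime.now(timezone.utc).isoformat(),
    order_id="order-123",
    customer_id="customer-42",
    total_cents=4999,
)

# Serialize for a Kafka topic; order_id would typically be the record key.
payload = json.dumps(asdict(event))
print(payload)
```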

Streaming Audio: a Confluent podcast about Apache Kafka
If Streaming Is the Answer, Why Are We Still Doing Batch?

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Nov 9, 2022 43:58 Transcription Available


Is real-time data streaming the future, or will batch processing always be with us? Interest in streaming data architecture is booming, but just as many teams are still happily batching away. Batch processing is still simpler to implement than stream processing, and successfully moving from batch to streaming requires a significant change to a team's habits and processes, as well as a meaningful upfront investment. Some are even running dbt in micro batches to simulate an effect similar to streaming, without having to make the full transition. Will streaming ever fully take over?

In this episode, Kris talks to a panel of industry experts with decades of experience building and implementing data systems. They discuss the state of streaming adoption today, whether streaming will ever fully replace batch, and whether it even could (or should). Is micro batching the natural stepping stone between batch and streaming? Will there ever be a unified understanding of how data should be processed over time? Is the lack of agreement on best practices for data streaming an insurmountable obstacle to widespread adoption? What exactly is holding teams back from fully adopting a streaming model?

Recorded live at Current 2022: The Next Generation of Kafka Summit, the panel includes Adi Polak (Vice President of Developer Experience, Treeverse), Amy Chen (Partner Engineering Manager, dbt Labs), Eric Sammer (CEO, Decodable), and Tyler Akidau (Principal Software Engineer, Snowflake).

EPISODE LINKS
dbt Labs
Decodable
lakeFS
Snowflake
View sessions and slides from Current 2022
Stream Processing vs. Batch Processing: What to Know
From Batch to Real-Time: Tips for Streaming Data Pipelines with Apache Kafka ft. Danica Fine
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

The Data Stack Show
112: Python Native Stream Processing with Zander Matheson of bytewax

The Data Stack Show

Play Episode Listen Later Nov 9, 2022 50:06


Highlights from this week's conversation include:
Zander's background and career journey (2:32)
Introducing bytewax (5:16)
The difference between systems (10:57)
Bytewax's most common use cases (16:15)
How bytewax integrates with other systems (20:25)
The technology that makes up bytewax (24:31)
Comparing bytewax to other systems (34:17)
What's next for bytewax (36:31)

Try it out: bytewax.io

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The Data Stack Show
The PRQL: Who Needs a Stream Processing Engine?

The Data Stack Show

Play Episode Listen Later Nov 7, 2022 5:10


In this bonus episode, Eric and Kostas preview their upcoming conversation with Zander Matheson of bytewax.

Streaming Audio: a Confluent podcast about Apache Kafka
Security for Real-Time Data Stream Processing with Confluent Cloud

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Nov 3, 2022 48:33 Transcription Available


Streaming real-time data at scale and processing it efficiently is critical to cybersecurity organizations like SecurityScorecard. Jared Smith, Senior Director of Threat Intelligence, and Brandon Brown, Senior Staff Software Engineer, Data Platform at SecurityScorecard, discuss their journey from RabbitMQ to open-source Apache Kafka® for stream processing, and why turning to fully managed Kafka on Confluent Cloud is the right choice for building real-time data pipelines at scale.

SecurityScorecard mines data from dozens of digital sources to discover security risks and flaws with the potential to expose their clients' data. This includes scanning and ingesting data from a large number of ports to identify suspicious IP addresses, exposed servers, out-of-date endpoints, malware-infected devices, and other potential cyber threats for more than 12 million companies worldwide.

To enable real-time stream processing for the organization, the team moved away from RabbitMQ to open-source Kafka, processing a massive amount of data in a matter of milliseconds instead of weeks or months. This makes it possible to detect a website's security posture risk quickly as security threats constantly evolve. Previously, the team relied on batch pipelines to push data to and from Amazon S3, as well as expensive REST API-based communication carrying data between systems. They also spent significant time and resources on open-source Kafka upgrades on Amazon MSK.

Self-maintaining the Kafka infrastructure increased operational overhead with escalating costs. In order to scale faster, govern data better, and ultimately lower the total cost of ownership (TCO), Brandon, lead of the organization's Pipeline team, pivoted towards a fully managed, cloud-native approach for more scalable streaming data pipelines and for the development of a new Automatic Vendor Detection (AVD) product. Jared and Brandon continue to leverage the Cloud for use cases including using PostgreSQL and pushing data to downstream systems using CSC connectors, increasing data governance and security for streaming scalability, and more.

EPISODE LINKS
SecurityScorecard Case Study
Building Data Pipelines with Apache Kafka and Confluent
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
Build a Real Time AI Data Platform with Apache Kafka

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Oct 20, 2022 37:18 Transcription Available


Is it possible to build a real-time data platform without using stateful stream processing? Forecasty.ai is an artificial intelligence platform for forecasting commodity prices, imparting insights into the future valuations of raw materials for users. Nearly all AI models are batch-trained once, but precious commodities are linked to ever-fluctuating global financial markets, which require real-time insights. In this episode, Ralph Debusmann (CTO, Forecasty.ai) shares their journey of migrating from a batch machine learning platform to a real-time event streaming system with Apache Kafka® and delves into their approach to making the transition frictionless.

Ralph explains that Forecasty.ai was initially built on top of batch processing; however, updating the models with batch-data syncs was costly and environmentally taxing. There was also the question of scalability: progressing from 60 commodities on offer to their eventual plan of over 200 commodities. Ralph observed that most real-time platforms are built as streaming systems with stateful stream processing, using Kafka Streams, Apache Flink®, or even Apache Samza. However, stateful stream processing demands resources, such as a team of stream processing specialists, to solve the task.

With the existing team, Ralph decided to build a real-time data platform without using any sort of stateful stream processing. They strictly keep to out-of-the-box components, such as Kafka topics, the Kafka Producer API, the Kafka Consumer API, and other Kafka connectors, along with a real-time database to process data streams and implement the necessary joins inside the database.

Additionally, Ralph shares the tool he built to handle historical data, kash.py, a Kafka shell based on Python; discusses the issues the platform needed to overcome for success; and explains how the migration from batch processing to stream processing can be made painless for the data science team.

EPISODE LINKS
Kafka Streams 101 course
The Difference Engine for Unlocking the Kafka Black Box
GitHub repo: kash.py
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
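
To make the "plain Consumer API plus a database" pattern concrete, here is a minimal sketch (not Forecasty.ai's actual code) using the confluent-kafka Python client and SQLite. The broker address, topic name, table names, and message layout are all assumptions for illustration.

```python
# Minimal sketch: consume raw price events with the plain Kafka Consumer API
# and let a database do the join/enrichment, instead of a stateful stream
# processor. Topic/table names and the JSON layout are illustrative only.
import json
import sqlite3
from confluent_kafka import Consumer

db = sqlite3.connect("commodities.db")
db.execute("CREATE TABLE IF NOT EXISTS prices (commodity TEXT, ts TEXT, price REAL)")
db.execute("CREATE TABLE IF NOT EXISTS metadata (commodity TEXT PRIMARY KEY, unit TEXT)")

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # assumed broker address
    "group.id": "price-loader",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["commodity-prices"])      # assumed topic name

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        db.execute(
            "INSERT INTO prices VALUES (?, ?, ?)",
            (event["commodity"], event["ts"], event["price"]),
        )
        db.commit()
        # The "join" happens in SQL, not in a stream processor.
        row = db.execute(
            "SELECT p.commodity, p.price, m.unit FROM prices p "
            "JOIN metadata m ON p.commodity = m.commodity "
            "WHERE p.commodity = ? ORDER BY p.ts DESC LIMIT 1",
            (event["commodity"],),
        ).fetchone()
        print("latest enriched reading:", row)
finally:
    consumer.close()
```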

Streaming Audio: a Confluent podcast about Apache Kafka
Application Data Streaming with Apache Kafka and Swim

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Oct 3, 2022 39:10 Transcription Available


How do you set data applications in motion by running stateful business logic on streaming data? Capturing key stream processing events and cumulative statistics that necessitate real-time data assessment, migration, and visualization remains a gap for event-driven systems and stream processing frameworks, according to Fred Patton (Developer Evangelist, Swim Inc.). In this episode, Fred explains streaming applications and how they contrast with stream processing applications. Fred and Kris also discuss how you can use Apache Kafka® and Swim to build a real-time UI for streaming data.

Swim's technology facilitates relationships between streaming data from distributed sources and complex UIs, managing backpressure cumulatively so that front ends don't get overwhelmed. They are focused on real-time, actionable insights, as opposed to those derived from historical data. Fred compares Swim's functionality to the speed layer in the Lambda architecture model, which is specifically concerned with serving real-time views. For this reason, when sending your data to Swim, it is common to also send a copy to a data warehouse that you control.

A web agent, a data entity in the Swim ecosystem, can be as small as a single cellphone or as large as a whole cellular network. Web agents communicate with one another as well as with their subscribers, and each one is a URI that can be called by a browser or the command line. Swim has been designed to instantaneously accommodate requests at widely varying levels of granularity, each of which demands a completely different volume of data. Thus, as you drill down, for example, from a city view on a map into a neighborhood view, the Swim system figures out which web agent is responsible for the view you are requesting, as well as the other web agents needed to show it.

Fred also shares an example of working with a telephony company that requires real-time statuses for a network infrastructure with thousands of cell towers servicing millions of devices, along with a use case for a transportation company needing to transform raw edge data into actionable insights for its connected vehicle customers. Future plans for Swim include porting more functionality to the cloud, which will enable additional automation, so that, for example, a customer just has to provide database and Kafka cluster connections, and Swim can automatically build out infrastructure.

EPISODE LINKS
Swim Cellular Network Simulator
Continuous Intelligence - Streaming Apps That Are Always in Sync
Using Swim with Apache Kafka
Swim Developer
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)

Open||Source||Data
Stream Processing, Observability, and the User Experience with Eric Sammer

Open||Source||Data

Play Episode Listen Later Sep 28, 2022 42:51


This episode features an interview with Eric Sammer, CEO of Decodable. Eric has been in the tech industry for over 20 years, holding various roles as an early Cloudera employee. He also was the co-founder and CTO of Rocana, which was acquired by Splunk in 2017. During his time at Splunk, Eric served as the VP and Senior Distinguished Engineer responsible for cloud platform services.

In this episode, Sam and Eric discuss the gap between operating infrastructure and the analytical world, stream processing innovations, and why it's important to work with people who are smarter than you.

-------------------

"The thing about Decodable was just like let's connect systems, let's process the data between them. Apache Flink is the right engine and SQL is the language for programming the engine. It doesn't need to be any more complicated. The trick is getting it right, so that people can think about that part of the data infrastructure, the way they think about the network. They don't question whether the packet makes it to the other side because that infrastructure is so burned in and it scales reasonably well these days. You don't even think about it, especially in the cloud." – Eric Sammer

-------------------

Episode Timestamps:
(01:09): What open source data means to Eric
(06:57): What led Eric to Cloudera and Hadoop
(12:48): What inspired Eric to create Rocana
(20:29): The problem Eric is trying to solve with Flink
(29:54): What problems in stream processing we'll have to solve in the next 5 years
(36:58): Eric's advice for advancing your career

-------------------

Links:
LinkedIn - Connect with Eric
Twitter - Follow Eric
Twitter - Follow Decodable
Decodable

Streaming Audio: a Confluent podcast about Apache Kafka
How to Build a Reactive Event Streaming App - Coding in Motion

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Sep 20, 2022 1:26 Transcription Available


How do you build an event-driven application that can react to real-time data streams as they happen? Kris Jenkins (Senior Developer Advocate, Confluent) will be hosting another fun, hands-on programming workshop, Coding in Motion: Watching the River Flow, to demonstrate how you can build a reactive event streaming application with Apache Kafka® and ksqlDB, using Python.

As a developer advocate, Kris often speaks at conferences, and the presentations are available on-demand through the organizers' YouTube channels. The desire to read comments and interact with the community motivated Kris to set up a real-time event streaming application that would notify him on his mobile phone.

During the workshop, Kris will demonstrate the end-to-end process of using Python to process and stream data from YouTube's REST API into a Kafka topic, analyze the data with ksqlDB, and then stream data out via Telegram. After the workshop, you'll be able to use the recipe to build your own event-driven data application.

EPISODE LINKS
Coding in Motion: Building a Reactive Data Streaming App
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
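
The "REST API into a Kafka topic" step might look roughly like the following sketch. It is not the workshop's actual code; the YouTube endpoint usage, API key handling, topic name, and polling interval are all assumptions for illustration.

```python
# Sketch: poll a REST API (here, YouTube's videos endpoint) and produce each
# snapshot of the statistics into a Kafka topic. Endpoint, topic name, and
# credentials handling are illustrative assumptions, not the workshop's code.
import json
import os
import time

import requests
from confluent_kafka import Producer

API_KEY = os.environ["YOUTUBE_API_KEY"]        # assumed environment variable
VIDEO_ID = "dQw4w9WgXcQ"                       # any video id you want to watch
TOPIC = "youtube_stats"                        # assumed topic name

producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

while True:
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/videos",
        params={"part": "statistics", "id": VIDEO_ID, "key": API_KEY},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if items:
        stats = items[0]["statistics"]          # e.g. viewCount, commentCount
        producer.produce(TOPIC, key=VIDEO_ID, value=json.dumps(stats))
        producer.flush()
    time.sleep(60)                              # poll once a minute
```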

Streaming Audio: a Confluent podcast about Apache Kafka
Real-Time Stream Processing, Monitoring, and Analytics With Apache Kafka

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later Sep 15, 2022 34:07 Transcription Available


Processing real-time event streams enables countless use cases big and small. With a day job designing and building highly available distributed data systems, Simon Aubury (Principal Data Engineer, Thoughtworks) believes stream-processing thinking can be applied to any stream of events.

In this episode, Simon shares his Confluent Hackathon '22 winning project, a wildlife monitoring system to observe population trends over time using a Raspberry Pi, along with Apache Kafka®, Kafka Connect, ksqlDB, TensorFlow Lite, and Kibana. He used the system to count animals in his Australian backyard and perform trend analysis on the results. Simon also shares ideas on how you can use these same technologies to help with other real-world challenges.

Open-source object detection models for TensorFlow, which appropriately are collected into "model zoos," meant that Simon didn't have to provide his own object identification as part of the project, which would have made it untenable. Instead, he was able to use the open-source models, which are essentially neural nets pretrained on relevant data sets, in his case, backyard animals.

Simon's system, which consists of around 200 lines of code, employs a Kafka producer running a while loop, which connects to a camera feed using a Python library. For each frame brought down, object masking is applied in order to crop and reduce pixel density, and then the frame is compared to the models mentioned above. A Python dictionary containing probable found objects is sent to a Kafka broker for processing; the images themselves aren't sent. (Note that Simon's system is also capable of alerting if a specific, rare animal is detected.)

On the broker side, Simon uses ksqlDB and windowing to smooth the data in case the frames were inconsistent for some reason (it may look back over thirty seconds, for example, and find the highest number of animals per type). Finally, the data is sent to a Kibana dashboard for analysis, through a Kafka Connect sink connector.

Simon's setup is extremely low-cost and can simulate the behavior of more expensive, proprietary systems. And the concepts can easily be applied to many other use cases. For example, you could use it to estimate traffic at a shopping mall to gauge optimal opening hours, or you could use it to monitor the queue at a coffee shop, counting both queued patrons as well as impatient patrons who decide to leave because the queue is too long.

EPISODE LINKS
Real-Time Wildlife Monitoring with Apache Kafka
Wildlife Monitoring Github
ksqlDB Fundamentals: How Apache Kafka, SQL, and ksqlDB Work Together
Event-Driven Architecture - Common Mistakes and Valuable Lessons
Watch the video version of this podcast
Kris Jenkins' Twitter
Join the Confluent Community
Learn more on Confluent Developer
Use PODCAST100 to get $100 of free Confluent Cloud usage (details)
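
A stripped-down version of that producer loop might look like the sketch below. It is not Simon's code: the detection helper is a stand-in for a TensorFlow Lite model, and the broker address, topic name, and camera index are assumptions.

```python
# Sketch of a capture-detect-produce loop (NOT the project's actual code).
# detect_animals() stands in for a TensorFlow Lite model from a "model zoo";
# the topic name, broker address, and camera index are illustrative assumptions.
import json
import time

import cv2
from confluent_kafka import Producer


def detect_animals(frame) -> dict:
    """Placeholder for TFLite inference: return counts of detected animals,
    e.g. {"magpie": 2, "possum": 1}. A real implementation would run a
    pretrained object-detection model over the (cropped, downscaled) frame."""
    return {}


producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker
camera = cv2.VideoCapture(0)                                  # default camera

while True:
    ok, frame = camera.read()
    if not ok:
        time.sleep(1)
        continue
    # Crop / downscale to cut pixel density before inference.
    small = cv2.resize(frame, (320, 240))
    counts = detect_animals(small)
    if counts:
        # Only the detection summary is sent to Kafka, never the image itself.
        event = {"ts": time.time(), "animals": counts}
        producer.produce("wildlife_detections", value=json.dumps(event))
        producer.poll(0)   # serve delivery callbacks without blocking
```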

Catalog & Cocktails
Catalog & Cocktails: Bonus Episode with John Kutay

Catalog & Cocktails

Play Episode Listen Later Sep 3, 2022 61:25


Lucid streaming: how to take full advantage of data streaming.
Data streaming, stream processing, real-time analytics, operational analytics: what are these? What's the difference?
Most important use cases for data streaming.
There are lots of misconceptions, especially for the MDS crowd (not as much enterprise), between fast batch vs. streaming.
Memory-first (in-memory) processing vs. disk-based batch jobs.
Change data capture (and capture of only the changes).
Data warehouses are now trying to support streaming more (like Snowflake); this will be a big deal in making it possible for more streaming to happen.
Streaming warehouses (Rockset, Materialize) vs. data streaming.
Lineage and transformed data: can I trust this data I'm looking at?
How do data streaming and lineage come together? What's unique about lineage in a streaming context?
If time: what does it mean to do streaming data products in a data mesh context?

Engenharia de Dados [Cast]
Cloudera CDP: Plataforma de Cloud Híbrida para Dados

Engenharia de Dados [Cast]

Play Episode Listen Later Jul 12, 2022 62:38


In this episode, with two of Brazil's foremost specialists on the subject, Thiago Santiago and Gustavo Gattass, we talk about Cloudera's new data platform, which as always brings innovation to the Big Data and Analytics market. Doug Cutting, creator of the famous Apache Hadoop system, made massive-scale data processing possible back in 2006, and now Cloudera's new unified platform, CDP, brings the following major benefits to its users:
Hybrid cloud
Cloudera SDX as a unified deployment platform with Kubernetes
Data engineering and data science as a unified delivery product
Data warehousing and data visualization
Understand the future of data engineering and data science on a platform whose main goal is to deliver a complete end-to-end solution: come aboard Cloudera CDP.

Thiago Santiago = https://www.linkedin.com/in/thiagosantiago/
Gustavo Gattass = https://www.linkedin.com/in/ggattass/

On YouTube we have a Data Engineering channel covering the most important topics in the field, with live streams every Wednesday.
https://www.youtube.com/channel/UCnErAicaumKqIo4sanLo7vQ

Want to keep up with this field through weekly posts and updates? Then follow along on LinkedIn so you don't miss any news.
https://www.linkedin.com/in/luanmoreno/

Available on Spotify and Apple Podcasts
https://open.spotify.com/show/5n9mOmAcjra9KbhKYpOMqY
https://podcasts.apple.com/br/podcast/engenharia-de-dados-cast/

Luan Moreno = https://www.linkedin.com/in/luanmoreno/

The Python Podcast.__init__
Stream Processing In Real Time And At Scale In Pure Python With Bytewax

The Python Podcast.__init__

Play Episode Listen Later Jul 10, 2022 42:32


Analysis of streaming data in real time has long been the domain of big data frameworks, predominantly written in Java. Taking advantage of those capabilities from Python requires client libraries that suffer from impedance mismatches, making the work harder than necessary. Bytewax is a new open source platform for writing stream processing applications in pure Python that don't have to be translated into foreign idioms. In this episode, Bytewax founder Zander Matheson explains how the system works and how to get started with it today.
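
To give a flavor of what "stream processing in pure Python" means conceptually, here is a tiny generator-based sketch of a map-filter-count dataflow. Note that this is deliberately generic and is not Bytewax's actual API; it only illustrates the style of computation such engines express.

```python
# Conceptual sketch of a dataflow over an unbounded stream in plain Python:
# parse -> filter -> running count per key. This is NOT the Bytewax API;
# it only illustrates the style of computation such engines express.
import json
from collections import Counter


def parse(lines):
    """Deserialize each raw line into a dict event."""
    for line in lines:
        yield json.loads(line)


def only_errors(events):
    """Keep only ERROR-level events (a filter step)."""
    for event in events:
        if event.get("level") == "ERROR":
            yield event


def running_counts(events):
    """Stateful step: emit a running error count per service."""
    counts = Counter()
    for event in events:
        counts[event["service"]] += 1
        yield event["service"], counts[event["service"]]


if __name__ == "__main__":
    stream = iter([  # stand-in for an unbounded source such as a Kafka topic
        '{"service": "api", "level": "ERROR"}',
        '{"service": "db",  "level": "INFO"}',
        '{"service": "api", "level": "ERROR"}',
    ])
    for service, count in running_counts(only_errors(parse(stream))):
        print(f"{service}: {count} errors so far")
```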

Streaming Audio: a Confluent podcast about Apache Kafka
Flink vs Kafka Streams/ksqlDB: Comparing Stream Processing Tools

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later May 26, 2022 55:55 Transcription Available


Stream processing can be hard or easy depending on the approach you take and the tools you choose. This sentiment is at the heart of the discussion with Matthias J. Sax (Apache Kafka® PMC member; Software Engineer, ksqlDB and Kafka Streams, Confluent) and Jeff Bean (Sr. Technical Marketing Manager, Confluent). With immense collective experience in Kafka, ksqlDB, Kafka Streams, and Apache Flink®, they delve into the types of stream processing operations and explain the different ways of solving for their respective issues.

The best stream processing tools they consider are Flink along with the options from the Kafka ecosystem: Java-based Kafka Streams and its SQL-wrapped variant, ksqlDB. Flink and ksqlDB tend to be used by divergent types of teams, since they differ in terms of both design and philosophy.

Why use Apache Flink?
The teams using Flink are often highly specialized, with deep expertise, and with an absolute focus on stream processing. They tend to be responsible for unusually large, industry-outlying amounts of both state and scale, and they usually require complex aggregations. Flink can excel in these use cases, which potentially makes the difficulty of its learning curve and implementation worthwhile.

Why use ksqlDB/Kafka Streams?
Conversely, teams employing ksqlDB/Kafka Streams require less expertise to get started and also less expertise and time to manage their solutions. Jeff notes that the skills of a developer may not even be needed in some cases; those of a data analyst may suffice. ksqlDB and Kafka Streams seamlessly integrate with Kafka itself, as well as with external systems through the use of Kafka Connect. In addition to being easy to adopt, ksqlDB is also deployed on production stream processing applications requiring large scale and state.

There are also other considerations beyond the strictly architectural. Local support availability, the administrative overhead of using a library versus a separate framework, and the availability of stream processing as a fully managed service all matter. Choosing a stream processing tool is a fraught decision, partially because switching between them isn't trivial: the frameworks are different, the APIs are different, and the interfaces are different. In addition to the high-level discussion, Jeff and Matthias also share lots of details you can use to understand the options, covering deployment models, transactions, batching, and parallelism, as well as a few interesting tangential topics along the way, such as the tyranny of state and the Turing completeness of SQL.

EPISODE LINKS
The Future of SQL: Databases Meet Stream Processing
Building Real-Time Event Streams in the Cloud, On Premises
Kafka Streams 101 course
ksqlDB 101 course
Watch the video version of this podcast
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more on Confluent Developer
Use PODCAST100 for additional $100 of Confluent Cloud usage (details)

Streaming Audio: a Confluent podcast about Apache Kafka
Practical Data Pipeline: Build a Plant Monitoring System with ksqlDB

Streaming Audio: a Confluent podcast about Apache Kafka

Play Episode Listen Later May 19, 2022 33:56 Transcription Available


Apache Kafka® isn't just for day jobs, according to Danica Fine (Senior Developer Advocate, Confluent). It can be used to make life easier at home, too!

Building out a practical Apache Kafka® data pipeline is not always complicated; it can be simple and fun. For Danica, the idea of building a Kafka-based data pipeline sprouted from the need to monitor the water level of her plants at home. In this episode, she explains the architecture of her hardware-oriented project and discusses how she integrates, processes, and enriches data using ksqlDB and Kafka Connect, a Raspberry Pi running Confluent's Python client, and a Telegram bot. Apart from the script on the Raspberry Pi, the entire project was coded within Confluent Cloud.

Danica's model Kafka pipeline begins with moisture sensors in her plants streaming data that is requested by an endless for-loop in a Python script on her Raspberry Pi. The Pi in turn connects to Kafka on Confluent Cloud, where the plant data is sent serialized as Avro. She carefully modeled her data, sending an ID along with a timestamp, a temperature reading, and a moisture reading. On Confluent Cloud, Danica enriches the streaming plant data, which enters as a ksqlDB stream, with metadata such as moisture threshold levels, which are stored in a ksqlDB table.

She windows the streaming data into 12-hour segments in order to avoid constant alerts when a threshold has been crossed. Alerts are sent at the end of the 12-hour period if a threshold has been breached for a consistent time period within it (one hour, for example). These are sent to the Telegram API using Confluent Cloud's HTTP Sink Connector, which pings her phone when a plant's moisture level is too low.

Potential future improvements include visualizations, adding another Telegram bot to register metadata for new plants, adding machine learning to anticipate watering needs, and potentially closing the loop by pushing data back to the Raspberry Pi, which could power a visual indicator on the plants themselves.

EPISODE LINKS
GitHub: raspberrypi-houseplants
Data Pipelines 101 course
Tips for Streaming Data Pipelines ft. Danica Fine
Watch the video version of this podcast
Danica Fine's Twitter
Kris Jenkins' Twitter
Streaming Audio Playlist
Join the Confluent Community
Learn more with Kafka tutorials, resources, and guides at Confluent Developer
Live demo: Intro to Event-Driven Microservices with Confluent
Use PODCAST100 to get an additional $100 of free Confluent Cloud usage (details)
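
To make the windowing idea tangible, here is a small plain-Python sketch that approximates what the described ksqlDB windowing does: it groups readings into 12-hour windows per plant and flags a window only once readings have stayed below the threshold for at least an hour. All names, intervals, and thresholds are assumptions for illustration, not the project's actual queries.

```python
# Plain-Python approximation of the 12-hour window / threshold-duration check
# described above (the real project does this in ksqlDB). Field names and the
# example thresholds are illustrative assumptions.
from collections import defaultdict

WINDOW_SECONDS = 12 * 60 * 60     # tumbling 12-hour windows
ALERT_SECONDS = 60 * 60           # must stay below threshold for >= 1 hour
READ_INTERVAL = 10 * 60           # assume one reading every 10 minutes

# In the real pipeline these thresholds live in the metadata ksqlDB table.
moisture_thresholds = {"monstera-1": 0.35, "calathea-2": 0.45}

low_streak = defaultdict(int)     # (plant, window start) -> consecutive low readings
alerted_windows = set()


def on_reading(plant_id: str, ts: float, moisture: float) -> None:
    """Process one sensor reading; alert at most once per plant per window."""
    window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
    key = (plant_id, window_start)

    if moisture < moisture_thresholds.get(plant_id, 0.0):
        low_streak[key] += 1
    else:
        low_streak[key] = 0       # streak broken; the plant recovered

    dry_for = low_streak[key] * READ_INTERVAL
    if dry_for >= ALERT_SECONDS and key not in alerted_windows:
        alerted_windows.add(key)
        print(f"ALERT: {plant_id} below threshold for ~{dry_for // 60} minutes")


# Example usage with a couple of fabricated readings:
on_reading("monstera-1", 1_700_000_000, 0.20)
on_reading("monstera-1", 1_700_000_600, 0.22)
```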

A Bootiful Podcast
Spring Cloud legend Glenn Renfro about batch processing, tasks, stream processing, data flow, and t-shirts

A Bootiful Podcast

Play Episode Listen Later Apr 22, 2022 71:42


Hi, Spring fans! Welcome to another installment of  A Bootiful Podcast! In this installment, Josh Long (@starbuxman) talks to Spring Cloud luminary and all-around lovable guy Glenn Renfro (@cppwfs) about batch processing, tasks, messaging, integration, data flow, and a million other things. Also: t-shirts!

Datacast
Episode 85: Ad Exchange, Stream Processing, and Data Discovery Platform with Shinji Kim

Datacast

Play Episode Listen Later Mar 2, 2022 71:09


Show Notes
(02:00) Shinji reflected on her academic experience studying Software Engineering at the University of Waterloo in the late 2000s.
(04:19) Shinji shared valuable lessons learned from her undergraduate co-op experience with statistical analysis at Sun Microsystems, software engineering at Barclays Capital, and growth marketing at Facebook.
(08:52) Shinji shared lessons learned from being a Management Consultant at Deloitte.
(14:01) Shinji revisited her decision to quit the job at Deloitte and create a social puzzle game called Shufflepix.
(17:42) Shinji went over her time working as a Product Manager at the mobile ad exchange network YieldMo.
(22:25) Shinji discussed the problem of stream processing at YieldMo, which sparked the creation of Concord.
(26:17) Shinji unpacked the pain points with existing stream processing frameworks and the competitive advantage of using Concord.
(33:19) Shinji recalled her time at Akamai — initially as a data engineer in the Platform Engineering unit and later as a product manager for the IoT Edge Connect platform.
(37:26) Shinji explained why sharing context knowledge around data remains a largely unsolved problem.
(42:07) Shinji unpacked the three capabilities of an ideal data discovery platform: (1) exposing up-to-date operational metadata along with the documentation, (2) tracking the provenance of data back to its source, and (3) guiding data usage.
(46:59) Shinji unpacked the benefits of plugging BI tools into data discovery platforms and collecting metadata, which facilitates better visibility and understanding.
(52:36) Shinji discussed the role of a data discovery platform within the modern data stack.
(53:59) Shinji shared the hurdles that her team has to go through while finding early adopters of Select Star.
(55:48) Shinji shared valuable hiring lessons learned at Select Star.
(01:00:00) Shinji shared fundraising advice for founders currently seeking the right investors for their startups.
(01:04:41) Closing segment.

Shinji's Contact Info
LinkedIn
Twitter
Medium

Select Star's Resources
Website
Blog
LinkedIn | Twitter | Medium

Mentioned Content
Articles
“The Next Evolution of Data Catalogs: Data Discovery Platforms” (Feb 2021)
“Data Discovery for Business Intelligence” (May 2021)
People
Martin Kleppmann (Author of Designing Data-Intensive Applications)
Emily Riederer (Senior Analytics Manager at Capital One)
Anya Prosvetova (Tableau DataDev Ambassador)
Book
“Managing Oneself” (by Peter Drucker)

Notes
My conversation with Shinji was recorded back in July 2021. Since then, many things have happened at Select Star:
General Availability launch on Product Hunt: https://www.producthunt.com/posts/selectstar
Snowflake partnership on data governance: https://blog.selectstar.com/selectstar-and-snowflake-partner-to-take-data-governance-to-a-new-level-a9d274e1d4c6
Case studies with Pitney Bowes and Handshake

About the show
Datacast features long-form, in-depth conversations with practitioners and researchers in the data community to walk through their professional journeys and unpack the lessons learned along the way. I invite guests coming from a wide range of career paths — from scientists and analysts to founders and investors — to analyze the case for using data in the real world and extract their mental models (“the WHY and the HOW”) behind their pursuits. Hopefully, these conversations can serve as valuable tools for early-stage data professionals as they navigate their own careers in the exciting data universe.

Datacast is produced and edited by James Le. Get in touch with feedback or guest suggestions by emailing khanhle.1013@gmail.com.

Subscribe by searching for Datacast wherever you get podcasts or click one of the links below:
Listen on Spotify
Listen on Apple Podcasts
Listen on Google Podcasts

If you're new, see the podcast homepage for the most recent episodes to listen to, or browse the full guest list.

The Data Stack Show
60: Architecting a Boring Stream Processing Tool With Ashley Jeffs of Benthos

The Data Stack Show

Play Episode Listen Later Nov 3, 2021 66:54


Highlights from this week's conversation include:
A brief overview of Ashley's background (2:47)
Benthos' creation and the problems it was meant to address (4:01)
Use cases for Benthos (18:25)
Key features of Benthos that make it stand out (22:23)
Adding windowing to Benthos for fun (29:23)
The highs and lows of maintaining an open source project for five years (32:17)
The architecture of Benthos (36:23)
The importance of ordering in streaming processing (42:15)
Gaining traction with an open source project (53:21)
Benthos' blobfish mascot (58:03)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The Data Stack Show
56: Stream Processing and Observability with Jeff Chao of Stripe

The Data Stack Show

Play Episode Listen Later Oct 6, 2021 63:55


Highlights from this week's conversation include:
Jeff's history with stream processing (2:52)
Working with Mantis to address the impact of Netflix downtime (4:20)
Defining observability as operational insight (6:58)
Time series data and the value of data today (18:52)
Data integration's shift from batch to streaming (29:34)
The current state of change data capture (32:20)
How an engineer thinks of the end-user (56:21)

The Data Stack Show is a weekly podcast powered by RudderStack, the CDP for developers. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

The Data Stack Show
46: A New Paradigm in Stream Processing with Arjun Narayan of Materialize

The Data Stack Show

Play Episode Listen Later Jul 28, 2021 56:13


Highlights from this week's episode include:
Introducing Arjun and how he fell in love with databases (2:51)
Looking at what Materialize brings to the stack (5:28)
Analytics starts with a human in the loop and comes into its own when analysts get themselves out and automate it (15:46)
Using Materialize instead of the materialized view from another tool (18:44)
Comparing Postgres and Materialize and looking at what's under the hood of Materialize (23:16)
Making Materialize simple to use (32:33)
Why Materialize doubled down on writing 100% in Rust (35:43)
The best use case to start with (42:03)
Lessons learned from making Materialize a cloud offering (44:22)
Keeping databases to the cloud for low latency (48:31)

The Data Stack Show is a weekly podcast powered by RudderStack. Each week we'll talk to data engineers, analysts, and data scientists about their experience around building and maintaining data infrastructure, delivering data and data products, and driving better outcomes across their businesses with data.

RudderStack helps businesses make the most out of their customer data while ensuring data privacy and security. To learn more about RudderStack visit rudderstack.com.

Engenharia de Dados [Cast]
Strimzi - Adding Intelligence on Your Kafka on Kubernetes Deployment with Jakub Scholz

Engenharia de Dados [Cast]

Play Episode Listen Later May 6, 2021 70:42


In this special episode, we interview one of the creators of the Strimzi project (Apache Kafka on Kubernetes), Jakub Scholz, who tells us a bit of the history of the Strimzi operator.

Some of the points discussed in this interview:
* Apache Kafka on Kubernetes
* The Strimzi operator and its characteristics
* Scenarios and usage
* Apache Kafka and microservices
* Apache Kafka deployment types
* Benefits of removing Apache ZooKeeper
* New features in Strimzi
* Tips and recommendations

We also talk about the broad movement of companies adopting Kubernetes for stateful applications, and how Strimzi can simplify deploying Apache Kafka so that your journey is lighter and more fun.

Luan Moreno = https://www.linkedin.com/in/luanmoreno/