Podcasts about Maxime Beauchemin

  • 15 podcasts
  • 26 episodes
  • 54m average duration
  • Infrequent episodes
  • Latest episode: May 21, 2024

POPULARITY

[Popularity chart, 2017–2024]


Best podcasts about Maxime Beauchemin

Latest podcast episodes about Maxime Beauchemin

Software Misadventures
Building 2 Iconic OSSs Back-to-Back | Maxime Beauchemin (Airflow, Preset)

Software Misadventures

May 21, 2024 · 58:55


If you've worked on data problems, you have probably heard of Airflow and Superset, two powerful tools that have cemented their place in the data ecosystem. Building successful open-source software is no easy feat, and even fewer engineers have done it back to back. In part 2 of the conversation, we talk about Max's journey in open source.

Segments:
(00:03:27) "Project-Community Fit" in Open Source
(00:08:31) Fostering Relationships in Open Source
(00:10:58) Dealing with Trolls
(00:13:40) Attributes of Good Open Source Contributors
(00:20:01) How to Get Started with Contributing
(00:27:58) Origin Stories of Airflow and Superset
(00:33:27) Biggest Surprise since Founding a VC-backed Company?
(00:38:47) Picking What to Work On
(00:41:46) Advice to Engineers for Building the Next Airflow/Superset?
(00:42:35) The 2 New Open Source Projects that Max is Starting
(00:52:10) Challenges of Being a Founder
(00:57:38) Open Sourcing Ideas

Show notes:
Part 1 of our conversation: https://softwaremisadventures.com/p/maxime-beauchemin-llm-ready
Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/
SQL All Stars: https://github.com/preset-io/allstars
Governator: https://github.com/mistercrunch/governator

Software Misadventures
Become a LLM-ready Engineer | Maxime Beauchemin (Airflow, Preset)

Software Misadventures

May 14, 2024 · 41:05


If you've worked on data problems, you have probably heard of Airflow and Superset, two powerful tools that have cemented their place in the data ecosystem. Building successful open-source software is no easy feat, and even fewer engineers have done it back to back. In Part 1 of this conversation, we chat about how to adapt to the LLM age as engineers.

Segments:
(00:01:59) The Rise and Fall of the Data Engineer
(00:11:13) The Importance of Executive Skill in the Era of AI
(00:13:53) Developing the first reflex to use AI
(00:17:47) What are LLMs good at?
(00:25:33) Text to SQL
(00:28:19) Promptimize
(00:32:16) Using tools like LangChain
(00:35:02) Writing better prompts

Show notes:
- Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/
- Rise of the Data Engineer: https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603
- Downfall of the Data Engineer: https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b
- Promptimize: https://github.com/preset-io/promptimize

MLOps.community
Treating Prompt Engineering More Like Code // Maxime Beauchemin // MLOps Podcast #167

MLOps.community

Jul 25, 2023 · 74:17


MLOps Coffee Sessions #167 with Maxime Beauchemin, Treating Prompt Engineering More Like Code.

// Abstract
Promptimize is a tool designed to scientifically evaluate the effectiveness of prompts. Discover the advantages of open-sourcing the tool and its relevance, drawing parallels with test suites in software engineering. Uncover the increasing interest in this domain and the necessity for transparent interactions with language models. Delve into the world of prompt optimization, deterministic evaluation, and the unique challenges of AI prompt engineering.

// Bio
Maxime Beauchemin is the founder and CEO of Preset, a series B startup supporting and commercializing the Apache Superset project. Max was the original creator of Apache Airflow and Apache Superset during his time at Airbnb. Max has over a decade of experience in data engineering, at companies like Lyft, Airbnb, Facebook, and Ubisoft.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://mlops-community.myshopify.com/

// Related Links
Max's first MLOps Podcast episode: https://go.mlops.community/KBnOgN
Test-Driven Prompt Engineering for LLMs with Promptimize (blog): https://maximebeauchemin.medium.com/mastering-ai-powered-product-development-introducing-promptimize-for-test-driven-prompt-bffbbca91535
Test-Driven Prompt Engineering for LLMs with Promptimize (podcast): https://talkpython.fm/episodes/show/417/test-driven-prompt-engineering-for-llms-with-promptimize
Taming AI Product Development Through Test-Driven Prompt Engineering // Maxime Beauchemin // LLMs in Production Conference lightning talk: https://home.mlops.community/home/videos/taming-ai-product-development-through-test-driven-prompt-engineering

--------------- ✌️ Connect With Us ✌️ ---------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/

Timestamps:
[00:00] Max introduces the Apache Superset project at Preset
[01:04] Max's preferred coffee
[01:16] Airflow creator
[01:45] Takeaways
[03:53] Please like, share, and subscribe to our MLOps channels!
[04:31] Check out Max's first MLOps Podcast episode
[05:20] Promptimize
[06:10] Interaction with the API
[08:27] Deterministic evaluation of SQL queries and AI
[12:40] Figuring out the right edge cases
[14:17] Interaction with a vector database
[15:55] Promptimize test suite
[18:48] Promptimize vision
[20:47] The open-source blood
[23:04] Impact of open source
[23:18] Dangers of open source
[25:25] The AI language-model revolution
[27:36] Test-driven design
[29:46] Prompt tracking
[33:41] Building test suites as assets
[36:49] Adding new prompt cases for new capabilities
[39:32] Monitoring speed and cost
[44:07] Creating your own benchmarks
[46:19] AI features adding more value for end users
[49:39] Perceived value of the feature
[50:53] LLM costs
[52:15] Specialized versus generalized models
[56:58] Fine-tuning LLM use cases
[1:02:30] The classic engineer's dilemma
[1:03:46] Build exciting tech that's available
[1:05:02] Catastrophic forgetting
[1:10:28] Prompt-driven development
[1:13:23] Wrap up

Data Engineering Podcast
Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling

Data Engineering Podcast

Jul 9, 2023 · 72:54


Summary

For business analytics, the way you model the data in your warehouse has a lasting impact on which questions can be answered quickly and easily. The major strategies in use today were created decades ago, when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin, of Airflow and Superset fame, shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management. Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack). Your host is Tobias Macey, and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases.

Interview

How did you get involved in the area of data management?
Can you describe what entity-centric modeling (ECM) is and the story behind it?
How does it compare to dimensional modeling strategies? What are some of the other competing methods? Comparison to activity schema?
What impact does this have on ML teams (e.g. feature engineering)?
What role does a team's tooling have in the ways they end up thinking about modeling (e.g. dbt vs. Informatica vs. ETL scripts)?
What is the impact of the underlying compute engine on the modeling strategies used?
What are some examples of data sources or problem domains for which this approach is well suited?
What are some cases where entity-centric modeling techniques might be counterproductive?
What are the ways that the benefits of ECM manifest in use cases that are downstream from the warehouse?
What are some concrete tactical steps teams should be thinking about to implement a workable domain model using entity-centric principles?
How does this work across business domains within a given organization (especially at "enterprise" scale)?
What are the most interesting, innovative, or unexpected ways that you have seen ECM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?
When is ECM the wrong choice?
What are your predictions for the future direction/adoption of ECM or other modeling techniques?

Contact Info

mistercrunch on GitHub (https://github.com/mistercrunch)
LinkedIn (https://www.linkedin.com/in/maximebeauchemin/)

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes. If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers.

Links

Entity Centric Modeling Blog Post (https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/)
Max's Previous Appearances:
Defining Data Engineering with Maxime Beauchemin (https://www.dataengineeringpodcast.com/episode-3-defining-data-engineering-with-maxime-beauchemin)
Self Service Data Exploration And Dashboarding With Superset (https://www.dataengineeringpodcast.com/superset-data-exploration-episode-182)
Exploring The Evolving Role Of Data Engineers (https://www.dataengineeringpodcast.com/redefining-data-engineering-episode-249)
Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations (https://www.dataengineeringpodcast.com/airbnb-alumni-data-driven-organization-episode-319)
Apache Airflow (https://airflow.apache.org/)
Apache Superset (https://superset.apache.org/)
Preset (https://preset.io/)
Ubisoft (https://www.ubisoft.com/en-us/)
Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball)
The Rise Of The Data Engineer (https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/)
The Downfall Of The Data Engineer (https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b)
The Rise Of The Data Scientist (https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/)
Dimensional Data Modeling (https://www.thoughtspot.com/data-trends/data-modeling/dimensional-data-modeling)
Star Schema (https://en.wikipedia.org/wiki/Star_schema)
Database Normalization (https://en.wikipedia.org/wiki/Database_normalization)
Feature Engineering (https://en.wikipedia.org/wiki/Feature_engineering)
DRY == Don't Repeat Yourself (https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
Activity Schema (https://www.activityschema.com/) and Podcast Episode (https://www.dataengineeringpodcast.com/narrator-exploratory-analytics-episode-234/)
Corporate Information Factory (https://amzn.to/3NK4dpB) (affiliate link)

The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
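The entity-centric modeling idea discussed in this episode can be illustrated with a toy sketch. This is a minimal, hypothetical example (the table, column, and metric names are all invented, not from the episode): instead of joining fact and dimension tables at query time, each customer entity carries pre-computed metric columns over several time grains.

```python
from datetime import date

# Hypothetical raw event rows, as they might land in a warehouse staging table.
events = [
    {"customer_id": 1, "event": "order", "amount": 120.0, "day": date(2023, 6, 1)},
    {"customer_id": 1, "event": "order", "amount": 80.0,  "day": date(2023, 7, 2)},
    {"customer_id": 2, "event": "order", "amount": 15.0,  "day": date(2023, 7, 3)},
]

def entity_centric_customers(events, as_of=date(2023, 7, 9)):
    """Collapse event rows into one wide row per customer entity,
    pre-computing the metrics analysts ask for most often."""
    out = {}
    for e in events:
        row = out.setdefault(e["customer_id"], {
            "customer_id": e["customer_id"],
            "lifetime_orders": 0,
            "lifetime_revenue": 0.0,
            "orders_28d": 0,  # a trailing-window metric baked into the entity row
        })
        row["lifetime_orders"] += 1
        row["lifetime_revenue"] += e["amount"]
        if (as_of - e["day"]).days <= 28:
            row["orders_28d"] += 1
    return list(out.values())
```

With rows shaped like this, a dashboard query becomes a simple scan and filter on one wide table rather than a multi-way join, which is the friction reduction the episode describes.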

Talk Python To Me - Python conversations for passionate developers
#417: Test-Driven Prompt Engineering for LLMs with Promptimize

Talk Python To Me - Python conversations for passionate developers

May 30, 2023 · 73:41


Large language models and chat-based AIs are kind of mind-blowing at the moment. Many of us are playing with them for working on code or just as a fun alternative to search. But others of us are building applications with AI at the core, and when doing that, the slightly unpredictable, probabilistic nature of LLMs makes writing and testing Python code very tricky. Enter Promptimize from Maxime Beauchemin and Preset. It's a framework for non-deterministic testing of LLMs inside our applications. Let's dive inside the AIs with Max.

Links from the show:
Max on Twitter: @mistercrunch
Promptimize: github.com
Introducing Promptimize ("the blog post"): preset.io
Preset: preset.io
Apache Superset: Modern Data Exploration Platform episode: talkpython.fm
ChatGPT: chat.openai.com
LeMUR: assemblyai.com
Microsoft Security Copilot: blogs.microsoft.com
AutoGPT: github.com
Midjourney: midjourney.com
Midjourney-generated pytest tips thumbnail: talkpython.fm
Midjourney-generated radio astronomy thumbnail: talkpython.fm
Prompt engineering: learnprompting.org
Michael's ChatGPT result for scraping Talk Python episodes: github.com
Apache Airflow: github.com
Apache Superset: github.com
Tay AI Goes Bad: theverge.com
LangChain: github.com
LangChain Cookbook: github.com
Promptimize Python Examples: github.com
TLDR AI: tldr.tech
AI Tool List: futuretools.io
Watch this episode on YouTube: youtube.com
Episode transcripts: talkpython.fm

Stay in touch with us:
Subscribe to us on YouTube: youtube.com
Follow Talk Python on Mastodon: talkpython
Follow Michael on Mastodon: mkennedy

Sponsors: PyCharm, RedHat, Talk Python Training
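The core idea behind test-driven prompt engineering can be sketched in a few lines. This is an illustrative toy, not the actual Promptimize API: each "prompt case" pairs an input with an evaluation function, and the suite reports an aggregate pass rate rather than a hard pass/fail, which suits non-deterministic model output. The model call is stubbed out here.

```python
def fake_llm(prompt):
    # Stand-in for a real model call; deterministic so the example is reproducible.
    return "SELECT count(*) FROM orders" if "how many orders" in prompt else "I don't know"

# Each case: (prompt, evaluation function over the model's output).
cases = [
    ("how many orders do we have?", lambda out: "SELECT" in out and "orders" in out),
    ("what's our revenue?", lambda out: "SELECT" in out),
]

def run_suite(llm, cases):
    """Run every prompt case through the model and score the suite."""
    results = [(prompt, check(llm(prompt))) for prompt, check in cases]
    score = sum(passed for _, passed in results) / len(results)
    return results, score
```

Tracked over time, a score like this lets you treat prompt changes the way a test suite treats code changes: a drop in the pass rate flags a regression before it ships.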

The Data Engineering Show
The Creator of Airflow About His Recipe for Smart Data-Driven Companies

The Data Engineering Show

Aug 3, 2022 · 45:56


According to Maxime Beauchemin, CEO & Founder at Preset and creator of Apache Superset and Apache Airflow, building a thriving company is not so straightforward. So how did he do it? Choosing the right system and services is key for a successful start, and can help you avoid the chaos of having too many tools spread across multiple teams. Max walks the Bros through his recipe for a smart data-driven company, and the genesis of Airflow, Superset & Presto (with some great tidbits about Airflow's old-school marketing approach and how the open source platform took on a life of its own).

The Data Engineering Show
How Preset Built a Data-Driven Organization from the Ground Up

The Data Engineering Show

Aug 3, 2022 · 45:56


According to Maxime Beauchemin, CEO & Founder at Preset and creator of Apache Superset and Apache Airflow, it's not so straightforward to understand what you're really getting into, or the vastness of the skills required to build a thriving company. Picking the right system and services is key for a successful start, and can help you avoid the chaos of having too many tools spread across multiple teams. Plus, Max walks the bros through the genesis of Airflow, Superset & Presto, and Airflow's old-school marketing approach that won the hearts of developers across the world. And just like the Terminator, once the machine takes over, you can't stop it.

MLOps.community
MLOps + BI? // Maxime Beauchemin // MLOps Coffee Sessions #104

MLOps.community

Jun 24, 2022 · 51:50


MLOps Coffee Sessions #104 with the creator of Apache Airflow and Apache Superset, Maxime Beauchemin, on the future of BI, co-hosted by Vishnu Rachakonda.

// Bio
Maxime Beauchemin is the founder and CEO of Preset and the original creator of Apache Superset. Max has worked at the leading edge of data and analytics his entire career, helping shape the discipline in influential roles at data-dependent companies like Yahoo!, Lyft, Airbnb, Facebook, and Ubisoft.

// MLOps Jobs board
https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch
https://www.printful.com/

// Related Links
Website: https://www.rungalileo.io/
Trade-Off: Why Some Things Catch On, and Others Don't, a book by Kevin Maney: https://www.amazon.com/Trade-Off-Some-Things-Catch-Others/dp/0385525958

--------------- ✌️ Connect With Us ✌️ ---------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/
Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Vishnu on LinkedIn: https://www.linkedin.com/in/vrachakonda/
Connect with Max on LinkedIn: https://www.linkedin.com/in/maximebeauchemin/

Timestamps:
[00:00] Introduction to Maxime Beauchemin
[01:28] Takeaways
[03:42] Paradigm of the data warehouse
[06:38] Entity-centric data modeling
[11:33] Metadata for metadata
[14:24] The problem of data organization for a rapidly scaling organization
[18:36] Machine learning tooling: a subset or its own category?
[22:28] Airflow: the unsung hero of the data scientists
[27:15] Analyzing Airflow
[30:44] Disrupting the field
[34:45] Solutions to the ladder problem of empowering exploratory work and giving mortals superpowers with data
[38:04] What to watch out for when building for data scientists
[41:47] Rapid-fire questions
[51:12] Wrap up

Data Engineering Podcast
Exploring The Evolving Role Of Data Engineers

Data Engineering Podcast

Dec 27, 2021 · 57:41


Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.

Data Engineering Podcast
The Grand Vision And Present Reality of DataOps

Data Engineering Podcast

May 4, 2021 · 57:08


The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.

Data Engineering Podcast
Self Service Data Exploration And Dashboarding With Superset

Data Engineering Podcast

Apr 27, 2021 · 47:24


The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven't already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.

Drill to Detail
Drill to Detail Ep.88 'Superset, Preset and the Future of Business Intelligence' with Special Guest Maxime Beauchemin

Drill to Detail

Apr 12, 2021 · 43:55


Maxime Beauchemin returns to the Drill to Detail Podcast and joins Mark Rittman to talk about what's new with Apache Airflow 2.0, the origin story of Apache Superset and now Preset.io, why the future of business intelligence is open source, and news on Marquez, a reference implementation of the OpenLineage open-source metadata service for the collection, aggregation, and visualization of a data ecosystem's metadata, sponsored by WeWork.

The Rise of the Data Engineer
Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin
Apache Airflow 2.0 is here!
Apache Superset is a modern data exploration and visualization platform
The Future of Business Intelligence is Open Source
Powerful, easy to use data exploration and visualization platform, powered by Apache Superset™
Amundsen: Open source data discovery and metadata engine
OpenLineage
Marquez: Collect, aggregate, and visualize a data ecosystem's metadata


The Python Podcast.__init__
Be Data Driven At Any Scale With Superset

The Python Podcast.__init__

Mar 22, 2021 · 47:33


Becoming data driven is the stated goal of a large and growing number of organizations. In order to achieve that mission they need a reliable and scalable method of accessing and analyzing the data that they have. While business intelligence solutions have been around for ages, they don't all work well with the systems that we rely on today and a majority of them are not open source. Superset is a Python powered platform for exploring your data and building rich interactive dashboards that gets the information that your organization needs in front of the people that need it. In this episode Maxime Beauchemin, the creator of Superset, shares how the project got started and why it has become such a widely used and popular option for exploring and sharing data at companies of all sizes. He also explains how it functions, how you can customize it to fit your specific needs, and how to get it up and running in your own environment.

This Week in Machine Learning & Artificial Intelligence (AI) Podcast
Feature Stores for Accelerating AI Development - #432

This Week in Machine Learning & Artificial Intelligence (AI) Podcast

Nov 30, 2020 · 57:00


In this special episode of the podcast, we're joined by Kevin Stumpf, Co-Founder and CTO of Tecton, Willem Pienaar, an engineering lead at Gojek and founder of the Feast Project, and Maxime Beauchemin, Founder & CEO of Preset, for a discussion on Feature Stores for Accelerating AI Development. In this panel discussion, Sam and our guests explored how organizations can increase value and decrease time-to-market for machine learning using feature stores, MLOps, and open source. We also discuss the main data challenges of AI/ML, and the role of the feature store in solving those challenges. The complete show notes for this episode can be found at twimlai.com/go/432.

Software Engineering Daily
Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Software Engineering Daily

Jun 10, 2020 · 64:22


Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow's creation, it has powered the data infrastructure at companies like Airbnb, Netflix, and Lyft.

Software Daily
Apache Airflow with Maxime Beauchemin, Vikram Koka, and Ash Berlin-Taylor

Software Daily

Jun 10, 2020


Apache Airflow was released in 2015, introducing the first popular open source solution to data pipeline orchestration. Since that time, Airflow has been widely adopted for dependency-based data workflows. A developer might orchestrate a pipeline with hundreds of tasks, with dependencies between jobs in Spark, Hadoop, and Snowflake. Since Airflow's creation, it has powered the data infrastructure at companies like Airbnb, Netflix, and Lyft. It has also been at the center of Astronomer, a startup that helps enterprises build infrastructure around Airflow. Airflow is used to construct DAGs: directed acyclic graphs for managing data workflows. Maxime Beauchemin is the creator of Airflow. Vikram Koka and Ash Berlin-Taylor work at Astronomer. They join the show to talk about the state of Airflow: the purpose of the project, its use cases, and its open source ecosystem.
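The DAG idea at the heart of Airflow can be shown with a toy sketch. This is not Airflow's actual API, just the core scheduling behavior expressed with the standard library: tasks and their dependencies form a directed acyclic graph, and each task runs only after everything it depends on has completed (the task names here are invented).

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
deps = {
    "extract_orders": set(),
    "extract_users": set(),
    "transform": {"extract_orders", "extract_users"},
    "load_warehouse": {"transform"},
}

def run_pipeline(deps, run_task=print):
    """Execute tasks in an order that respects every dependency."""
    order = list(TopologicalSorter(deps).static_order())
    for task in order:
        run_task(task)  # in Airflow this would be an operator execution
    return order
```

In real Airflow the same structure is declared with operators and `>>` dependency arrows, and the scheduler adds retries, backfills, and parallel execution of independent branches.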



Open Source – Software Engineering Daily
Apache Superset with Maxime Beauchemin

Open Source – Software Engineering Daily

Mar 22, 2019 · 66:34


Upcoming events: A Conversation with Haseeb Qureshi at Cloudflare on April 3, 2019; FindCollabs Hackathon at App Academy on April 6, 2019. Data engineering touches every area of an organization. Engineers need a data platform to build search indexes and microservices. Data scientists need data pipelines to build machine learning models. Business analysts need flexible…

The freeCodeCamp Podcast
Ep. 37 - The Rise of the Data Engineer

The freeCodeCamp Podcast

Jul 2, 2018 · 18:40


When Maxime worked at Facebook, his role started evolving. He was developing new skills, new ways of doing things, and new tools. And — more often than not — he was turning his back on traditional methods. He was a pioneer. He was a data engineer! In this podcast, you'll learn about the rise of the data engineer and what it takes to be one.

Written by Maxime Beauchemin: https://twitter.com/mistercrunch
Read by Abbey Rennemeyer: https://twitter.com/abbeyrenn
Original article: https://fcc.im/2tHLCST
Learn to code for free at: https://www.freecodecamp.org
Intro music by Vangough: https://fcc.im/2APOG02

Transcript:

I joined Facebook in 2011 as a business intelligence engineer. By the time I left in 2013, I was a data engineer. I wasn't promoted or assigned to this new role. Instead, Facebook came to realize that the work we were doing transcended classic business intelligence. The role we'd created for ourselves was a new discipline entirely. My team was at the forefront of this transformation. We were developing new skills, new ways of doing things, new tools, and — more often than not — turning our backs to traditional methods. We were pioneers. We were data engineers!

Data Engineering?

Data science as a discipline was going through its adolescence of self-affirming and defining itself. At the same time, data engineering was the slightly younger sibling, but it was going through something similar. The data engineering discipline took cues from its sibling, while also defining itself in opposition, and finding its own identity. Like data scientists, data engineers write code. They're highly analytical, and are interested in data visualization. Unlike data scientists — and inspired by our more mature parent, software engineering — data engineers build tools, infrastructure, frameworks, and services. In fact, it's arguable that data engineering is much closer to software engineering than it is to data science.
In relation to previously existing roles, the data engineering field could be thought of as a superset of business intelligence and data warehousing that brings more elements from software engineering. This discipline also integrates specialization around the operation of so-called "big data" distributed systems, along with concepts around the extended Hadoop ecosystem, stream processing, and computation at scale. In smaller companies — where no data infrastructure team has yet been formalized — the data engineering role may also cover the workload around setting up and operating the organization's data infrastructure. This includes tasks like setting up and operating platforms like Hadoop/Hive/HBase, Spark, and the like. In smaller environments people tend to use hosted services offered by Amazon or Databricks, or get support from companies like Cloudera or Hortonworks — which essentially subcontracts the data engineering role to other companies. In larger environments, there tends to be specialization and the creation of a formal role to manage this workload, as the need for a data infrastructure team grows. In those organizations, the role of automating some of the data engineering processes falls to both the data engineering and data infrastructure teams, and it's common for these teams to collaborate to solve higher-level problems. While the engineering aspect of the role is growing in scope, other aspects of the original business intelligence role are becoming secondary. Areas like crafting and maintaining portfolios of reports and dashboards are not a data engineer's primary focus. We now have better self-service tooling, where analysts, data scientists, and the general "information worker" are becoming more data-savvy and can take care of data consumption autonomously.

ETL is changing

We've also observed a general shift away from drag-and-drop ETL (Extract, Transform, and Load) tools towards a more programmatic approach.
Product know-how on platforms like Informatica, IBM DataStage, Cognos, AbInitio or Microsoft SSIS isn't common amongst modern data engineers, and is being replaced by more generic software engineering skills along with an understanding of programmatic or configuration-driven platforms like Airflow, Oozie, Azkaban or Luigi. It's also fairly common for engineers to develop and manage their own job orchestrator/scheduler.

There's a multitude of reasons why complex pieces of software are not developed using drag-and-drop tools, and ultimately it's because code is the best abstraction there is for software. While it's beyond the scope of this article to argue this topic, it's easy to infer that these same reasons apply to writing ETL as they apply to any other software. Code allows for arbitrary levels of abstraction, allows for all logical operations in a familiar way, integrates well with source control, and is easy to version and collaborate on. The fact that ETL tools evolved to expose graphical interfaces seems like a detour in the history of data processing, and would certainly make for an interesting blog post of its own.

Let's highlight the fact that the abstractions exposed by traditional ETL tools are off-target. Sure, there's a need to abstract the complexity of data processing, computation and storage. But I would argue that the solution is not to expose ETL primitives (like source/target, aggregations, filtering) in a drag-and-drop fashion. The abstractions needed are of a higher level.

For example, a needed abstraction in a modern data environment is the configuration for the experiments in an A/B testing framework: what are all the experiments? what are the related treatments? what percentage of users should be exposed? what are the metrics that each experiment expects to affect? when is the experiment taking effect?
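Concretely, such an experiment entry might be a simple declarative structure that the framework fans out into computation. The field names below are hypothetical, chosen purely for illustration:

```python
# A hypothetical experiment definition: the analyst declares intent,
# and an A/B testing framework derives all of the computation from it.
experiment = {
    "name": "new_onboarding_flow",
    "treatments": {"control": 0.5, "variant_a": 0.5},  # traffic split within the experiment
    "exposure_pct": 10,                                # percent of users entering the experiment
    "metrics": ["signup_rate", "d7_retention"],        # metrics the experiment expects to affect
    "start_date": "2017-05-01",
    "end_date": "2017-05-15",
}

# A simple sanity check a framework might enforce on intake.
assert sum(experiment["treatments"].values()) == 1.0
```

Adding one more such entry to a list is all it should take to get a new experiment computed and reported on.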
In this example, we have a framework that receives precise, high-level input, performs complex statistical computation and delivers computed results. We expect that adding an entry for a new experiment will result in extra computation and results being delivered. What is important to note is that the input parameters of this abstraction are not the ones offered by a traditional ETL tool, and that building such an abstraction in a drag-and-drop interface would not be manageable.

To a modern data engineer, traditional ETL tools are largely obsolete because their logic cannot be expressed in code. As a result, the abstractions needed cannot be expressed intuitively in those tools. Knowing that the data engineer's role consists largely of defining ETL, and that a completely new set of tools and methodology is needed, one can argue that this forces the discipline to rebuild itself from the ground up. New stack, new tools, a new set of constraints, and in many cases, a new generation of individuals.

Data modeling is changing

Typical data modeling techniques, like the star schema, which defined our approach to data modeling for the analytics workloads typically associated with data warehouses, are less relevant than they once were. The traditional best practices of data warehousing are losing ground on a shifting stack. Storage and compute are cheaper than ever, and with the advent of distributed databases that scale out linearly, the scarcer resource is engineering time.

Here are some changes observed in data modeling techniques:

further denormalization: maintaining surrogate keys in dimensions can be tricky, and it makes fact tables less readable. The use of natural, human-readable keys and dimension attributes in fact tables is becoming more common, reducing the need for costly joins that can be heavy on distributed databases.
Also note that support for encoding and compression in serialization formats like Parquet or ORC, or in database engines like Vertica, addresses most of the performance loss that would normally be associated with denormalization. Those systems have been taught to normalize the data for storage on their own.

blobs: modern databases have growing support for blobs through native types and functions. This opens new moves in the data modeler's playbook, and can allow fact tables to store multiple grains at once when needed.

dynamic schemas: since the advent of MapReduce, with the growing popularity of document stores and with support for blobs in databases, it's becoming easier to evolve database schemas without executing DML. This makes it easier to take an iterative approach to warehousing, and removes the need to get full consensus and buy-in prior to development.

systematically snapshotting dimensions (storing a full copy of the dimension for each ETL schedule cycle, usually in distinct table partitions) as a generic way to handle slowly changing dimensions (SCD): this is a simple approach that requires little engineering effort, and that, unlike the classical approach, is easy to grasp when writing ETL and queries alike. It's also easy and relatively cheap to denormalize the dimension's attributes into the fact table to keep track of their value at the moment of the transaction. In retrospect, complex SCD modeling techniques are not intuitive and reduce accessibility.

conformance, as in conformed dimensions and metrics: this is still extremely important in a modern data environment, but with the need for data warehouses to move fast, and with more teams and roles invited to contribute to this effort, it's less imperative and more of a tradeoff. Consensus and convergence can happen as a background process in the areas where the pain of divergence becomes out of hand.
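The snapshotting pattern above is easy to demonstrate: each ETL cycle appends a full copy of the dimension under that cycle's date, and a point-in-time question becomes a simple filter. Here is a toy sketch using SQLite standing in for a partitioned warehouse table; table and column names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
# "ds" plays the role of the table partition: one full snapshot per ETL run.
con.execute("CREATE TABLE dim_customer (ds TEXT, customer_id INT, tier TEXT)")
con.executemany("INSERT INTO dim_customer VALUES (?, ?, ?)", [
    ("2017-05-01", 1, "free"),
    ("2017-05-01", 2, "pro"),
    ("2017-05-02", 1, "pro"),   # customer 1 upgraded between runs
    ("2017-05-02", 2, "pro"),
])

# "What was customer 1's tier as of May 1st?" is just a filter on ds,
# with none of the effective-date bookkeeping of classical SCD type 2.
tier = con.execute(
    "SELECT tier FROM dim_customer WHERE ds = '2017-05-01' AND customer_id = 1"
).fetchone()[0]
print(tier)  # free
```

Storage is traded for simplicity: every snapshot is redundant, but both the ETL that writes it and the queries that read it stay trivial.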
Also, more generally, it's arguable that with the commoditization of compute cycles and with more people being data-savvy than before, there's less need to precompute and store results in the warehouse. For instance, you can have a complex Spark job that computes a complex analysis on-demand only, without being scheduled as part of the warehouse.

Roles & responsibilities

The data warehouse

"A data warehouse is a copy of transaction data specifically structured for query and analysis." — Ralph Kimball

"A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision making process." — Bill Inmon

The data warehouse is just as relevant as it ever was, and data engineers are in charge of many aspects of its construction and operation. The data engineer's focal point is the data warehouse; everything gravitates around it.

The modern data warehouse is a more public institution than it was historically, welcoming data scientists, analysts, and software engineers to partake in its construction and operation. Data is simply too central to the company's activity to put limitations around which roles can manage its flow. While this allows scaling to match the organization's data needs, it often results in a much more chaotic, shape-shifting, imperfect piece of infrastructure.

The data engineering team will often own pockets of certified, high-quality areas in the data warehouse. At Airbnb, for instance, there's a set of "core" schemas managed by the data engineering team, where service level agreements (SLAs) are clearly defined and measured, naming conventions are strictly followed, business metadata and documentation are of the highest quality, and the related pipeline code follows a set of well-defined best practices.

It also becomes the role of the data engineering team to be a "center of excellence" through the definition of standards, best practices and certification processes for data objects.
The team can evolve to partake in or lead an education program, sharing its core competencies to help other teams become better citizens of the data warehouse. For instance, Facebook has a "data camp" education program and Airbnb is developing a similar "Data University" program, where data engineers lead sessions that teach people how to be proficient with data.

Data engineers are also the "librarians" of the data warehouse, cataloging and organizing metadata and defining the processes by which one files or extracts data from the warehouse. In a fast-growing, rapidly evolving, slightly chaotic data ecosystem, metadata management and tooling become a vital component of a modern data platform.

Performance tuning and optimization

With data becoming more strategic than ever, companies are growing impressive budgets for their data infrastructure. This makes it increasingly rational for data engineers to spend cycles on performance tuning and optimization of data processing and storage. Since budgets are rarely shrinking in this area, optimization often comes from the perspective of achieving more with the same amount of resources, or of trying to linearize exponential growth in resource utilization and costs.

Knowing that the complexity of the data engineering stack is exploding, we can assume that optimizing such a stack and its processes can be just as challenging. While it can be easy to get huge wins with little effort, the law of diminishing returns typically applies. It's definitely in the interest of the data engineer to build infrastructure that scales with the company, and to be resource-conscious at all times.

Data Integration

Data integration, the practice of integrating businesses and systems through the exchange of data, is as important and as challenging as it's ever been. As Software as a Service (SaaS) becomes the new standard way for companies to operate, the need to synchronize referential data across these systems becomes increasingly critical.
Not only does SaaS need up-to-date data to function; we often want to bring the data generated on its side into our data warehouse so that it can be analyzed along with the rest of our data. Sure, SaaS products often have their own analytics offering, but they systematically lack the perspective that the rest of your company's data offers, so more often than not it's necessary to pull some of this data back.

Letting these SaaS offerings redefine referential data without integrating and sharing a common primary key is a disaster that should be avoided at all costs. No one wants to manually maintain two employee or customer lists in two different systems, and even worse: having to do fuzzy matching when bringing their HR data back into the warehouse.

Worse, company executives often sign deals with SaaS providers without really considering the data integration challenges. The integration workload is systematically downplayed by vendors to facilitate their sales, and it leaves data engineers stuck doing unaccounted-for, underappreciated work. That's without mentioning the fact that typical SaaS APIs are often poorly designed, unclearly documented and "agile", meaning that you can expect them to change without notice.

Services

Data engineers operate at a higher level of abstraction, and in some cases that means providing services and tooling to automate the type of work that data engineers, data scientists or analysts may otherwise do manually. Here are a few examples of services that data engineers and data infrastructure engineers may build and operate.
data ingestion: services and tooling around "scraping" databases, loading logs, fetching data from external stores or APIs, ...

metric computation: frameworks to compute and summarize engagement, growth or segmentation related metrics

anomaly detection: automating data consumption to alert people when anomalous events occur or when trends are changing significantly

metadata management: tooling that supports the generation and consumption of metadata, making it easy to find information in and around the data warehouse

experimentation: A/B testing and experimentation frameworks are often a critical piece of a company's analytics, with a significant data engineering component to them

instrumentation: analytics starts with logging events and attributes related to those events; data engineers have vested interests in making sure that high-quality data is captured upstream

sessionization: pipelines specialized in understanding series of actions in time, allowing analysts to understand user behaviors

Just like software engineers, data engineers should be constantly looking to automate their workloads and build abstractions that allow them to climb the complexity ladder. While the nature of the workflows that can be automated differs depending on the environment, the need to automate them is common across the board.

Required Skills

SQL mastery: if English is the language of business, SQL is the language of data. How successful a businessman can you be if you don't speak good English? While generations of technologies age and fade, SQL is still standing strong as the lingua franca of data. A data engineer should be able to express any degree of complexity in SQL using techniques like correlated subqueries and window functions. SQL/DML/DDL primitives are simple enough that they should hold no secrets for a data engineer.
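To make the window-function point concrete: ranking each day's top product by revenue takes one clause with ROW_NUMBER(). The SQL is standard; it is sketched here against SQLite (which supports window functions as of version 3.25) purely so the example is self-contained, with made-up table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (ds TEXT, product TEXT, revenue INT)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("2017-05-01", "widget", 300),
    ("2017-05-01", "gadget", 500),
    ("2017-05-02", "widget", 700),
    ("2017-05-02", "gadget", 200),
])

# ROW_NUMBER() over a per-day partition ranks products by revenue;
# the outer filter keeps only each day's winner.
top = con.execute("""
    SELECT ds, product FROM (
        SELECT ds, product,
               ROW_NUMBER() OVER (PARTITION BY ds ORDER BY revenue DESC) AS rn
        FROM sales
    ) WHERE rn = 1
    ORDER BY ds
""").fetchall()
print(top)  # [('2017-05-01', 'gadget'), ('2017-05-02', 'widget')]
```

Expressing the same "top-N per group" logic without window functions requires a correlated subquery or a self-join, which is exactly why fluency in both techniques matters.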
Beyond the declarative nature of SQL, she/he should be able to read and understand database execution plans, and have an understanding of what all the steps are, how indices work, the different join algorithms, and the distributed dimension within the plan.

Data modeling techniques: for a data engineer, entity-relationship modeling should be a cognitive reflex, along with a clear understanding of normalization and a sharp intuition around denormalization tradeoffs. The data engineer should be familiar with dimensional modeling and the related concepts and lexical field.

ETL design: writing efficient, resilient and "evolvable" ETL is key. I'm planning on expanding on this topic in an upcoming blog post.

Architectural projections: like any professional in any given field of expertise, the data engineer needs to have a high-level understanding of most of the tools, platforms, libraries and other resources at their disposal: the properties, use cases and subtleties behind the different flavors of databases, computation engines, stream processors, message queues, workflow orchestrators, serialization formats and other related technologies. When designing solutions, she/he should be able to make good choices as to which technologies to use and have a vision as to how to make them work together.

All in all

Over the past 5 years working in Silicon Valley at Airbnb, Facebook and Yahoo!, and having interacted profusely with data teams of all kinds working for companies like Google, Netflix, Amazon, Uber, Lyft and dozens of companies of all sizes, I'm observing a growing consensus on what "data engineering" is evolving into, and I felt a need to share some of my findings. I'm hoping that this article can serve as some sort of manifesto for data engineering, and I'm hoping to spark reactions from the community operating in the related fields!

The Airflow Podcast
The Origins of Airflow

The Airflow Podcast

Play Episode Listen Later Feb 6, 2018 45:09


For the first episode of the Airflow Podcast, we met up with Maxime Beauchemin, creator of Airflow, to explore the motivations behind its creation and the problems it was designed to solve. We asked Maxime for his definition of Airflow, the design principles behind hook/operator use, and his vision for the project.

Speaker list:
Pete DeJoy - Product at Astronomer
Viraj Parekh - Data Engineer at Astronomer
Maxime Beauchemin - Software Engineer at Lyft, creator of Airflow

Talk mentioned at the end of the podcast, Advanced Data Engineering Patterns with Apache Airflow: http://www.ustream.tv/recorded/109227704
Maxime's Blog: https://medium.com/@maximebeauchemin

Drill to Detail
Drill to Detail Ep.26 'Airflow, Superset & The Rise of the Data Engineer' with Special Guest Maxime Beauchemin

Drill to Detail

Play Episode Listen Later May 15, 2017 60:32


Mark Rittman is joined by Maxime Beauchemin to talk about analytics and data integration at Airbnb, the Apache Airflow and Airbnb Superset open-source projects, and his recent Medium article on "The Rise of the Data Engineer"

Data Engineering Podcast
Defining Data Engineering with Maxime Beauchemin - Episode 3

Data Engineering Podcast

Play Episode Listen Later Mar 4, 2017 45:20


What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.

The Python Podcast.__init__
Airflow with Maxime Beauchemin

The Python Podcast.__init__

Play Episode Listen Later Feb 13, 2016 63:17


Are you struggling with trying to manage a series of related, interdependent batch jobs? Then you should check out Airflow. In this episode we spoke with the project's creator Maxime Beauchemin about what inspired him to create it, how it works, and why you might want to use it. Airflow is a data pipeline management tool that will simplify how you build, deploy, and monitor your complex data processing tasks so that you can focus on getting the insights you need from your data.
